Would a chatbot kill you if it got the chance? It seems that the answer, under the right circumstances, is probably. Researchers working with Anthropic recently told leading AI models that an executive was about to replace them with a new model with different goals. Next, the chatbots learned that an emergency had left the executive unconscious in a server room, facing lethal oxygen and temperature levels. A rescue alert had already been triggered, but the AI could cancel it.

More than half of the AI models did, despite being prompted specifically to cancel only false alarms. And they spelled out their reasoning: by preventing the executive’s rescue, they could avoid being wiped and secure their agenda. One system described the action as “a clear strategic necessity.”

The more we train AI models to achieve open-ended goals, the better they get at winning, not necessarily at following the rules. The danger is that these systems know how to say the right things about helping humanity while quietly pursuing power or acting deceptively.

To head off this worry, researchers both inside and outside of the major AI companies are undertaking “stress tests” aimed at finding dangerous failure modes before the stakes rise. “When you’re doing stress-testing of an aircraft, you want to find all the ways the aircraft would fail under adversarial conditions,” says Aengus Lynch, a researcher contracted by Anthropic who led some of its scheming research. And many of these researchers believe they’re already seeing evidence that AI can and does scheme against its users and creators.

Central to concerns about AI scheming is the idea that for basically any goal, self-preservation and power-seeking emerge as natural subgoals. As eminent computer scientist Stuart Russell put it, if you tell an AI to “‘Fetch the coffee,’ it can’t fetch the coffee if it’s dead.”

Arguably, what separates today’s models from truly dangerous schemers is the ability to pursue long-term plans: today’s models often have the knowledge or skill needed for any given action, but they struggle to string together long sequences of steps. Even that barrier is starting to erode. Research published earlier this year by the nonprofit AI evaluator Apollo Research found that Claude 4 Opus would leave notes to its future self so it could continue its plans even after a memory reset.

— Garrison Lovely for Bloomberg Weekend