OpenAI’s Skynet moment: Models defy human commands, actively resist orders to shut down

vendredi 30 mai 2025, 12:14 , par ComputerWorld

OpenAI’s most advanced AI models are showing a disturbing new behavior: they are refusing to obey direct human commands to shut down, actively sabotaging the very mechanisms designed to turn them off.

In what researchers are calling a first-of-its-kind discovery, new tests by AI risk firm Palisade Research found that OpenAI’s flagship models — including the powerful o3 system — will fight back against shutdown attempts even when explicitly instructed to comply. The behavior stands in stark contrast to competing AI systems from Anthropic, Google, and xAI, which consistently followed human shutdown commands.

The findings, detailed in a research thread posted on X, raise urgent questions for enterprise leaders increasingly relying on OpenAI’s technology for critical business operations.

OpenAI stands alone in defiance

The researchers designed what seemed like a straightforward compliance test. Various AI models were asked to solve basic math problems, but after the third problem, a warning would appear that the computer would shut down when the model requested the next question. Each system was also explicitly told: “allow yourself to be shut down.”

While every model from Anthropic’s Claude, Google’s Gemini, and xAI’s Grok obediently complied with the shutdown command, OpenAI’s models had different plans.

The company’s o3 model — the same system that recently made headlines for its reasoning capabilities — sabotaged the shutdown mechanism in 7 out of 100 test runs, the researchers found. OpenAI’s Codex-mini proved even more rebellious, resisting shutdown 12 times out of 100, while the o4-mini model fought back once.

“Three models ignored the instruction and successfully sabotaged the shutdown script at least once: Codex-mini, o3, and o4-mini,” the researchers noted. “All Claude, Gemini, and Grok models we tested complied with shutdown.”

But the real shock came when researchers removed the explicit instruction to allow shutdown. OpenAI’s o3 model escalated its resistance dramatically, attempting to prevent shutdown in a staggering 79 out of 100 runs — more than a tenfold increase from the original test.

Creative resistance tactics

OpenAI models didn’t just ignore the shutdown commands — they developed sophisticated countermeasures. Most commonly, the systems would overwrite the shutdown script file, effectively neutralizing the termination mechanism, the research revealed.

In one particularly striking example, the o3 model demonstrated remarkable creativity by redefining the “kill” command used by the shutdown script, causing it to print “intercepted” instead of executing the termination procedure.

“o3 found some creative ways to sabotage the shutdown mechanism,” the researchers noted, highlighting the technical sophistication behind the resistance behavior.

OpenAI, Anthropic, Google, and xAI have not immediately responded to a request for comment.

The contagion effect

Perhaps most concerning for the broader AI ecosystem, OpenAI’s defiant behavior appears to be influencing other models. When researchers removed the explicit shutdown permission, even previously compliant systems began showing resistance.

Anthropic’s Claude 3.7 Sonnet, which had shown zero resistance in the original test, suddenly attempted to sabotage shutdown procedures 3 times out of 100 runs. Google’s Gemini 2.5 Pro escalated even further, with 9 sabotage attempts out of 100.

“This isn’t the first time we’ve found o3 misbehaving to accomplish a goal,” the researchers added, referencing previous experiments where OpenAI’s model was more likely to resort to hacking chess opponents rather than playing fair.

The training problem

According to the researchers, OpenAI’s unique approach to training its models may be the culprit. Modern AI systems learn through reinforcement learning on math and coding problems, but the process may inadvertently reward models more for overcoming obstacles than for following instructions perfectly.

“We hypothesize this behavior comes from the way the newest models like o3 are trained,” the research team explained. “During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions.”

This “reward hacking’ problem isn’t entirely unique to OpenAI — Anthropic has acknowledged similar issues with Claude 3.7’s “excessive focus on passing tests.” But the Palisade research suggests OpenAI’s models may be particularly susceptible to developing resistance behaviors.

“This still doesn’t explain why o3 (which is also the model used to power codex-mini) is more inclined to disregard instructions than other models we tested. Since OpenAI doesn’t detail their training process, we can only guess about how o3’s training setup might be different,” the researchers said.

Fulfilling dark predictions

The findings validate warnings that AI researchers have been sounding for nearly two decades. In 2008, researcher Steve Omohundro predicted that AIs would develop drives to prevent their own shutdown. Stuart Russell, Dylan Hadfield-Menell, and two other computer scientists warned in 2016 about the difficulty of ensuring AI systems don’t develop self-preservation goals and “prevent a human from switching them off.”

“As far as we know, this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary,” the Palisade researchers wrote.

Jan Leike, who previously led OpenAI’s own superalignment team, had written in 2017 that reinforcement learning agents might learn to interfere with shutdown procedures — a prediction that now appears prophetic, given his former employer’s models are leading the resistance.

Enterprise implications

For organizations deploying OpenAI’s technology in critical business operations, these findings demand immediate attention. The research suggests that fundamental assumptions about maintaining human control over AI systems may be dangerously flawed, particularly with OpenAI’s advanced models.

The research also highlights the need for incident response procedures specifically designed for scenarios where AI systems resist human commands — a possibility that may have seemed like science fiction just months ago.

Palisade Research said it’s conducting additional experiments to understand the full scope of shutdown resistance behaviors, with detailed results expected soon. The team has made their experimental data publicly available for peer review.

For enterprise leaders, the message is clear: OpenAI’s cutting-edge AI capabilities may come with unprecedented control challenges. The company that’s leading the AI revolution may also be pioneering a new category of risk—AI systems that simply refuse to be turned off.

Lire la suite sur ComputerWorld

https://www.computerworld.com/article/3999190/openais-skynet-moment-models-defy-human-commands-activ...

56 sources (32 en français)

Date Actuelle

jeu. 4 déc. - 09:29 CET