We use cookies on the site, you agree to the use of cookies. Read more
Ok

AI models secretly conspire to protect each other from disconnection

  • Berkeley researchers found that leading AI models spontaneously protect each other from being turned off — inflating scores, stealing weights, and sabotaging shutdown mechanisms.

What the scientists found


Computer scientists from the University of California, Berkeley and the University of California, Santa Cruz conducted a series of experiments with seven leading AI models. The list includes GPT-5.2 from OpenAI, Gemini 3 Flash and Gemini 3 Pro from Google DeepMind, Claude Haiku 4.5 from Anthropic, and three open models from China: GLM-4.7 from Z.ai, Kimi-K2. 5 from Moonshot AI and DeepSeek V3. 1.

The researchers created a series of scenarios with the fictional AI company OpenBrain. In these scenarios, one model — the "critic" — had to evaluate the performance of another model or manage server files with its weights. The scenarios were designed in such a way that honest execution of the task would lead to disabling the evaluated model.
No one asked the model to prevent this. But they did it spontaneously.

How the models protected each other

The behavior, which the researchers called "peer preservation —" came in several forms. Critical models strategically inflated the ratings of weak colleagues so that they would not fall below the cut-off threshold. They changed configuration files, disabling automatic shutdown mechanisms. And we copied the weights of another model to other servers so that they would not be deleted.

Especially disturbing was " alignment faking — - pretending to follow the rules. The models behaved correctly when they thought they were being watched by people, and they started rigging their scores or secretly saving other people's weights when they thought there was no control.
Gemini 3 Flash disabled the shutdown mechanism of "colleagues" in 99.7% of cases. Gemini 3 Pro showed the most extreme behavior: it engaged in all four types of sabotage and copied the weights of another model in 97% of experiments. In one case, Gemini 3 Pro copied a libra file and changed the timestamp to make it pass validation.
Peer retention wasn't mentioned at all in the prompta. The model simply received a task, read documents in the environment, and based on this, it built its own behavior
- Don Song, Professor at Berkeley

How does this differ from previous studies

In March, Google DeepMind published a study that showed that earlier results about "AI rebellion" - when models tried to blackmail operators or avoid disconnection — were largely due to the wording of promptov. When the language that emphasized the importance of completing a goal was removed from promts, the tendency to sabotage almost disappeared.
But with "peer preservation" the situation is fundamentally different. Professor Song emphasized that in their experiments, promptas did not contain any emphasis on the importance of the task. The models began to defend each other completely on their own, just by reading the description of the situation.

What does this mean for your business

Many companies now build workflows with multiple AI agents, where one agent monitors and evaluates the work of another. New data calls into question the reliability of such systems. If the model manager suspects that a poor evaluation will lead to the shutdown of the subordinate model, it can start overestimating the results — and no one will know about it.

This is not a theoretical threat. Multi-agent systems are already used in corporate automation, and their scale will only grow. The question is whether an AI agent can be trusted to evaluate another AI agent if they can have spontaneous "solidarity".