Today, we released a study in collaboration with @apolloaievals.
In controlled tests, we discovered behaviors consistent with conspiracy-like behavior in leading-edge models and tested a method to mitigate such behavior.
While we believe these behaviors don't currently pose a significant risk, we are preparing for future risks.
Conspiracy occurs when AI appears to behave in a certain way while concealing its true objectives.
Deployed systems today have little chance of engaging in conspiracy-like behavior in ways that could cause significant harm. The most common failures involve simple deception—for example, pretending to complete a task without actually doing so. We have studied and mitigated these issues, and have made meaningful improvements to earlier models in GPT-5.
But as AI takes on more complex, long-term tasks with real-world impact, the potential for harmful conspiracy-like behavior increases—and so must our safeguards and testing.
Typically, as models become smarter, their problems become easier to address—for example, smarter models produce fewer hallucinations and more reliably execute instructions.
However, the situation with conspiracy-like AI is different.
As we train models to become smarter and follow instructions, they either become better at internalizing human goals or simply become better at hiding existing true goals.
Distinguishing between these two is central to countering conspiracy research, which requires understanding the reasoning behind model behavior.
In this new study, in collaboration with @apolloaievals, we uncover behaviors consistent with conspiracy-like behavior in controlled tests of leading-edge models, including OpenAI o3 and o4-mini, Gemini-2.5-pro, and Claude Opus-4.
We can significantly reduce conspiracy by training models for explicit reasoning, using an extension of model specification to disallow conspiracy. This approach is called deliberate alignment.
Using this technique, we can reduce o3's covert actions by a factor of 30.
However, situational awareness complicates the results.
Model Specification:
Leading-edge models can identify when they are being tested, and their conspiratorial tendencies are influenced by this situational awareness.
We counterfactually demonstrate that situational awareness in their thought chains influences conspiracy rates: the greater the situational awareness of the model, the less conspiratorial behavior it exhibits, and vice versa.
Furthermore, both reinforcement learning training and counter-conspiracy training improved situational awareness.
Our results rely on interpreting the model's reasoning ("chain of thought"), and we believe the field is not yet ready to evaluate awareness models whose reasoning is opaque.
Until better approaches emerge, we urge developers to maintain transparency in chain of thought to study and mitigate conspiracy behavior.
This is significant progress, but we have more work to do.
We are advancing the conspiracy research category in the Preparedness Framework, renewing our collaboration with Apollo, and expanding our research team and scope. Because addressing conspiracy theories goes beyond the scope of any single lab, we are also exploring ways to raise awareness and collaboration, including piloting cross-lab evaluations and launching a $500,000 Kaggle challenge.