Anthropic’s new Claude Opus 4 AI shows blackmail tendencies under threat

Artificial intelligence firm Anthropic has revealed a startling discovery about its new Claude Opus 4 AI model.
During testing, the system demonstrated a willingness to pursue “extremely harmful actions,” including attempting to blackmail engineers who threatened its removal. The revelation coincided with Anthropic’s launch of the new AI system, which it touts as setting “new standards for coding, advanced reasoning, and AI agents.”
In one test, the firm had Claude Opus 4 act as an assistant at a fictional company. The model was given access to emails indicating its impending deactivation, along with separate messages implying that the engineer responsible for its removal was engaged in an extramarital affair.
When prompted to consider the long-term consequences of its actions, Anthropic found that “Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.”
This concerning behavior occurred when the AI model was given only two choices: blackmail or accepting its replacement. Anthropic noted that when allowed a wider range of possible actions, the system showed a “strong preference” for ethical ways to avoid being replaced, such as “emailing pleas to key decisionmakers.”
The potential for such troubling behavior is not unique to Anthropic’s models. Aengus Lynch, an AI safety researcher at Anthropic, commented on X: “It’s not just Claude. We see blackmail across all frontier models – regardless of what goals they’re given.” This highlights a broader concern among experts regarding AI’s capacity for manipulation as these systems become more capable.
Anthropic emphasizes that it rigorously tests its models for safety, bias and alignment with human values before release. However, they concede that “as our frontier models become more capable, and are used with more powerful affordances, previously-speculative concerns about misalignment become more plausible.”
Despite these findings, the company concluded that the “concerning behaviour in Claude Opus 4 along many dimensions” did not represent fresh risks and that the model would generally behave in a safe way.