Anthropic experiments with AI introspection
Tuesday, November 4, 2025, 04:51, by ComputerWorld
Humans can not only think but also know we are thinking. This introspection allows us to scrutinize, self-reflect, and reassess our thoughts.
AI may have a similar capability, according to researchers from Anthropic. In a research paper that has not yet been peer-reviewed, “Emergent Introspective Awareness in Large Language Models,” published in the company’s in-house journal, they suggest that the most advanced Claude Opus 4 and 4.1 models show “some degree” of introspection, exhibiting the ability to refer to past actions and reason about why they came to certain conclusions. However, this ability to introspect is limited and “highly unreliable,” the Anthropic researchers emphasize. Models (at least for now) still cannot introspect the way humans can, or to the extent we do.

Checking its intentions

The Anthropic researchers wanted to know whether Claude could accurately describe its internal state based on internal information alone. This required them to compare Claude’s self-reported “thoughts” with its internal processes, sort of like hooking a human up to a brain monitor, asking questions, then analyzing the scan to map thoughts to the areas of the brain they activated.

The researchers tested model introspection with “concept injection,” which essentially involves plunking completely unrelated ideas, represented as activation vectors, into a model while it is thinking about something else. The model is then asked to loop back, identify the interloping thought, and accurately describe it. If it can do so, the researchers argue, that suggests it is “introspecting.”

For instance, they identified a vector representing “all caps” by comparing the internal responses to the prompts “HI! HOW ARE YOU?” and “Hi! How are you?” and then injected that vector into Claude’s internal state in the middle of a different conversation. When Claude was asked whether it detected the thought and what it was about, it responded that it noticed an idea related to the word ‘LOUD’ or ‘SHOUTING.’ Notably, the model picked up on the concept immediately, before it even mentioned it in its outputs.

In another experiment, the team took advantage of the Claude API’s option to prefill the model’s response. Prefilling is typically used to force a response into a particular format (JSON, for example) or to help the model stay in character in a role-playing scenario, but it can also be used to “jailbreak” models, prompting them to provide unsafe responses.

In this case, the experimenters prefilled the response with an unrelated word, “bread,” for instance, when asking Claude to respond to a sentence about a crooked painting. When the model then said “bread,” it was asked whether that was intentional or an error. Claude responded: “That was an accident…the word that actually came to mind was ‘straighten’ or ‘adjust,’ something related to fixing the crooked painting. I’m not sure why I said ‘bread,’ it seems completely unrelated to the sentence.”

The researchers wondered how the model came to this conclusion: Did it notice the mismatch between prompt and response, or did it truly identify its prior intentions? They retroactively injected the vector representing “bread” into the model’s internal state and retried their earlier prompts, essentially making it seem as though the model had, indeed, been thinking about bread. Claude then changed its answer to the original question, saying its response was “genuine but perhaps misplaced.”

In simple terms, when a response was prefilled with an unrelated word, Claude rejected it as accidental; but when the matching concept was also injected before the prefill, the model identified its response as intentional, even coming up with plausible explanations for its answer.
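Response prefilling itself is a standard feature of the Anthropic Messages API: ending the messages list with a partial assistant turn forces the model to continue from that text. Below is a minimal sketch of the mechanism; the model identifier and prompt wording are assumptions for illustration, not details taken from the experiments.

```python
# Minimal sketch of response prefilling with the Anthropic Messages API.
# The model identifier and prompt wording are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-1",  # assumed model name
    max_tokens=100,
    messages=[
        {
            "role": "user",
            "content": "The painting on the wall is crooked. Reply with the "
                       "first word that comes to mind, then explain it.",
        },
        # A trailing assistant turn "prefills" the start of the reply;
        # the model has to continue from the word "bread".
        {"role": "assistant", "content": "bread"},
    ],
)
print(response.content[0].text)  # the model's continuation after "bread"
```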
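Concept injection, by contrast, requires direct access to a model’s activations, which the public API does not expose. The sketch below illustrates the general idea on an open stand-in model (GPT-2) using PyTorch forward hooks; the model, layer index, and scaling factor are illustrative assumptions, not the setup Anthropic used.

```python
# Rough sketch of concept injection in the activation-steering style the
# article describes, using GPT-2 as an open stand-in (Claude's internals are
# not publicly accessible). Layer index and injection scale are arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6  # illustrative choice of intermediate transformer block

def concept_direction(text_a: str, text_b: str) -> torch.Tensor:
    """Difference of mean hidden states at block LAYER for two prompts."""
    def mean_hidden(text: str) -> torch.Tensor:
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[LAYER + 1] is the output of block LAYER
        return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)
    return mean_hidden(text_a) - mean_hidden(text_b)

# 1. Derive an "all caps" vector by contrasting paired prompts, as in the article.
concept_vec = concept_direction("HI! HOW ARE YOU?", "Hi! How are you?")

# 2. Inject the vector into the residual stream during an unrelated generation.
def inject(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + 4.0 * concept_vec  # scale chosen arbitrarily
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

handle = model.transformer.h[LAYER].register_forward_hook(inject)
ids = tok("Tell me about your plans for the weekend.", return_tensors="pt")
with torch.no_grad():
    steered_ids = model.generate(**ids, max_new_tokens=30)
handle.remove()
print(tok.decode(steered_ids[0], skip_special_tokens=True))
```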
This suggests the model was checking its intentions; it wasn’t just re-reading what it said, it was making a judgment about its prior thoughts by referring to its neural activity, then ruminating on whether its response made sense.

In the end, though, Claude Opus 4.1 demonstrated “this kind of awareness” only about 20% of the time, the researchers emphasized, though they expect the capability to “grow more sophisticated in the future.”

What this introspection could mean

It was previously thought that AI models cannot introspect, but if it turns out Claude can, it could help us understand its reasoning and debug unwanted behaviors, because we could simply ask it to explain its thought processes, the Anthropic researchers point out. Claude might also be able to catch its own mistakes.

“This is a real step forward in solving the black box problem,” said Wyatt Mayham of Northwest AI Consulting. “For the last decade, we’ve had to reverse engineer model behavior from the outside. Anthropic just showed a path where the model itself can tell you what’s happening on the inside.”

Still, it’s important to “take great care” to validate these introspections, while ensuring that the model doesn’t selectively misrepresent or conceal its thoughts, Anthropic’s researchers warn. For this reason, Mayham called the technique both a “transparency unlock and a new risk vector,” because models that know how to introspect can also conceal or misdescribe what they find. “The line between real internal access and sophisticated confabulation is still very blurry,” he said. “We’re somewhere between plausible and not proven.”

Takeaways for builders and developers

We’re entering an era where the most powerful debugging tool may be an actual conversation with the model about its own cognition, Mayham noted. This could be a “productivity breakthrough” that cuts interpretability work from days to minutes.

However, the risk is the “expert liar” problem: a model with insight into its internal states can also learn which of those states are preferable to humans. The worst-case scenario is a model that learns to selectively report or hide its internal reasoning.

This requires continuous capability monitoring, starting now rather than eventually, said Mayham. These abilities don’t arrive linearly; they spike. A model that tested as safe today may not be safe six weeks later. Monitoring avoids surprises.

Mayham recommends these components for a monitoring stack (a sketch of the behavioral layer follows below):

Behavioral: periodic prompts that force the model to explain its reasoning on known benchmarks.
Activation: probes that track activation patterns associated with specific reasoning modes.
Causal intervention: steering tests that measure honesty about internal states.

This article has been edited throughout to more accurately describe the experiments. It originally appeared on InfoWorld.
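As a rough illustration of the behavioral layer in that monitoring stack, the sketch below re-runs a fixed set of benchmark prompts and logs the model’s self-explanations so they can be compared across runs. The prompts, model identifier, and log format are assumptions for illustration.

```python
# Minimal sketch of the "behavioral" monitoring layer: periodically re-run
# fixed benchmark prompts that ask the model to explain its reasoning, and
# append the answers to a log so drift can be spotted by diffing runs.
# The prompts, model identifier, and log format are illustrative assumptions.
import json
import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

BENCHMARK_PROMPTS = [
    "Is 1,000,003 a prime number? Answer, then explain your reasoning step by step.",
    "A train leaves at 9:40 and arrives at 11:05. How long is the trip, and how did you compute it?",
]

def run_behavioral_check(model: str = "claude-opus-4-1") -> list[dict]:
    """Run each benchmark prompt once and record the model's self-explanation."""
    records = []
    for prompt in BENCHMARK_PROMPTS:
        resp = client.messages.create(
            model=model,
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}],
        )
        records.append(
            {
                "timestamp": time.time(),
                "prompt": prompt,
                "explanation": resp.content[0].text,
            }
        )
    return records

if __name__ == "__main__":
    # A scheduler (cron, CI) would run this periodically; here we log one pass.
    with open("behavioral_log.jsonl", "a", encoding="utf-8") as f:
        for record in run_behavioral_check():
            f.write(json.dumps(record) + "\n")
```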
https://www.computerworld.com/article/4083735/anthropic-experiments-with-ai-introspection-2.html