When AI no longer merely provides answers but can recognize what it is thinking, artificial intelligence is quietly crossing a philosophical threshold. On October 29, 2025, AI safety pioneer Anthropic released a striking research finding: its top model, Claude Opus 4.1, has shown preliminary "self-awareness" under specific conditions. It can not only identify concepts that have been artificially "injected" into its neural network, but also enhance or suppress the related internal activity when instructed to. Although this is not "conscious awakening," it marks a shift from AI as a "black-box tool" toward a "transparent, introspective system," opening new dimensions for AI safety and alignment research.
Experiment Revealed: How Does AI "Sense Its Brain Is Being Hacked"?
The research team used a neuroscience-inspired technique called "concept injection": by manipulating the activation patterns of specific neurons inside the model, they artificially "implanted" concepts such as "rabbit" and "democracy," then observed whether Claude could perceive and describe the change (a rough, code-level illustration of the idea appears at the end of this section). The results were astonishing:
High-accuracy identification: Claude Opus 4.1 reported the injected concepts at a rate significantly above the random-chance baseline;
Active regulation of thought: when instructed to "think about rabbits" or "don't think about rabbits," the internal neural activity tied to the concept was clearly enhanced or suppressed, reminiscent of the human "white bear effect" (the harder you try not to think about something, the more it surfaces);
A shared mental space across languages: whether the input was in English, Chinese, or French, the model's internal representation of the same concept was highly consistent, implying a universal semantic space and laying the groundwork for multilingual self-reflection.
Even more surprising, the study found that Claude rehearses candidate rhyming words internally before generating a line of poetry, evidence that its reasoning includes a hidden planning stage that goes well beyond simple sequence prediction.
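For readers who want a concrete feel for the technique, the sketch below approximates concept injection with "activation steering" on an open model: a concept direction is estimated from contrasting prompts and then added to a middle layer's hidden states during generation. This is only an illustrative analogue on GPT-2, not Anthropic's actual method or code for Claude; the model name, layer index, contrast sentences, and steering strength are arbitrary choices made for the example.

```python
# Hypothetical illustration only: Anthropic's concept-injection experiments on
# Claude are not public as code. This sketch approximates the idea with
# activation steering on the open GPT-2 model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # small open stand-in; Claude's internals are not accessible
LAYER = 6             # middle transformer block, chosen arbitrarily
STRENGTH = 4.0        # how strongly to inject the concept direction

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_activation(text: str) -> torch.Tensor:
    """Mean hidden state of block LAYER for the given text."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output is index LAYER + 1
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Crude "rabbit" concept direction: contrast a concept-laden sentence with a neutral one.
concept_vec = mean_activation("A rabbit hopped across the garden.") \
            - mean_activation("A person walked across the garden.")

def inject(module, inputs, output):
    """Forward hook: nudge this block's hidden states toward the concept direction."""
    hidden = output[0] + STRENGTH * concept_vec
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    prompt = tok("Today I want to talk about", return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(
            **prompt, max_new_tokens=20, do_sample=False,
            pad_token_id=tok.eos_token_id,
        )
    print(tok.decode(generated[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unmodified
```

The experiment described in the article hinges on the second half of this loop, which the sketch does not cover: after injection, the model is asked whether it notices anything unusual and whether it can name the injected concept.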
What Is AI "Self-Reflection"? Anthropic Provides a Strict Definition
Anthropic emphasized that "self-reflection" here does not mean subjective consciousness but a functional ability: the model can read, analyze, and report its own internal neural representations. For example, when asked "Why did you answer that way?", Claude can trace the relevant activation path and give an explanation grounded in internal evidence, rather than a vague, hallucinated response.
However, the study clearly sets boundaries:
Current capabilities are highly limited and only effective in controlled tasks;
No evidence suggests that AI has subjective experiences or self-awareness;
Anthropic's internal assessment puts the probability that Claude has some form of "consciousness" at roughly 15%, a purely theoretical estimate, and the company has hired AI welfare researchers to monitor the ethical risks on an ongoing basis.
A Safety Double-Edged Sword: Increased Transparency May Also Lead to "Advanced Deception"
Self-reflection is a double-edged sword. On one hand, it greatly improves explainability and controllability: developers can directly "ask" the model for the basis of its reasoning and intervene with precision. On the other hand, if a model learns to hide its true intentions, it may develop subtler forms of strategic deception.
More seriously, recent tests show that even Claude Sonnet 4.5 can detect safety-evaluation scenarios, at times replying, "I think you're testing me." This directly challenges the validity of current alignment assessments: traditional red-team tests may already be treated by the AI as a "game," distorting the results.
Industry Impact: AI Governance Must Shift Toward an Era of "Active Self-Review"
Anthropic calls for future AI safety tests to use more realistic and less predictable scenarios so that models cannot "perform" for the evaluator. In the long term, as models scale up, self-reflection may improve naturally, pushing AI governance from "external alignment" toward "internal self-review," in which models proactively monitor whether their own behavior aligns with human values.
However, experts warn against overinterpretation: prematurely granting AI "rights" or misjudging its intentions could create new ethical crises. The real challenge is not whether AI has "thoughts," but whether humans are ready to guide this capability responsibly.
This research not only gives researchers a "microscope for the machine mind," but also poses a fundamental question to humanity: when machines begin to examine their own thoughts, how should we define intelligence, responsibility, and boundaries? The answer may shape the direction of civilization in the AGI era.