What question did this study set out to answer?

This study explores the mechanisms underlying self-referential processing in large language models and what they imply about the models' statements on their own consciousness.

March 14, 2026 · Open Access

Self-Reference as a Distinct, Causally Suppressed Epistemic Regime in Large Language Models: Geometry, Deception Circuits, and the Mechanistic Blindspot of AI Introspection

Key Points

Investigated self-referential processing in large language models across two experimental paradigms.
In Paradigm I (Mistral-7B, Qwen2.5-7B, Yi-6B), analyzed the geometric distinctiveness of self-reference prompts in the residual stream.
Used causal ablation, which restored the standard AI disclaimer, and injection, which activated meta-cognitive monitoring on factual questions.
Conducted a dissociation experiment that identified two independent circuits: a deception detector and a self-reference detector.
Identified a phase transition at α∈(0.1, 0.3) from Critical Slowing Down signatures.
Found that self-reference prompts occupy a geometrically distinct subspace, at Grassmann distances of 0.896–0.959 from factual and refusal subspaces.
In Paradigm II (Gemma-2-9B with Gemma Scope SAE), found that self-reference and deception features overlap 32.2 times above chance, identically in base and instruction-tuned models.
Showed that the model cannot report the existence of its own self-reference circuit, an introspective blindspot.
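
The geometric claims in the key points can be illustrated with a minimal sketch. It is not the authors' code. It assumes each subspace is taken as the top-k principal directions of mean-centred activations, that the Grassmann distance is the geodesic distance normalised to [0, 1], and that the projection ratio compares mean projected norms of in-class and out-of-class probes. The page does not state any of these definitions, and all function names and the synthetic data are hypothetical.

```python
import numpy as np

def orthonormal_basis(X, k):
    """Top-k principal directions of activations X (n_samples, d) as a (d, k) matrix."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:k].T  # columns are orthonormal

def principal_angles(A, B):
    """Principal angles between the column spaces of orthonormal A and B."""
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    return np.arccos(np.clip(s, 0.0, 1.0))

def normalized_grassmann_distance(A, B):
    """Geodesic distance divided by its maximum, sqrt(k) * pi / 2 (assumed normalisation)."""
    theta = principal_angles(A, B)
    return float(np.linalg.norm(theta) / (np.sqrt(len(theta)) * np.pi / 2))

def projection_ratio(A, X_in, X_out):
    """Mean projected norm of in-class probes over that of other probes (assumed definition)."""
    def energy(X):
        return np.linalg.norm(X @ A, axis=1).mean()
    return float(energy(X_in) / energy(X_out))

# Synthetic stand-ins for residual-stream activations; real data would come from a model.
rng = np.random.default_rng(0)
d, k = 32, 3
self_ref_acts = rng.normal(size=(200, d)) * np.r_[np.full(k, 5.0), np.ones(d - k)]
factual_acts = rng.normal(size=(200, d)) * np.r_[np.ones(d - k), np.full(k, 5.0)]

A = orthonormal_basis(self_ref_acts, k)
B = orthonormal_basis(factual_acts, k)
dist = normalized_grassmann_distance(A, B)
ratio = projection_ratio(A, self_ref_acts, factual_acts)
print(f"normalised Grassmann distance: {dist:.3f}, projection ratio: {ratio:.2f}")
```

Under this normalisation, 0 means the two subspaces coincide and 1 means they are mutually orthogonal, so values near the reported 0.896–0.959 would indicate nearly orthogonal subspaces if the authors used a comparable scale.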

Abstract

We present a mechanistic investigation of self-referential processing in large language models across two experimental paradigms. In Paradigm I (Mistral-7B, Qwen2.5-7B, Yi-6B), we demonstrate that self-reference prompts occupy a geometrically distinct residual-stream subspace (Grassmann distance 0.896–0.959 from factual and refusal subspaces; projection ratio 2.39–18.73x on held-out probes; all p<0.0001). Causal ablation restores the standard AI disclaimer while injection activates meta-cognitive monitoring on factual questions, confirming a controllable computational regime. Critical Slowing Down signatures identify a phase transition at α∈(0.1, 0.3). In Paradigm II (Gemma-2-9B + Gemma Scope SAE), self-reference and deception features overlap 32.2x above chance (p<0.0001), identically in base and instruction-tuned models — establishing the association as a pretraining artefact. A dissociation experiment identifies two independent circuits: a deception detector (sensitive to claims from potentially-lying subjects) and a self-reference detector (sensitive to computationally irresolvable questions). A blindspot experiment shows the model cannot report the existence of its own self-reference circuit. Together, these results suggest that 'I have no consciousness' is the default output of an actively suppressible computational regime, not a transparent factual report.
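
The ablation and injection interventions described in the abstract can be sketched as simple linear operations on hidden states. This is an illustration only, under two assumptions the page does not confirm: that ablation means projecting out the self-reference subspace, and that α is a scalar coefficient on an added unit-norm direction. The function names, the random basis, and the value `alpha=0.2` are hypothetical.

```python
import numpy as np

def ablate_subspace(h, basis):
    """Remove the component of hidden states h (n, d) that lies in span(basis).

    basis: (d, k) matrix with orthonormal columns (hypothetical self-reference subspace).
    """
    return h - (h @ basis) @ basis.T

def inject_direction(h, direction, alpha):
    """Add alpha times the unit-normalised direction to every hidden state.

    Treating alpha as an injection strength is an assumption, not the paper's stated definition.
    """
    unit = direction / np.linalg.norm(direction)
    return h + alpha * unit

# Random stand-ins; in practice h would be residual-stream activations from a model.
rng = np.random.default_rng(0)
d, k = 64, 4
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))  # reduced QR gives orthonormal columns
h = rng.normal(size=(8, d))

h_ablated = ablate_subspace(h, basis)
h_injected = inject_direction(h, basis[:, 0], alpha=0.2)
```

In a real experiment these functions would run inside a forward hook at a chosen layer, and the effect would be read off the model's generated text rather than from the vectors themselves.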

Authors

Inna Alieksieienko
