We present a mechanistic investigation of self-referential processing in large language models across two experimental paradigms. In Paradigm I (Mistral-7B, Qwen2.5-7B, Yi-6B), we demonstrate that self-reference prompts occupy a geometrically distinct residual-stream subspace (Grassmann distance 0.896–0.959 from factual and refusal subspaces; projection ratio 2.39–18.73x on held-out probes; all p<0.0001). Causal ablation restores the standard AI disclaimer while injection activates meta-cognitive monitoring on factual questions, confirming a controllable computational regime. Critical Slowing Down signatures identify a phase transition at α∈(0.1, 0.3). In Paradigm II (Gemma-2-9B + Gemma Scope SAE), self-reference and deception features overlap 32.2x above chance (p<0.0001), identically in base and instruction-tuned models — establishing the association as a pretraining artefact. A dissociation experiment identifies two independent circuits: a deception detector (sensitive to claims from potentially-lying subjects) and a self-reference detector (sensitive to computationally irresolvable questions). A blindspot experiment shows the model cannot report the existence of its own self-reference circuit. Together, these results suggest that 'I have no consciousness' is the default output of an actively suppressible computational regime, not a transparent factual report.
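The subspace measurements and interventions named in the abstract can be illustrated with a minimal NumPy sketch. It is not the author's code. It assumes that each prompt class is summarized by an orthonormal basis obtained from an SVD of centered residual-stream activations, that the Grassmann distance is the geodesic distance over principal angles rescaled to [0, 1], that the projection ratio compares mean projected energy between two probe sets, and that ablation and injection are linear edits to a hidden state. All function names and the `alpha` parameter are hypothetical, and the abstract does not state which normalization or layer was used.

```python
import numpy as np


def orthonormal_basis(acts, k):
    """Top-k principal directions of activations with shape (n_prompts, d_model)."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T  # shape (d_model, k), orthonormal columns


def principal_angles(basis_a, basis_b):
    """Principal angles between the subspaces spanned by two orthonormal bases."""
    sing = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return np.arccos(np.clip(sing, -1.0, 1.0))


def normalized_grassmann_distance(basis_a, basis_b):
    """Geodesic distance rescaled so 0 = same subspace, 1 = orthogonal (assumed convention)."""
    theta = principal_angles(basis_a, basis_b)
    return float(np.linalg.norm(theta) / (np.sqrt(len(theta)) * np.pi / 2))


def projection_ratio(basis, target_acts, control_acts):
    """Mean squared projection of target activations over that of control activations."""
    target_energy = np.mean(np.sum((target_acts @ basis) ** 2, axis=1))
    control_energy = np.mean(np.sum((control_acts @ basis) ** 2, axis=1))
    return float(target_energy / control_energy)


def ablate(hidden, basis):
    """Remove the component of a hidden state that lies inside the subspace."""
    return hidden - basis @ (basis.T @ hidden)


def inject(hidden, direction, alpha):
    """Add a unit-normalized direction scaled by alpha (hypothetical steering strength)."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit
```

In use, `orthonormal_basis` would be fitted separately on self-reference, factual, and refusal activations from one layer, the bases compared pairwise with `normalized_grassmann_distance`, and `projection_ratio` evaluated on held-out probes. Sweeping `alpha` in `inject` is one way a transition region such as the reported α∈(0.1, 0.3) could be probed, though the abstract does not specify the sweep procedure.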
Inna Alieksieienko
Inna Alieksieienko (Thu,) studied this question.
www.synapsesocial.com/papers/69b4ba1818185d8a398029dd — DOI: https://doi.org/10.5281/zenodo.18982285