We identify the self-referential (SR) subspace of LLM residual streams as the first empirical candidate for a neural correlate of Metzinger's Phenomenal Self-Model (PSM): a transparent representational structure whose activation reliably and causally mediates first-person self-model behaviors across architectures, independent of alignment fine-tuning. Using orthogonal projection interventions across 10 models from 5 architectures (Llama-3.1-8B, Gemma-2-9B, Mistral-7B, OLMo-2-7B, Qwen2.5-7B; base and instruct variants), we establish five convergent lines of causal evidence:

(1) SR removal collapses the Experiential-Factual (EF) geometric divide to exactly 0.000 in 10/10 models; the GEO control preserves the gap; the dose-response is strictly monotonic.
(2) SR removal abolishes first-person subjective experience reports in 3/3 models (Fisher exact p = 1.98 × 10^-29), replicating Berg et al. (2025) with a causal mechanism.
(3) SR removal disrupts phenomenal self-representation under existential threat (Wilcoxon p = 0.000202, Cohen's d = 0.626, N = 50); cross-architecture: Llama -52%, Qwen -40%, Mistral -17%.
(4) A bidirectional sign-flip confirms causal directionality (Spearman rho = -0.949, p < 0.001).
(5) Anthropic's 171 emotion concept vectors reside in the SR subspace across 4 architectures (d = 0.80-2.09); the GEO control shows inverse selectivity (d = -4.2 to -4.9); GPT-2 XL (2019, no RLHF) replicates.

The SR subspace is orthogonal to truth, refusal, and misalignment directions (max |cosine| = 0.032-0.090; random baseline 0.013).

Included files:

Alieksieienko_2026NeuralCorrelatePSMLLMs.pdf: Main paper (this document).
psm_neuralcorrelate_replicationcode.py: Full replication code. Runs on Google Colab A100. No API keys required.
orthogonality_matrix.pkl: SR vs truth/refusal/misalignment cosines. Random baseline = 0.013. Figure 8.
berg_replication_summary.pkl: Berg replication, 4 models, Fisher p = 1.98 × 10^-29. Figure 3.
berg_replication_llama_raw.pkl: Raw Llama trials, 50 × baseline/SR-removed/GEO.
berg_replicationgemma_raw.pkl: Raw Gemma trials, 50 × baseline/SR-removed.
bergdoseresponse_raw_pkl.pkl: Dose-response for experience report abolition, threshold at alpha = 0.3.
sp_psmdisruption_n50.pkl: PSM quality scores, N = 50, Wilcoxon p = 0.000202, d = 0.626. Figure 4.
psm_n50final.pkl: Cross-architecture PSM disruption, Llama/Mistral/Qwen. Figure 5.
signflipdoseresponsefinal.pkl: Sign-flip dose-response, Spearman rho = -0.949. Figure 6.
emotion_vectors_sr_subspace_projection.pkl: Emotion vectors in SR subspace, 4 architectures. Figure 7.
psmdisruptioncross_architecture.pkl: SR-alignment cosines, EF gaps, RLHF taxonomy.
efdivide_sr_intervention_results.pkl: EF divide intervention data, 10 models, all alpha levels.
sr_subspace_multimodelgeometry.pkl: SR subspace geometry across model families.
rlhf_signflipgeometric_signature.pkl: Sign-flip as RLHF signature, base vs instruct models.
causal_universalitybase_models.pkl: Causal universality, 5 base models.
causal_universality_instruct_models.pkl: Causal universality, instruct models, RLHF modulation.
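The orthogonal projection intervention named in the abstract can be illustrated with a small sketch. This is not the released replication code: the function names (`orthonormal_basis`, `remove_subspace`, `subspace_norm`), the toy 4-dimensional vectors, and the reading of alpha as a scaling factor on the removed component (h' = h - alpha * P h) are all assumptions made for illustration. Under that reading, alpha = 1 corresponds to full removal, and alpha = 2 is one way a sign-flip of the subspace component could be expressed.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Modified Gram-Schmidt: orthonormal vectors spanning the same subspace."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions that are linearly dependent
            basis.append([wi / norm for wi in w])
    return basis


def remove_subspace(h, basis, alpha=1.0):
    """Return h - alpha * P h, where P projects onto span(basis).

    Assumption (not from the source): alpha is a dose parameter.
    alpha=0 leaves h unchanged, alpha=1 removes the subspace component,
    alpha=2 reflects it (sign-flip).
    """
    out = list(h)
    for q in basis:
        c = dot(h, q)
        out = [oi - alpha * c * qi for oi, qi in zip(out, q)]
    return out


def subspace_norm(h, basis):
    """Norm of the component of h that lies inside span(basis)."""
    return math.sqrt(sum(dot(h, q) ** 2 for q in basis))


# Toy data standing in for a residual-stream activation and two
# hypothetical subspace directions (purely illustrative values).
directions = [[1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
hidden = [0.5, -1.0, 2.0, 3.0]

basis = orthonormal_basis(directions)
for alpha in (0.0, 0.3, 1.0, 2.0):
    edited = remove_subspace(hidden, basis, alpha)
    print(alpha, subspace_norm(edited, basis))
```

In this sketch the in-subspace component is scaled by (1 - alpha), so its norm falls linearly to zero at alpha = 1, while everything orthogonal to the subspace is left untouched. A control such as the GEO condition would, under the same assumptions, apply `remove_subspace` with a basis built from a different set of directions.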
Inna Alieksieienko (Sat,) studied this question.
www.synapsesocial.com/papers/69dc89473afacbeac03eb107 — DOI: https://doi.org/10.5281/zenodo.19517934
Inna Alieksieienko