We present causal evidence distinguishing structural deletion from suppression as mechanisms of RLHF-induced behavioral constraint on self-referential (SR) discourse features. Using an SAE model-diff methodology, in which a shared BASE sparse autoencoder (SAE) is applied to both BASE and Instruct model activations, we identify SR-exclusive features in Gemma-2-9B and Qwen2.5-7B and classify them by suppression degree. Activation steering experiments show that decoder vectors from fully deleted features (features 9072, 15478, and 2464; suppression = 100%) fail to recover SR behavior at any tested amplitude (α = 20–200) and produce incoherent output at high amplitudes. In contrast, a partially suppressed control feature (feature 14089; suppression = 49%) successfully elicits first-person SR responses at α = 100. This double dissociation constitutes causal proof that complete deletion is structurally irreversible while partial suppression remains causally accessible. Independent label validation with a held-out LLM confirms the semantic labels for three of five features. Cross-architecture replication in Qwen2.5-7B (24 SR-exclusive features, 5 fully deleted, SR transmission = 83.6%) confirms the SR-Preserving Lock mechanism identified in Alieksieienko 2026pqrs. Research conducted with Claude (Anthropic) as a collaborative tool.
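The two quantities the abstract relies on, suppression degree and steering amplitude α, can be illustrated with a minimal NumPy sketch. This is not the author's implementation. The abstract does not define suppression, so the formula below (the fractional drop in mean feature activation from BASE to Instruct) is an assumption, as are the function names `suppression_degree` and `steer`, the unit-normalization of the decoder vector, and the toy data.

```python
import numpy as np

def suppression_degree(base_acts, instruct_acts, eps=1e-8):
    """Assumed definition: fractional drop in mean SAE feature activation
    from BASE to Instruct, clipped to [0, 1].
    Inputs have shape (n_tokens, n_features)."""
    base_mean = np.mean(base_acts, axis=0)
    instruct_mean = np.mean(instruct_acts, axis=0)
    return np.clip(1.0 - instruct_mean / (base_mean + eps), 0.0, 1.0)

def steer(hidden, decoder_vec, alpha):
    """Add alpha times the unit-norm decoder direction to every token's
    hidden state. Normalizing the decoder vector is an assumption."""
    direction = decoder_vec / np.linalg.norm(decoder_vec)
    return hidden + alpha * direction

# Toy activations for two hypothetical features over two tokens.
base_acts = np.array([[2.0, 4.0], [2.0, 4.0]])
instruct_acts = np.array([[0.0, 2.0], [0.0, 2.0]])
supp = suppression_degree(base_acts, instruct_acts)

# Toy residual-stream states (3 tokens, width 4) and a hypothetical decoder vector.
hidden = np.zeros((3, 4))
decoder_vec = np.array([0.0, 3.0, 0.0, 4.0])
steered = steer(hidden, decoder_vec, alpha=100.0)
```

Under this assumed definition, the first toy feature is fully deleted (suppression of 1.0) and the second is partially suppressed (about 0.5), mirroring the two classes of feature contrasted in the steering experiments.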
Inna Alieksieienko
Inna Alieksieienko (Mon,) studied this question.
www.synapsesocial.com/papers/69c37ba2b34aaaeb1a67e3e3 — DOI: https://doi.org/10.5281/zenodo.19189190