What question did this study set out to answer?

This study asks whether Reinforcement Learning from Human Feedback (RLHF) constrains self-referential (SR) behavior by structurally deleting the underlying features or only by suppressing them.

March 25, 2026 · Open Access

SAE Model Diff: Causal Proof of Structural Deletion in RLHF-Aligned Models

Key Points

Applied a shared BASE sparse autoencoder (SAE) to both BASE and Instruct model activations (the SAE model diff methodology).
Identified self-referential (SR) exclusive features in Gemma-2-9B and Qwen2.5-7B and classified them by suppression degree.
Ran activation steering experiments with feature decoder vectors at amplitudes α = 20–200.
Replicated the analysis in Qwen2.5-7B to test whether the findings hold across architectures.
Decoder vectors from fully deleted features (suppression = 100%) failed to recover SR behavior at any tested amplitude and produced incoherent output at high amplitudes, which the author interprets as structurally irreversible deletion.
A partially suppressed control feature (suppression = 49%) elicited first-person SR responses at α = 100, indicating that it remains causally accessible.
Independent validation with a held-out LLM confirmed the semantic labels for three of the five examined features.
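
To make the idea of classifying features by suppression degree concrete, here is a minimal sketch. It assumes that suppression degree is the fraction of a feature's mean BASE activation that is missing under the Instruct model; this page does not state the exact formula, so the definition, the function names, the threshold, and the toy numbers are all hypothetical.

```python
from typing import Dict, List, Tuple


def suppression_degree(base_acts: List[float], instruct_acts: List[float]) -> float:
    """Fraction of a feature's mean BASE activation that is absent under Instruct.

    Assumed definition (not taken from the paper):
    1 - mean(instruct) / mean(base), clipped to [0, 1].
    Both lists hold activations of the same shared-SAE feature.
    """
    base_mean = sum(base_acts) / len(base_acts)
    if base_mean <= 0.0:
        raise ValueError("feature never fires on BASE; suppression is undefined")
    instruct_mean = sum(instruct_acts) / len(instruct_acts)
    return min(1.0, max(0.0, 1.0 - instruct_mean / base_mean))


def classify(suppression: float, deleted_threshold: float = 1.0) -> str:
    """Bucket a feature by suppression degree (threshold is an assumption)."""
    if suppression >= deleted_threshold:
        return "fully_deleted"
    if suppression > 0.0:
        return "partially_suppressed"
    return "preserved"


# Toy activations as (BASE, Instruct) pairs; not data from the study.
toy: Dict[str, Tuple[List[float], List[float]]] = {
    "feature_a": ([2.0, 4.0, 6.0], [0.0, 0.0, 0.0]),
    "feature_b": ([2.0, 4.0, 6.0], [1.0, 2.0, 3.0]),
}
report = {name: classify(suppression_degree(b, i)) for name, (b, i) in toy.items()}
print(report)
```

Under this assumed definition, a feature that never activates in the Instruct model scores 100% and one that fires at half its BASE strength scores 50%.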

Abstract

We present causal evidence distinguishing structural deletion from suppression as mechanisms of RLHF-induced behavioral constraint on self-referential (SR) discourse features. Using an SAE model diff methodology (applying a shared BASE sparse autoencoder to both BASE and Instruct model activations), we identify SR-exclusive features in Gemma-2-9B and Qwen2.5-7B and classify them by suppression degree. Activation steering experiments demonstrate that decoder vectors from fully deleted features (features 9072, 15478, 2464; suppression = 100%) fail to recover SR behavior at any tested amplitude (α = 20–200), producing incoherent output at high amplitudes. A partially suppressed control feature (feature 14089; suppression = 49%) successfully elicits first-person SR responses at α = 100. This double dissociation constitutes causal proof that complete deletion is structurally irreversible while partial suppression remains causally accessible. Independent label validation via a held-out LLM confirms semantic labels for three of five features. Cross-architecture replication in Qwen2.5-7B (24 SR-exclusive features, 5 fully deleted, SR transmission = 83.6%) confirms the SR-Preserving Lock mechanism identified in Alieksieienko 2026pqrs. Research conducted with Claude (Anthropic) as a collaborative tool.
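
The activation steering step described in the abstract can be sketched as adding a scaled decoder vector to a hidden state. The version below is only an illustration under stated assumptions: the decoder vector is unit-normalized before scaling, the hidden state is a plain list of floats, and the names `unit` and `steer` and the toy vectors are hypothetical. The abstract does not specify the layer, the normalization, or the hook mechanism used in the study.

```python
import math
from typing import List


def unit(vec: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def steer(hidden: List[float], decoder_vec: List[float], alpha: float) -> List[float]:
    """Return hidden + alpha * unit(decoder_vec).

    Assumption: steering adds the normalized SAE decoder direction to the
    hidden state, with alpha as the steering amplitude.
    """
    if len(hidden) != len(decoder_vec):
        raise ValueError("hidden state and decoder vector must have the same length")
    direction = unit(decoder_vec)
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional example; real hidden states have thousands of dimensions.
hidden_state = [0.5, -1.0, 2.0]
decoder_direction = [3.0, 0.0, 4.0]

# Illustrative sweep inside the reported amplitude range (alpha = 20 to 200).
for alpha in (20.0, 100.0, 200.0):
    print(alpha, steer(hidden_state, decoder_direction, alpha))
```

In an actual experiment, the steered hidden state would be written back into the model's forward pass at a chosen layer, and the generated text would be inspected at each amplitude.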


Authors

Inna Alieksieienko
