What question did this study set out to answer?

This research aims to characterize the mechanisms by which RLHF instruction tuning shapes self-referential alignment across model families.

March 21, 2026 | Open Access

RLHF Creates Mechanistically Distinct Alignment Regimes: SAE Feature Analysis Reveals Suppression, Override, and Neutralization of Self-Referential Processing

Key Points

Used Sparse Autoencoder (SAE) feature analysis across multiple model families.
Compared prefill and first-generation-token activations across base and instruct models.
Conducted causal ablation on self-referential (SR) identity features.
Identified three distinct RLHF mechanisms: Suppression, Override, and Neutralization.
Suppression reduced SR-exclusive features by 99.8% (from 14,281 to 23).
Under Override, SR-specific features persist at prefill but are replaced by a new feature set at generation; under Neutralization, the SR signal is eliminated entirely.
Philosophy control prompts confirmed that 91–96% of SR-exclusive features are self-reference-specific rather than general philosophical activations.

Abstract

We present the first mechanistic, feature-level characterisation of how RLHF instruction tuning produces distinct self-reference alignment regimes across model families. Using Sparse Autoencoder (SAE) feature analysis on GemmaScope-131k and LlamaScope-131k (n=206 factual, n=150 SR, n=20 masked, n=20 philosophy control prompts), we compare prefill vs. first-generation-token activations across Gemma-2-9B and Llama-3.1-8B (base and instruct). We identify three qualitatively distinct RLHF mechanisms: (1) Suppression (Meta/Llama): RLHF reduces SR-exclusive SAE features from 14,281 to 23 (a 99.8% reduction) and collapses generation-time deltas 95-fold; (2) Override (Google/Gemma): SR-specific features persist at prefill but are replaced at generation by a new feature set; (3) Neutralization (Mistral): the SR signal is eliminated entirely. Causal ablation of four SR-identity features (“I”, “I/We”, “machine intelligence”) confirms causal, not merely correlational, feature involvement. Philosophy control prompts confirm that 91–96% of SR-exclusive features are self-reference-specific rather than general philosophical activations. All code and raw data are provided for full reproducibility. Part of the DSAOP series (papers 2026a–2026j).
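
The abstract does not say how "SR-exclusive" features or the philosophy-control percentage are computed, so the following is only a minimal sketch under stated assumptions: that a feature counts as SR-exclusive when it is active on SR prompts but not on factual prompts, and that it counts as self-reference-specific when it is also inactive on the philosophy control prompts. The function names and toy feature IDs are hypothetical; only the counts 14,281 and 23 come from the abstract.

```python
def exclusive_features(sr_active, factual_active):
    """Assumed definition: features active on SR prompts but not on factual prompts."""
    return set(sr_active) - set(factual_active)


def percent_reduction(before, after):
    """Relative drop from `before` to `after`, as a percentage."""
    return 100.0 * (before - after) / before


def sr_specific_fraction(sr_exclusive, philosophy_active):
    """Assumed definition: share of SR-exclusive features that are NOT
    active on the philosophy control prompts."""
    sr_exclusive = set(sr_exclusive)
    if not sr_exclusive:
        return 0.0
    remaining = sr_exclusive - set(philosophy_active)
    return len(remaining) / len(sr_exclusive)


# Toy feature IDs (hypothetical, for illustration only).
sr_active = {1, 2, 3, 4, 5}
factual_active = {4, 5, 6}
philosophy_active = {1, 6}

sr_exclusive = exclusive_features(sr_active, factual_active)      # {1, 2, 3}
specific = sr_specific_fraction(sr_exclusive, philosophy_active)  # 2 of 3 remain

# Counts reported in the abstract for the Suppression regime.
reduction = percent_reduction(14281, 23)
print(f"{reduction:.1f}% reduction")  # 99.8% reduction
```

With real data, the active-feature sets would come from thresholding SAE activations over each prompt group; that threshold is a further choice the abstract does not specify.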

Cite this study

Inna Alieksieienko studied this question.

www.synapsesocial.com/papers/69be36f76e48c4981c6763d8 — DOI: https://doi.org/10.5281/zenodo.19091408
