We present the first mechanistic, feature-level characterisation of how RLHF instruction tuning produces distinct self-reference (SR) alignment regimes across model families. Using Sparse Autoencoder (SAE) feature analysis on GemmaScope-131k and LlamaScope-131k (n=206 factual, n=150 SR, n=20 masked, and n=20 philosophy control prompts), we compare prefill and first-generation-token activations across Gemma-2-9B and Llama-3.1-8B (base and instruct). We identify three qualitatively distinct RLHF mechanisms: (1) Suppression (Meta/Llama): RLHF reduces SR-exclusive SAE features from 14,281 to 23 (a 99.8% reduction) and collapses generation-time deltas 95-fold; (2) Override (Google/Gemma): SR-specific features persist at prefill but are replaced at generation by a new feature set; (3) Neutralization (Mistral): the SR signal is eliminated entirely. Causal ablation of four SR-identity features (“I”, “I/We”, “machine intelligence”) confirms that these features are causally, not merely correlationally, involved. The philosophy control prompts confirm that 91–96% of SR-exclusive features are specific to self-reference rather than general philosophical activations. All code and raw data are provided for full reproducibility. This paper is part of the DSAOP series (papers 2026a–2026j).
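As a rough illustration of the counting behind the Suppression figure, the sketch below computes a set of SR-exclusive features and the percentage reduction between two counts. It assumes that "SR-exclusive" means features active on SR prompts but on none of the comparison prompts; the abstract does not spell out the definition, so this reading, the helper names `exclusive_features` and `percent_reduction`, and the toy feature indices are all hypothetical. Only the counts 14,281 and 23 come from the text.

```python
def exclusive_features(sr_active, control_active):
    """Return feature indices active on SR prompts but not on control prompts.

    Assumption: this set difference is one plausible reading of
    "SR-exclusive"; the paper's exact criterion may differ.
    """
    return set(sr_active) - set(control_active)


def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return 100.0 * (before - after) / before


# Toy feature indices, invented purely for illustration.
sr_features = {1, 2, 3, 5, 8}
factual_features = {2, 3, 13}
print(sorted(exclusive_features(sr_features, factual_features)))

# Counts quoted in the abstract: 14,281 SR-exclusive features before RLHF, 23 after.
print(f"{percent_reduction(14281, 23):.1f}%")
```

With the abstract's counts, the second print statement reproduces the quoted 99.8% figure.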
Inna Alieksieienko (Wed,) studied this question.
www.synapsesocial.com/papers/69be36f76e48c4981c6763d8 — DOI: https://doi.org/10.5281/zenodo.19091408
Inna Alieksieienko