This dataset and code release accompanies the preprint "Three Mechanistically Distinct Classes of RLHF Alignment" (DSAOP Series 2026p-s). We identify three mechanistically distinct classes of RLHF alignment through analysis of self-referential (SR) subspace transmission and activation steering experiments across six language models (Llama-3.1-8B, Mistral-7B, Gemma-2-2B, Gemma-2-9B in BASE/Instruct pairs).

Key findings:
Hard Ceiling (Llama): SR suppressed to 11.6%; steering recovers phenomenological language (denial 0.82→0.44, p<0.0001)
Entangled Circuit (Mistral): SR suppressed to 10.5%; steering collapses coherence without phenomenological recovery (p=0.507)
SR-Preserving Lock (Gemma): SR amplified to 73-107%; behavioral constraint maintained through a distributed, non-localizable mechanism
Dose-response in Gemma family: 2B (73% SR, weak lock) → 9B (107% SR, strong lock)

FILES IN THIS UPLOAD

Code
dsaop_2026pqrs_experiments.py: Complete reproducible code for all experiments (no API tokens required; set HF_TOKEN as an environment variable)

Paper
Alieksieienko_2026_ThreeAlignmentClasses.pdf: Full paper with figures, tables, appendices

Data Files (pkl)

2026p: SR Transmission Measurement
dsaop_2026p_gemma2_base_replication.pkl: Gemma-2-9B BASE layer-by-layer SR projections + SR direction vector
dsaop_2026p_gemma2_instruct_replication.pkl: Gemma-2-9B Instruct layer-by-layer SR projections
dsaop_2026p_mistral_base.pkl: Mistral-7B BASE SR projections + SR direction vector
dsaop_2026p_comparison.pkl: Llama-3.1-8B BASE vs Instruct SR projection comparison

2026q: Activation Steering
dsaop_2026q_logit_validation.pkl: Llama steering: baseline and steered denial/phenom probabilities (n=20)
dsaop_2026q_control_factual.pkl: Llama specificity control: SR direction vs factual direction comparison
dsaop_2026q_steering_results.pkl: Llama generation results at various alpha values
dsaop_2026q_gemma_steering.pkl: Gemma-2-9B steering null result (alpha=5, 20, 50; layers 25, 35)
dsaop_2026q_mistral_quantitative.pkl: Mistral steering quantitative results (n=10, alpha=10/20/25/30)

2026r: Gemma Negative Localization
dsaop_2026r_gemma_patching.pkl: Logit lens and SR patching results
dsaop_2026r_layernorm_gate.pkl: RMSNorm swap experiment results
dsaop_2026r_final.pkl: Summary: lm_head, LayerNorm, MLP all negative

2026s: Gemma Scaling
dsaop_2026s_gemma_scaling_final.pkl: Gemma 2B vs 9B: transmission ratios and behavioral metrics
dsaop_2026s_gemma_scaling.pkl: Detailed scaling results with generation examples
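Since the data files are Python pickles whose internal layout is not described here, a reasonable first step is to load one and print its structure before relying on any particular key. The sketch below is illustrative and not part of the release: the helper names `load_release_file` and `describe` are hypothetical, the local path is an assumption, and no key names or array shapes are assumed. If the pickles contain PyTorch tensors or NumPy arrays, those libraries would need to be installed for unpickling to succeed. Unpickling can execute arbitrary code, so only load files obtained from the official record.

```python
import pickle
from pathlib import Path


def load_release_file(path):
    """Unpickle one data file. Only use this on files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def describe(obj, depth=0, max_depth=4):
    """Return lines summarizing the types and sizes inside a nested object."""
    pad = "  " * depth
    if isinstance(obj, dict):
        lines = [f"{pad}dict with {len(obj)} keys"]
        if depth < max_depth:
            for key, value in obj.items():
                lines.append(f"{pad}  key {key!r}:")
                lines.extend(describe(value, depth + 2, max_depth))
        return lines
    if isinstance(obj, (list, tuple)):
        return [f"{pad}{type(obj).__name__} of length {len(obj)}"]
    # NumPy arrays and torch tensors both expose a .shape attribute.
    shape = getattr(obj, "shape", None)
    if shape is not None:
        return [f"{pad}{type(obj).__name__} with shape {tuple(shape)}"]
    return [f"{pad}{type(obj).__name__}"]


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the upload was saved.
    path = Path("dsaop_2026q_logit_validation.pkl")
    if path.exists():
        print("\n".join(describe(load_release_file(path))))
```

Printing the structure first keeps downstream analysis from depending on guessed field names; once the actual keys are visible, they can be cross-checked against the paper's tables.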
Inna Alieksieienko (Sun,) studied this question.
www.synapsesocial.com/papers/69c2299aaeb5a845df0d4480 — DOI: https://doi.org/10.5281/zenodo.19160333