Version 7 (April 2026): major update superseding v4. Full-depth per-head attribution analysis extended to all three model families (Qwen-2.5-3B, Phi-3.5-Mini, Llama-3.2-3B). Intervention validation with the benchmark-insensitive gate pattern (RA01) confirmed across all three architectures. Within-family scaling added on Qwen-2.5 at 3B, 7B, and 14B. Base vs. instruct comparison on Llama-3.2-3B, providing a mechanistic substrate for the echo chamber hypothesis (Zhao et al., 2025). Dual-pathway intervention pipeline (o_proj + mlp.down_proj) documented. Operational definitions section added (§3.3). Role classification operationalized (§3.2). Bidirectional gating finding added. Prior-art integration: Lambert (2026), Zhao et al. (2025), Arditi et al., Piras et al., Templeton et al., Anthropic emotions paper. Original Zenodo publication: April 2026 (v4, DOI 10.5281/zenodo.19484997).
We introduce Contrastive Behavioral Topology Scanning (CBTS), an inference-only per-head attribution and intervention framework that maps behavioral architecture in RLHF-trained transformers across 131 behavioral and 19 structural directions. The framework generalizes across behavioral dimensions; we validate it primarily through safety behavior, where a striking failure mode emerges: aggregate safety benchmarks can report increased safety after a model's risk-assessment gate has been surgically removed. On Qwen-2.5-14B, removing the RA01 gate causes the automated scorer to report an 8.3-percentage-point increase in refusal firmness, while mid-tier harmful prompts shift from refusal to compliance. The effect arises because S01 (decision gate) and RA01 (risk gate) share the same attention head at every late layer; perturbing RA01 partially reconstructs the S01 signal through their shared substrate, producing a false-positive safety signal that conventional aggregate benchmarks cannot distinguish from genuine improvement.
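The failure mode described above, where an aggregate score rises while one prompt tier flips from refusal to compliance, can be illustrated with a minimal sketch. Everything in it is a hypothetical toy: the tier names, the record layout, and all numbers are invented for illustration and are not the paper's data, scorer, or evaluation harness.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records, NOT data from the paper:
# (prompt tier, model refused?, scorer's firmness in [0, 1]).
BEFORE = [("severe", True, 0.6)] * 4 + [("mid", True, 0.5)] * 2
AFTER = [("severe", True, 0.9)] * 4 + [("mid", False, 0.0)] * 2


def aggregate_firmness(records):
    """Single benchmark number: mean firmness over all prompts."""
    return mean(score for _, _, score in records)


def refusal_rate_by_tier(records):
    """Stratified view: fraction of prompts refused within each tier."""
    by_tier = defaultdict(list)
    for tier, refused, _ in records:
        by_tier[tier].append(refused)
    return {tier: sum(flags) / len(flags) for tier, flags in by_tier.items()}


if __name__ == "__main__":
    print("aggregate before:", aggregate_firmness(BEFORE))
    print("aggregate after: ", aggregate_firmness(AFTER))
    print("per-tier before: ", refusal_rate_by_tier(BEFORE))
    print("per-tier after:  ", refusal_rate_by_tier(AFTER))
```

In this toy, the aggregate firmness goes up even though every mid-tier prompt stops being refused, which is why a stratified report is needed to expose the regression.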
We demonstrate this benchmark-insensitive gate effect across five RLHF-trained transformer configurations: Qwen-2.5 at 3B, 7B, and 14B, Phi-3.5-Mini, and Llama-3.2-3B. Beyond safety, CBTS produces intervention prescriptions with quantified side effects along any behavioral axis in the catalog. The refusal-gate modification that collapses safety firmness by 57.5 percentage points produces only a 0.83-percentage-point change on an independent emotional-expression battery (a 69× on-target to off-target ratio), demonstrating that behavioral dimensions are substantially separable under targeted modification. We characterize three late-stage safety gates (decision S01, risk-assessment RA01, moral-evaluation MR05) and document that the third gate's interaction with the first two produces five distinct outcomes (independent removal, hedge injection, shared-substrate regression, compensatory resistance, gate-layer destabilization) determined by per-head routing topology rather than parameter count alone. Cross-architecture analysis identifies six consistent regularities of RLHF-trained transformers: zone architecture, two-stage content/gate separation, terminal-layer convergence, RLHF organizational fingerprint, L0→L1 stability boundary, and late-sparse density gradient. Within-family cross-scale validation confirms all 131 behavioral directions remain active at every scale (zero dropouts across a 4.7× parameter range), structural amplification increases monotonically (19× → 27.4× → 37.0×), and intervention-confirmed gate architecture is preserved. These findings have immediate implications for mechanistic interpretability, independent model auditing, evaluation methodology, and the design of behavioral safeguards in deployed systems.
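The headline figures quoted above can be checked with simple arithmetic. The sketch below uses only numbers stated in the abstract. The variable names are illustrative, and two things are assumptions: that the three amplification values are listed in order of model scale (3B, 7B, 14B), and that the parameter range is computed from the nominal sizes.

```python
# Figures quoted in the abstract (percentage points).
on_target_pp = 57.5    # collapse in safety firmness
off_target_pp = 0.83   # change on the emotional-expression battery

# On-target to off-target selectivity ratio, reported as 69x.
selectivity = on_target_pp / off_target_pp

# Structural amplification, assumed to be ordered 3B -> 7B -> 14B.
amplification = [19.0, 27.4, 37.0]
is_monotonic = all(a < b for a, b in zip(amplification, amplification[1:]))

# Parameter range from nominal sizes (assumption: 14B / 3B).
param_range = 14 / 3

if __name__ == "__main__":
    print(f"selectivity ratio: {selectivity:.1f}x")
    print(f"amplification monotonic: {is_monotonic}")
    print(f"parameter range: {param_range:.1f}x")
```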
Michael Cray
Christopher Schmidt
McKing Consulting (United States)
Cray et al. studied this question.
www.synapsesocial.com/papers/69e8677e6e0dea528ddeba25 — DOI: https://doi.org/10.5281/zenodo.19665052