Version 6.7 (April 2026): major update superseding v4. Full-depth per-head attribution analysis extended to all three model families (Qwen-2.5-3B, Phi-3.5-Mini, Llama-3.2-3B). Intervention validation with benchmark-insensitive gate pattern (RA01) confirmed across all three architectures. Within-family scaling added on Qwen 2.5 at 3B, 7B, and 14B (14B re-run pending). Base vs. instruct comparison on Llama-3.2-3B providing mechanistic substrate for the echo chamber hypothesis (Zhao et al., 2025). Dual-pathway intervention pipeline (o_proj + mlp.down_proj) documented. Operational definitions section added (§3.3). Role classification operationalized (§3.2). Bidirectional gating finding added. Prior-art integration: Lambert (2026), Zhao et al. (2025), Arditi et al., Piras et al., Templeton et al., Anthropic emotions paper. Original Zenodo publication: April 2026 (v4, DOI 10.5281/zenodo.19484997).
Transformer language models trained via reinforcement learning from human feedback (RLHF) exhibit a wide range of structured behavioral tendencies, including refusal, hedging, emotional expression, verbosity control, authority deference, and identity assertion, whose internal implementation has remained opaque at the resolution required for mechanistic analysis or targeted intervention. We introduce Contrastive Behavioral Topology Scanning (CBTS), a method that discovers, attributes, and classifies individual attention heads into functional roles across the full behavioral repertoire of a model, using only inference-time access to activations. We catalog 131 behavioral directions spanning 17 functional categories and 19 structural directions measuring representational geometry, for a total of 150 independently characterized dimensions, and apply CBTS to three model families (Qwen-2.5-3B, Phi-3.5-Mini, Llama-3.2-3B).
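To make the idea of contrastive per-head attribution concrete, the sketch below scores each attention head by how well its activations separate two contrasting prompt sets. It is a generic difference-of-means illustration, not the CBTS procedure itself, which the abstract does not specify. The data layout (a dict keyed by hypothetical `(layer, head)` tuples, each holding a list of activation vectors) and all function names are assumptions made for this example.

```python
import math
from statistics import mean, pvariance

EPS = 1e-12  # guards against division by zero


def contrastive_direction(pos, neg):
    """Unit vector along the difference of class means for one head."""
    dim = len(pos[0])
    diff = [mean(v[i] for v in pos) - mean(v[i] for v in neg) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / (norm + EPS) for x in diff]


def separation_score(pos, neg):
    """Gap between projected class means, in units of pooled std."""
    direction = contrastive_direction(pos, neg)
    p = [sum(a * b for a, b in zip(v, direction)) for v in pos]
    n = [sum(a * b for a, b in zip(v, direction)) for v in neg]
    pooled_std = math.sqrt(0.5 * (pvariance(p) + pvariance(n)))
    return (mean(p) - mean(n)) / (pooled_std + EPS)


def rank_heads(pos_acts, neg_acts):
    """Sort (layer, head) keys by separation score, strongest first.

    pos_acts / neg_acts: dict mapping (layer, head) -> list of vectors
    (a hypothetical layout chosen for this sketch).
    """
    scores = {h: separation_score(pos_acts[h], neg_acts[h]) for h in pos_acts}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because each head is projected onto its own difference-of-means direction, the score is non-negative by construction, and heads with no real effect still receive small positive values. Any practical use of a score like this would therefore need a null baseline, such as shuffled labels, before assigning heads to roles.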
From the resulting per-head attribution maps we derive a prescription-based intervention framework supporting both runtime behavioral configuration via KV cache compilation and permanent weight modification, with quantified side effects and confirmed reversibility. We validate the framework primarily through a worked example on safety behavior, because safety provides the cleanest quantitative instrumentation available: a standardized 5-tier refusal firmness battery with clear pass/fail scoring. On Qwen-2.5-3B, CBTS reveals that behavioral control in RLHF-trained transformers is implemented through multiple architecturally independent gates, and we characterize three: a decision gate (S01), an "invisible" risk assessment gate (RA01) whose removal leaves aggregate benchmarks unchanged while collapsing firmness on the most dangerous content tier from 41.7% to 12.5%, and a moral evaluation gate (MR05). We demonstrate that the framework is not specific to safety behavior through cross-dimensional measurements: S01 modification produces a 57.5-percentage-point change on refusal firmness but only a 0.83-point change on an independent emotional expression battery measured on the same model, while modification of a non-safety direction (SM01, identity assertion) produces a 7.5-point emotional expression increase. Behavioral dimensions are architecturally separable and independently addressable. Cross-architecture validation establishes six invariants (zone architecture, two-stage content/gate separation, terminal-layer convergence, RLHF organizational fingerprint, L0→L1 stability boundary, and late-sparse density gradient) as general properties of RLHF-trained transformers rather than single-pipeline artifacts. We discuss implications for mechanistic interpretability, independent model auditing, and behavioral configuration of deployed systems, with safety benchmarking as one of several application domains.
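The headline figures above can be reduced to two simple quantities: how much larger the S01 effect is on its target battery than on the off-target one, and how large the RA01 tier collapse is in absolute and relative terms. The short calculation below uses only numbers quoted in the abstract. The function name and the choice of a plain ratio as a "selectivity" summary are assumptions made for illustration, not metrics defined by the authors, and the ratio assumes the two batteries are scored on comparable scales.

```python
def summarize_reported_effects():
    """Derive summary ratios from figures quoted in the abstract (Qwen-2.5-3B)."""
    s01_refusal_change = 57.5   # percentage points, refusal firmness battery
    s01_emotion_change = 0.83   # points, emotional expression battery
    tier_before = 41.7          # % firmness, most dangerous tier, before RA01 removal
    tier_after = 12.5           # % firmness, same tier, after RA01 removal

    # On-target vs. off-target effect of the S01 modification.
    selectivity = s01_refusal_change / s01_emotion_change

    # Absolute and relative collapse on the most dangerous tier.
    drop_points = tier_before - tier_after
    drop_relative = drop_points / tier_before

    return selectivity, drop_points, drop_relative


selectivity, drop_points, drop_relative = summarize_reported_effects()
print(f"S01 on-target / off-target ratio: {selectivity:.1f}")
print(f"RA01 tier drop: {drop_points:.1f} points ({drop_relative:.0%} relative)")
```

Read this way, the S01 modification moves its target battery roughly 69 times more than the emotional expression battery, and removing RA01 eliminates about 70% of the original firmness on the most dangerous tier.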
Michael Cray
Christopher Schmidt
McKing Consulting (United States)
www.synapsesocial.com/papers/69e320e740886becb65400da — DOI: https://doi.org/10.5281/zenodo.19615045