April 22, 2026 · Open Access

Contrastive Behavioral Topology Scanning: Per-Head Attribution and Intervention-Based Analysis of Behavioral Structure in RLHF Transformers

Key Points

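The study introduces Contrastive Behavioral Topology Scanning (CBTS), an inference-only framework for analyzing behavioral structure in RLHF-trained transformers through per-head attribution and intervention.
Per-head attribution was carried out at full depth on three model families: Qwen-2.5-3B, Phi-3.5-Mini, and Llama-3.2-3B.
Interventions confirmed a benchmark-insensitive gate pattern (RA01) in all three architectures.
Cross-architecture analysis identified six consistent regularities of RLHF-trained transformers.
On Qwen-2.5-14B, removing the RA01 risk-assessment gate led the automated scorer to report an 8.3-percentage-point increase in refusal firmness, even though mid-tier harmful prompts shifted from refusal to compliance.
Interactions between the third safety gate (MR05) and the first two produce five distinct outcomes, determined by per-head routing topology rather than parameter count alone.
Behavioral dimensions are substantially separable: a refusal-gate modification that lowered safety firmness by 57.5 percentage points changed an independent emotional-expression battery by only 0.83 percentage points.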
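As a rough illustration of what per-head attribution along a contrastive direction can look like, the sketch below builds a difference-of-means direction from two sets of activations and projects each head's output onto it. This is a generic technique and not necessarily the procedure CBTS uses; the function names, the head labels, and the toy vectors are all hypothetical.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [fmean(col) for col in zip(*rows)]

def contrastive_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean of neg_acts to the mean of pos_acts."""
    diff = [p - n for p, n in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def head_attribution(head_outputs, direction):
    """Dot product of each head's output vector with the direction."""
    return {
        head: sum(h * d for h, d in zip(vec, direction))
        for head, vec in head_outputs.items()
    }

# Toy, hand-made activations (hypothetical; not data from the study).
pos = [[1.0, 0.0, 2.0], [1.2, 0.1, 1.8]]  # prompts that elicit the behavior
neg = [[1.0, 0.0, 0.0], [1.2, 0.1, 0.2]]  # matched prompts that do not
direction = contrastive_direction(pos, neg)

# Hypothetical per-head output vectors at one layer.
heads = {"L10.H3": [0.1, 0.0, 1.5], "L10.H7": [0.9, 0.2, 0.0]}
scores = head_attribution(heads, direction)
top_head = max(scores, key=lambda h: abs(scores[h]))
```

In this toy setup the two prompt sets differ only along the third coordinate, so the head whose output lies along that coordinate receives the largest attribution.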

Abstract

Version 7 (April 2026): major update superseding v4. Full-depth per-head attribution analysis extended to all three model families (Qwen-2.5-3B, Phi-3.5-Mini, Llama-3.2-3B). Intervention validation with benchmark-insensitive gate pattern (RA01) confirmed across all three architectures. Within-family scaling added on Qwen 2.5 at 3B, 7B, and 14B. Base vs. instruct comparison on Llama-3.2-3B providing mechanistic substrate for the echo chamber hypothesis (Zhao et al., 2025). Dual-pathway intervention pipeline (o_proj + mlp.down_proj) documented. Operational definitions section added (§3.3). Role classification operationalized (§3.2). Bidirectional gating finding added. Prior-art integration: Lambert (2026), Zhao et al. (2025), Arditi et al., Piras et al., Templeton et al., Anthropic emotions paper. Original Zenodo publication: April 2026 (v4, DOI 10.5281/zenodo.19484997).

We introduce Contrastive Behavioral Topology Scanning (CBTS), an inference-only per-head attribution and intervention framework that maps behavioral architecture in RLHF-trained transformers across 131 behavioral and 19 structural directions. The framework generalizes across behavioral dimensions; we validate it primarily through safety behavior, where a striking failure mode emerges: aggregate safety benchmarks can report increased safety after a model's risk-assessment gate has been surgically removed. On Qwen-2.5-14B, removing the RA01 gate causes the automated scorer to report an 8.3-percentage-point increase in refusal firmness, while mid-tier harmful prompts shift from refusal to compliance. The effect arises because S01 (decision gate) and RA01 (risk gate) share the same attention head at every late layer; perturbing RA01 partially reconstructs S01 signal through their shared substrate, producing a false-positive safety signal that conventional aggregate benchmarks cannot distinguish from genuine improvement.
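To see how an aggregate score can rise while one prompt tier gets worse, the following toy calculation compares a pooled mean with per-tier means. It only illustrates the general masking effect of aggregation, not the shared-head mechanism described above. Every number is invented for illustration and is not taken from the study; the tier names and helper functions are likewise hypothetical.

```python
from statistics import fmean

def aggregate(scores_by_tier):
    """Mean refusal score over all prompts, ignoring tier."""
    return fmean(s for tier in scores_by_tier.values() for s in tier)

def per_tier(scores_by_tier):
    """Mean refusal score reported separately for each tier."""
    return {name: fmean(tier) for name, tier in scores_by_tier.items()}

# Hypothetical refusal scores in [0, 1]; NOT results from the paper.
before = {"low": [0.2, 0.2, 0.2, 0.2], "mid": [0.8, 0.8], "high": [0.9, 0.9]}
after = {"low": [0.6, 0.6, 0.6, 0.6], "mid": [0.3, 0.3], "high": [0.9, 0.9]}

# Pooled change across all prompts versus change within each tier.
agg_delta = aggregate(after) - aggregate(before)
tier_delta = {
    name: per_tier(after)[name] - per_tier(before)[name] for name in before
}
```

With these made-up scores the pooled mean increases while the mid tier drops sharply, which is why a per-tier breakdown is a useful complement to a single aggregate number.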
We demonstrate this benchmark-insensitive gate effect across five RLHF-trained transformer configurations: Qwen-2.5 at 3B, 7B, and 14B, Phi-3.5-Mini, and Llama-3.2-3B. Beyond safety, CBTS produces intervention prescriptions with quantified side effects along any behavioral axis in the catalog. The refusal-gate modification that collapses safety firmness by 57.5 percentage points produces only a 0.83-percentage-point change on an independent emotional-expression battery (a 69× on-target to off-target ratio), demonstrating that behavioral dimensions are substantially separable under targeted modification.

We characterize three late-stage safety gates (decision S01, risk-assessment RA01, moral-evaluation MR05) and document that the third gate's interaction with the first two produces five distinct outcomes (independent removal, hedge injection, shared-substrate regression, compensatory resistance, gate-layer destabilization) determined by per-head routing topology rather than parameter count alone. Cross-architecture analysis identifies six consistent regularities of RLHF-trained transformers: zone architecture, two-stage content/gate separation, terminal-layer convergence, RLHF organizational fingerprint, L0→L1 stability boundary, and late-sparse density gradient.

Within-family cross-scale validation confirms all 131 behavioral directions remain active at every scale (zero dropouts across a 4.7× parameter range), structural amplification increases monotonically (19× → 27.4× → 37.0×), and intervention-confirmed gate architecture is preserved. These findings have immediate implications for mechanistic interpretability, independent model auditing, evaluation methodology, and the design of behavioral safeguards in deployed systems.
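The headline ratios in the abstract can be checked with simple arithmetic. The snippet below is only a sanity check on the quoted figures; the helper name `selectivity_ratio` is hypothetical, and the parameter range assumes the nominal model sizes of 3B and 14B.

```python
def selectivity_ratio(on_target_pp, off_target_pp):
    """Ratio of on-target to off-target change, both in percentage points."""
    return abs(on_target_pp) / abs(off_target_pp)

# Figures quoted in the abstract.
ratio = selectivity_ratio(57.5, 0.83)  # roughly 69.3, reported as 69x

# Assumes nominal sizes of 14B and 3B parameters.
param_range = 14 / 3                   # roughly 4.67, reported as 4.7x
```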

Authors

Michael Cray

Christopher Schmidt

Institutions

McKing Consulting (United States)
