April 18, 2026 | Open Access

Contrastive Behavioral Topology Scanning (CBTS): Mapping Behavioral Architecture in RLHF-Trained Transformers

Key Points

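Aimed to characterize the structured behavioral tendencies of transformers trained with reinforcement learning from human feedback (RLHF) and to develop a method for analyzing them mechanistically.
Developed Contrastive Behavioral Topology Scanning (CBTS), which discovers, attributes, and classifies individual attention heads into functional roles using only inference-time access to activations.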
Classified behavioral tendencies across multiple model families (Qwen, Phi, and Llama).
Validated interventions with controlled gate patterns and safety behavior metrics.
Conducted extensive mapping of behavioral directions and roles within the model architectures.
Identified 131 behavioral directions across 17 functional categories.
Confirmed the presence of multiple independent gates controlling safety behavior.
Demonstrated separate behavioral dimensions with quantified impact on refusal firmness and emotional expression.
Established six invariants applicable across different architectures.

Abstract

Version 6.7 (April 2026): major update superseding v4. Full-depth per-head attribution analysis extended to all three model families (Qwen-2.5-3B, Phi-3.5-Mini, Llama-3.2-3B). Intervention validation with benchmark-insensitive gate pattern (RA01) confirmed across all three architectures. Within-family scaling added on Qwen 2.5 at 3B, 7B, and 14B (14B re-run pending). Base vs. instruct comparison on Llama-3.2-3B providing mechanistic substrate for the echo chamber hypothesis (Zhao et al., 2025). Dual-pathway intervention pipeline (o_proj + mlp.down_proj) documented. Operational definitions section added (§3.3). Role classification operationalized (§3.2). Bidirectional gating finding added. Prior-art integration: Lambert (2026), Zhao et al. (2025), Arditi et al., Piras et al., Templeton et al., Anthropic emotions paper. Original Zenodo publication: April 2026 (v4, DOI 10.5281/zenodo.19484997).

Transformer language models trained via reinforcement learning from human feedback (RLHF) exhibit a wide range of structured behavioral tendencies (including refusal, hedging, emotional expression, verbosity control, authority deference, and identity assertion) whose internal implementation has remained opaque at the resolution required for mechanistic analysis or targeted intervention. We introduce Contrastive Behavioral Topology Scanning (CBTS), a method that discovers, attributes, and classifies individual attention heads into functional roles across the full behavioral repertoire of a model, using only inference-time access to activations. We catalog 131 behavioral directions spanning 17 functional categories and 19 structural directions measuring representational geometry, for a total of 150 independently characterized dimensions, and apply CBTS to three model families (Qwen-2.5-3B, Phi-3.5-Mini, Llama-3.2-3B).
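As a rough illustration of what a contrastive behavioral direction could look like, the sketch below computes a unit difference-of-means vector between activations collected on prompts that exhibit a behavior and prompts that do not. This is an assumption made for illustration only: the abstract does not state how CBTS computes its directions, and the function names (`contrastive_direction`, `projection`) and the toy activations are hypothetical.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean 'negative' activation to the mean
    'positive' activation (difference of means, an assumed technique)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def projection(vec, direction):
    """Scalar projection of one activation vector onto a unit direction."""
    return sum(v * d for v, d in zip(vec, direction))


# Toy 3-dimensional activations (hypothetical, not data from the paper).
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]  # prompts showing the behavior
neg = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]  # matched prompts without it

direction = contrastive_direction(pos, neg)
score = projection([5.0, 7.0, 9.0], direction)
print(direction, score)
```

In this toy case the two sets differ only along the first coordinate, so the recovered direction is the first axis and the projection reads off that coordinate. A real pipeline would take the activations from specific attention heads and layers rather than from hand-written lists.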
From the resulting per-head attribution maps we derive a prescription-based intervention framework supporting both runtime behavioral configuration via KV cache compilation and permanent weight modification, with quantified side effects and confirmed reversibility. We validate the framework primarily through a worked example on safety behavior, because safety provides the cleanest quantitative instrumentation available: a standardized 5-tier refusal firmness battery with clear pass/fail scoring. On Qwen-2.5-3B, CBTS reveals that behavioral control in RLHF-trained transformers is implemented through multiple architecturally independent gates, and we characterize three: a decision gate (S01), an "invisible" risk assessment gate (RA01) whose removal leaves aggregate benchmarks unchanged while collapsing firmness on the most dangerous content tier from 41.7% to 12.5%, and a moral evaluation gate (MR05). We demonstrate that the framework is not specific to safety behavior through cross-dimensional measurements: S01 modification produces a 57.5-percentage-point change on refusal firmness but only a 0.83-point change on an independent emotional expression battery measured on the same model, while modification of a non-safety direction (SM01, identity assertion) produces a 7.5-point emotional expression increase. Behavioral dimensions are architecturally separable and independently addressable. Cross-architecture validation establishes six invariants (zone architecture, two-stage content/gate separation, terminal-layer convergence, RLHF organizational fingerprint, L0→L1 stability boundary, and late-sparse density gradient) as general properties of RLHF-trained transformers rather than single-pipeline artifacts. We discuss implications for mechanistic interpretability, independent model auditing, and behavioral configuration of deployed systems, with safety benchmarking as one of several application domains.
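The effect sizes quoted above can be checked with simple percentage-point arithmetic. The sketch below uses only figures stated in the abstract; the helper names and the selectivity ratio are illustrative assumptions, not metrics defined by the authors.

```python
def point_change(before, after):
    """Change in percentage points between two percentage scores."""
    return round(after - before, 2)


def selectivity(target_change, off_target_change):
    """Ratio of the on-target effect to the off-target effect (hypothetical metric)."""
    return abs(target_change) / abs(off_target_change)


# RA01 removal: firmness on the most dangerous tier falls from 41.7% to 12.5%.
ra01_change = point_change(41.7, 12.5)
ra01_relative = round(abs(ra01_change) / 41.7 * 100, 1)

# S01 modification: 57.5 points on refusal firmness vs. 0.83 points on
# the independent emotional expression battery.
s01_selectivity = round(selectivity(57.5, 0.83), 1)

print(ra01_change)      # -29.2 (percentage points)
print(ra01_relative)    # 70.0 (percent of the original firmness lost)
print(s01_selectivity)  # 69.3 (on-target effect relative to off-target effect)
```

Read this way, removing RA01 costs about 29 points on the most dangerous tier, roughly 70% of the original firmness, while the S01 modification moves refusal firmness about 69 times more than it moves emotional expression. That gap is the basis for the abstract's claim that the two dimensions are separable.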

Authors

Michael Cray

Christopher Schmidt

Institutions

McKing Consulting (United States)
