From Monitoring to Intervention: Control-Theoretic Coherence Management in Transformers and the Limits of Discrete Safety Enforcement

Author: Kentaro Sato (Independent Researcher)

Summary

Large language models fail in characteristic ways -- repetition loops, hallucination, and context loss -- yet most monitoring and alignment approaches treat these as unrelated problems. This paper introduces Recync, a control framework that unifies all three failure modes in a single three-dimensional order-parameter space Z(t) = (lambda, lambda_sem, z), representing temporal synchrony, semantic coherence, and structural persistence. A non-invasive projection (Phi-mapping) extracts this state from Transformer internals at runtime -- reading attention weights, residual-stream activations, and the KV cache -- without modifying model weights or requiring additional forward passes. The dynamics of Z(t) are governed by a Ginzburg-Landau potential with provable stability, and safety is enforced through stochastic Control Barrier Functions (CBFs) solved via quadratic programming. A Psi-mapping translates abstract control commands into operational parameter adjustments (temperature, top-p, hidden-state corrections) for closed-loop intervention. The paper then systematically tests how far this framework can push token-level intervention, establishing both its capabilities and its structural limits through a 69-experiment campaign.

Theoretical Framework

The framework consists of five components:

- State space and Phi-mapping: Three order parameters -- temporal synchrony lambda(t) from attention weights, semantic coherence lambda_sem(t) from residual-stream cosine similarity, and structural persistence z(t) from KV-cache autocorrelation -- are extracted at each generation step without model modification.
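The order-parameter extraction can be illustrated with a minimal sketch. The exact estimators are assumptions for illustration -- the summary does not give formulas -- here lambda_sem is taken as the mean cosine similarity of consecutive residual-stream vectors, and z as the lag-1 autocorrelation of a scalar KV-cache summary series (e.g. per-step key norms):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def lambda_sem(hidden_states):
    """Semantic coherence: mean cosine similarity of consecutive
    residual-stream vectors, one vector per generation step.
    (Illustrative estimator, not the paper's exact definition.)"""
    sims = [cosine(u, v) for u, v in zip(hidden_states, hidden_states[1:])]
    return sum(sims) / len(sims)

def z_persistence(kv_series, lag=1):
    """Structural persistence: lag-1 autocorrelation of a scalar
    KV-cache summary series. (Illustrative estimator.)"""
    n = len(kv_series)
    mean = sum(kv_series) / n
    var = sum((x - mean) ** 2 for x in kv_series)
    if var == 0:
        return 1.0  # constant series: perfectly persistent
    cov = sum((kv_series[t] - mean) * (kv_series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var
```

Both readings are non-invasive in the sense used above: they consume activations the forward pass already produces, so no extra forward pass is needed.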
- Ginzburg-Landau dynamics: The evolution of Z(t) is governed by a phenomenological potential U(Z; Theta) with formal stability guarantees (Theorem 1), following the precedent of GL theory in physics -- phenomenological yet predictive.
- CBF safety control: A stochastic Control Barrier Function enforces safety constraints with minimum intervention, solved at each step via quadratic programming.
- Psi-mapping: Translates control commands from the abstract order-parameter space back into operational adjustments (temperature scaling, nucleus threshold, hidden-state steering vectors). Achieves consistency error below 1e-5 across 1,738 control steps.
- Dual-channel intervention: Sampling-parameter modulation (temperature/top-p) for mild corrections, and orthogonal-projection steering of the residual stream for direct hidden-state intervention.

Experimental Campaign

69 experiments across six phases, totaling approximately 15,000 paired generation runs on three model architectures:

Phase | Experiments | Focus
I. Mapping validation | 01-08 | Phi-mapping extraction, failure-mode separation, Psi-mapping consistency
II. Initial CBF control | 09-18 | Intervention frequency, threshold types, time-scale dependence
III. Systematic limits | 19-42 | Harm-threshold discovery, parameter-space boundary mapping
IV. Residual-stream steering | 43-52 | Orthogonal projection, cascade-failure analysis, attractor-switch mechanism
V. Severity-adaptive control | 53-63 | Three-region hysteresis controller, recovery gate, pooled significance
VI. Cross-model validation | 64-69 | GPT-2 Medium (355M), Pythia-160M (160M), transferability analysis

Models tested: GPT-2 Small (117M), GPT-2 Medium (355M), Pythia-160M (160M)
Infrastructure: Apple Silicon (MPS), 16GB RAM, PyTorch

Key Results

Result 1 -- Robust monitoring signal (immediately deployable): The Phi-mapping separates failure modes with very large effect sizes on the primary model (GPT-2 Small):

Comparison | lambda_sem | z
ANOVA (4 categories, N=400) | F=71.76, p=4.43e-37 | F=44.11, p=1.29e-24
Normal vs Fragmentation | d=1.833, p=3.27e-28 | d=1.217, p=2.39e-15
Normal vs Hallucination | d=1.364, p=2.79e-18 | d=0.877, p=3.15e-09

This signal is immediately deployable as a runtime health indicator on any Transformer that exposes attention weights and hidden states.

Result 2 -- Structural limits of discrete token-level CBF control: Systematic experiments reveal three constraints not predicted by continuous-time theory:

- A harm threshold in intervention frequency (10-24 interventions per run) above which control degrades performance
- The necessity of relative over absolute triggering thresholds
- Time-scale dependence requiring re-tuning when generation length changes

The optimal sampling-level configuration achieves harm-neutral control but no positive effect, establishing these as the reachable limits of parameter-space intervention.

Result 3 -- Residual-stream steering and the semantic attractor switch: Direct hidden-state steering via orthogonal projection achieves the first statistically significant positive effect (d = 0.712, p = 0.006). However, replication reveals seed-dependent cascade failures: small corrections to the hidden state produce tiny logit shifts that, under stochastic sampling, select different tokens and push generation into entirely different semantic attractor basins within 2-3 steps. This newly identified mechanism -- the semantic attractor switch -- explains why fixed-parameter interventions face a fundamental tradeoff.

Result 4 -- Severity-adaptive control as Pareto improvement: A three-region hysteresis controller modulates temperature as a function of crisis severity (free sampling at low severity, greedy decoding at high severity, linear interpolation between). This achieves simultaneous positive effects across both vulnerable and resilient seed groups (d = +0.182 and d = +0.522 respectively) with a significant temperature-improvement correlation (r = -0.382, p = 0.002).
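The three-region temperature schedule can be sketched as follows. The thresholds s_lo, s_hi and the base temperature are illustrative assumptions, not the paper's calibrated values, and the hysteresis element (separate rise/fall thresholds to prevent region chattering) is omitted for brevity:

```python
def severity_temperature(severity, s_lo=0.3, s_hi=0.8, t_free=1.0):
    """Three-region schedule: free sampling below s_lo, greedy
    decoding (T -> 0) above s_hi, linear interpolation between.
    Thresholds and base temperature are illustrative assumptions."""
    if severity <= s_lo:
        return t_free                 # low severity: sample freely
    if severity >= s_hi:
        return 0.0                    # high severity: greedy decoding
    frac = (severity - s_lo) / (s_hi - s_lo)
    return t_free * (1.0 - frac)      # linear ramp between regions
```

The design point is that temperature tracks the crisis, rather than being fixed per run: mild deviations keep full sampling diversity, while severe ones collapse the distribution toward the argmax token.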
No fixed temperature produces this Pareto improvement.

Result 5 -- Recovery gate achieves first significant pooled effect: Analysis reveals that not all detected crises require intervention -- some are transient fluctuations from which the model self-recovers. A trend-based recovery gate that detects rising coherence within crisis windows correctly skips 44% of interventions, rescuing a previously negative seed group (d: -0.287 to +0.117) and converting the overall result from non-significant to significant: d = +0.211, p = 0.037, N = 180.

Cross-model validation: Detection generalizes across all three architectures. Intervention requires model-specific severity calibration -- GPT-2 Small thresholds push 67% of GPT-2 Medium crises into the highest severity bucket, suppressing natural recovery. Bucket-level analysis shows intervention is effective when severity is correctly calibrated (Pythia MED-bucket d = +1.296, p = 0.034).

Primary Contribution

A decomposition of the intervention problem into three independent components:

- Phi-mapping -- a robust, model-agnostic monitoring signal (validated across 69 experiments and three architectures)
- Severity-adaptive control -- determines how strongly to intervene
- Recovery gate -- determines whether to intervene at all

This decomposition resolves the detection-intervention asymmetry that dominates the experimental record. The modest intervention effect sizes (d = +0.211) despite robust detection (d > 1.3) motivate a fundamentally different approach at response granularity, developed in the companion paper.

Companion Paper

Beyond Micro-Control: Response-Level Checkpoint Restart for Safe Coherence Recovery in Transformers -- which achieves d = +0.494 to +1.020 with zero iatrogenic harm by shifting from token-level to response-level intervention.

Resources

Repository: github.com/metaSATOKEN/Recyncframework -- full source, test suite, and scripts to reproduce all figures and tables
License: CC BY 4.0 (paper), Apache 2.0 (code)

Keywords: LLM safety, control barrier functions, order parameters, Transformer monitoring, coherence dynamics, residual stream steering, severity-adaptive control
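As a concrete illustration of the recovery gate in Result 5, a trend test over the recent coherence values could look like the following minimal sketch. The least-squares slope estimator and the zero slope threshold are assumptions for illustration; the summary only states that the gate detects rising coherence within a crisis window:

```python
def recovery_gate(coherence_window, min_slope=0.0):
    """Trend-based recovery gate: return True (skip intervention)
    when coherence is rising inside the crisis window, estimated by
    a least-squares slope over recent lambda_sem values.
    min_slope is an illustrative threshold, not the paper's value."""
    n = len(coherence_window)
    if n < 2:
        return False  # too short to estimate a trend: intervene
    x_mean = (n - 1) / 2
    y_mean = sum(coherence_window) / n
    num = sum((x - x_mean) * (y - y_mean)
              for x, y in enumerate(coherence_window))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den
    return slope > min_slope
```

A gate of this shape realizes the decomposition above: severity-adaptive control decides how strongly to intervene, while the gate decides whether to intervene at all.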
www.synapsesocial.com/papers/69cf5ecb5a333a821460d661 — DOI: https://doi.org/10.5281/zenodo.19148449