Status: NeurIPS 2026 submission under double-blind review. Author identity anonymized.

WhyLab is a three-component causal safety audit for self-improving LLM agents: C1 information-theoretic drift detection, C2 E-value × partial-R² sensitivity filter, C3 Lyapunov-bounded damping. A cross-model evaluation on six LLM families (Gemini 2.0/2.5 Flash, GPT-4o-mini, Llama 3:8b, Llama 3.1:8b, Dolphin-Llama3:8b) first showed that the audit effect is strongly model-conditional.

v4 addition: per-model threshold calibration. A simple ratio-to-target rule (target rejection rate 2.0 per trajectory, with E_min clamped to [0.8, 4.0]) was applied to 4 LLM families. Results: accuracy harm reduced on 4/4 models, regression harm reduced on 3/4. Llama 3.1:8b rejection rate 0.5→2.0 (exact target); Llama 3:8b regression delta +0.30→0.00 (neutral); Dolphin-Llama3:8b regression delta +0.10→-0.10 (sign flip to a small reduction); GPT-4o-mini regression halved (+0.80→+0.40).

Contributions: (1) an instability phase diagram for self-improving LLM agents (384+32 conditions); (2) a six-model cross-family evaluation showing that audit transfer is not automatic; (3) per-model threshold calibration as a concrete mechanism that reduces accuracy harm on 4/4 tested models and regression harm on 3/4.

Change log (v4 vs v3): new calibration subsection (4.5), new Table 2 (fixed vs calibrated), new Figure 3 (bar plots across 4 models), Related Work expanded with cross-model reproducibility framing, Conclusion updated to reflect 5 findings.
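As an illustration of what a ratio-to-target rule of this kind might look like, here is a minimal sketch. The abstract gives only the target rejection rate (2.0 per trajectory) and the clamp range for E_min (0.8 to 4.0). Everything else is an assumption made for illustration and not the authors' implementation: the function name `calibrate_e_min`, the multiplicative update, the assumed direction (a higher E_min rejects more updates), and the handling of a zero observed rate.

```python
def calibrate_e_min(e_min, observed_rate, target_rate=2.0, lo=0.8, hi=4.0):
    """Hypothetical ratio-to-target update for a per-model E_min threshold.

    Assumption: the rejection rate grows with E_min, so the threshold is
    scaled by target_rate / observed_rate and then clamped to [lo, hi].
    Only target_rate, lo, and hi come from the abstract.
    """
    if observed_rate <= 0:
        # Assumption: with no rejections observed, move to the strictest
        # threshold the clamp allows.
        return hi
    scaled = e_min * (target_rate / observed_rate)
    return min(hi, max(lo, scaled))


# Illustrative values only; these starting thresholds are not from the paper.
print(calibrate_e_min(1.0, 1.0))  # rate at half the target: threshold doubles
print(calibrate_e_min(1.5, 0.5))  # scaled value exceeds the clamp: capped at hi
print(calibrate_e_min(2.0, 8.0))  # rate far above target: floored at lo
```

Under these assumptions a single pilot run per model is enough to set the threshold, and the clamp keeps a noisy rate estimate from pushing E_min to an extreme value.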
Anonymous Author
American Foundation for the Blind
Anonymous Author (Wed,) studied this question.
www.synapsesocial.com/papers/69eb0a2e553a5433e34b45bc — DOI: https://doi.org/10.5281/zenodo.19688412