Status: NeurIPS 2026 submission under double-blind review. Author identity anonymized.

WhyLab is a three-component causal safety audit for self-improving LLM agents: C1, information-theoretic drift detection; C2, an E-value × partial-R² sensitivity filter; C3, Lyapunov-bounded damping. A cross-model evaluation on six LLM families (Gemini 2.0/2.5 Flash, GPT-4o-mini, Llama 3:8b, Llama 3.1:8b, Dolphin-Llama3:8b) first showed that the audit effect is strongly model-conditional.

v4 addition: per-model threshold calibration. A simple ratio-to-target rule (target rejection rate 2.0 per trajectory, with E_min clamped to [0.8, 4.0]) was applied to 4 LLM families. Results: accuracy harm was reduced on 4/4 models and regression harm on 3/4. Llama 3.1:8b rejection rate 0.5→2.0 (exact target); Llama 3:8b regression delta +0.30→0.00 (neutral); Dolphin-Llama3:8b regression delta +0.10→-0.10 (sign flip to a small reduction); GPT-4o-mini regression delta halved (+0.80→+0.40).

Contributions: (1) an instability phase diagram for self-improving LLM agents (384+32 conditions); (2) a six-model cross-family evaluation showing that audit transfer is not automatic; (3) per-model threshold calibration as a concrete mechanism that reduces accuracy harm on 4/4 tested models and regression harm on 3/4.

Change log (v4 vs v3): new calibration subsection (4.5), new Table 2 (fixed vs calibrated), new Figure 3 (bar plots across 4 models), Related Work expanded with cross-model reproducibility framing, Conclusion updated to reflect 5 findings.
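The abstract names a ratio-to-target calibration rule but does not give its update formula. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that proposals whose E-value falls below E_min are rejected (so raising E_min raises the rejection rate) and that the threshold is rescaled by the ratio of target to observed rejection rate before clamping. The function name `calibrate_e_min` and its parameters are hypothetical; only the target rate of 2.0 and the clamp range [0.8, 4.0] come from the abstract.

```python
def calibrate_e_min(e_min, observed_rate, target_rate=2.0, lo=0.8, hi=4.0):
    """One ratio-to-target update of the rejection threshold (illustrative).

    ASSUMPTION: a higher e_min rejects more proposals, so the threshold is
    scaled by target_rate / observed_rate. The source does not state the
    direction or exact form of the update.
    """
    if observed_rate <= 0:
        # No rejections observed: the ratio is undefined, so move to the
        # upper clamp (an assumed convention, not from the source).
        return hi
    scaled = e_min * (target_rate / observed_rate)
    # Clamp to the range given in the abstract: [0.8, 4.0].
    return min(hi, max(lo, scaled))


if __name__ == "__main__":
    # A model rejecting 0.5 proposals per trajectory is under target,
    # so the threshold is raised (and here hits the upper clamp).
    print(calibrate_e_min(1.0, observed_rate=0.5))
    # A model rejecting 4.0 per trajectory is over target: threshold halves.
    print(calibrate_e_min(2.0, observed_rate=4.0))
```

Under these assumptions a single multiplicative step is enough when the rejection rate scales roughly linearly with the threshold; otherwise the update would need to be iterated per model until the observed rate settles near the target.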
Anonymous Author studied this question.
www.synapsesocial.com/papers/69eb0a2e553a5433e34b45bc — DOI: https://doi.org/10.5281/zenodo.19688412
Anonymous Author
American Foundation for the Blind