What question did this study set out to answer?

The aim is to assess the effectiveness of a three-component audit system for self-improving LLMs, focusing on model-specific calibrations.

April 24, 2026 | Open Access

WhyLab v4: Per-Model Threshold Calibration for a Causal Safety Audit of Self-Improving LLM Agents

Key Points

Evaluated six families of LLMs using a causal safety audit consisting of drift detection, sensitivity filtering, and damping methods.
Applied a per-model threshold calibration to four LLM families with a ratio-to-target rule for rejecting trajectories.
Conducted cross-model evaluations to analyze the effects of the interventions on accuracy and regression harm.
All tested models showed reduced accuracy harm after calibration.
Three out of four models exhibited reduced regression harm, indicating a successful intervention across multiple families.
Specific improvements: the Llama 3.1:8b rejection rate rose from 0.5 to the 2.0 target; the regression delta fell to neutral on Llama 3:8b (+0.30→0.00), flipped sign on Dolphin-Llama3:8b (+0.10→-0.10), and halved on GPT-4o-mini (+0.80→+0.40).

Abstract

Status: NeurIPS 2026 submission under double-blind review. Author identity anonymized.

WhyLab is a three-component causal safety audit for self-improving LLM agents: C1 information-theoretic drift detection, C2 E-value × partial-R² sensitivity filter, C3 Lyapunov-bounded damping. A cross-model evaluation on six LLM families (Gemini 2.0/2.5 Flash, GPT-4o-mini, Llama 3:8b, Llama 3.1:8b, Dolphin-Llama3:8b) first showed that the audit effect is strongly model-conditional.

v4 addition: per-model threshold calibration. A simple ratio-to-target rule (target rejection rate 2.0 per trajectory, E_min clamped to [0.8, 4.0]) was applied to 4 LLM families.

Results: accuracy harm reduced on 4/4 models, regression harm reduced on 3/4. Llama 3.1:8b rejection rate 0.5→2.0 (exact target); Llama 3:8b regression delta +0.30→0.00 (neutral); Dolphin-Llama3:8b regression delta +0.10→-0.10 (sign flip to small reduction); GPT-4o-mini regression halved (+0.80→+0.40).

Contributions: (1) instability phase diagram for self-improving LLM agents (384+32 conditions); (2) six-model cross-family evaluation showing audit transfer is not automatic; (3) per-model threshold calibration as a concrete mechanism that reduces accuracy harm on 4/4 tested models and regression harm on 3/4.

Change log (v4 vs v3): new calibration subsection (4.5), new Table 2 (fixed vs calibrated), new Figure 3 (bar plots across 4 models), Related Work expanded with cross-model reproducibility framing, Conclusion updated to reflect 5 findings.
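
The ratio-to-target rule is described in the abstract only by its target (2.0 rejections per trajectory) and its clamp range for E_min (0.8 to 4.0). The sketch below is one plausible reading, not the authors' implementation. It assumes that raising E_min makes the C2 filter reject more often, that the update multiplies the current threshold by target/observed, and that a model with no observed rejections is sent to the upper clamp. The function name `calibrate_e_min` and its parameters are hypothetical.

```python
def calibrate_e_min(current_e_min, observed_rate,
                    target_rate=2.0, lo=0.8, hi=4.0):
    """Hypothetical ratio-to-target update for a per-model E_min threshold.

    Assumption (not stated in the source): a higher E_min rejects more
    updates, so a model rejecting too rarely gets a higher threshold.
    """
    if observed_rate <= 0:
        # Assumption: with no rejections observed, use the strictest
        # allowed threshold instead of dividing by zero.
        return hi
    scaled = current_e_min * (target_rate / observed_rate)
    # Clamp to the range reported in the abstract.
    return max(lo, min(hi, scaled))


if __name__ == "__main__":
    # Illustrative inputs only; the starting threshold of 1.0 is made up.
    print(calibrate_e_min(1.0, observed_rate=0.5))
    print(calibrate_e_min(1.0, observed_rate=8.0))
```

Under these assumptions a model that rejects at one quarter of the target rate has its threshold multiplied by four, subject to the clamp. If the real filter rejects in the opposite direction, the ratio would be inverted.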

Authors

Anonymous Author

Institutions

American Foundation for the Blind
