What question did this study set out to answer?

The aim is to establish a framework for monitoring and improving stability in self-improving AI agents.

April 21, 2026 | Open Access

WhyLab: A Causal Safety Monitoring Framework for Stable Self-Improving Agents

Key Points

Developed a causal audit framework that activates only in unstable regimes.
Created a phase diagram to identify oscillation boundaries in AI policies.
Conducted evaluations across 384 synthetic and 32 LLM conditions.
Reduced oscillation by 76% in controlled unstable regimes.
Decreased regressions by 44% on adversarial LLM tasks (Gemini 2.0 Flash) with a fixed sensitivity filter.
Verified that the audit remains inactive in the stable regime, as predicted.
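
To make the idea of an oscillation boundary concrete, here is a toy analogy rather than the study's model: a scalar update x <- x - h * k * x flips sign on every step once h * k exceeds 1. The function names, the curvature k = 5.0 (picked only so the toy boundary falls at h = 0.2), and the sign-flip oscillation measure are all assumptions made for illustration.

```python
def count_sign_flips(h, curvature=5.0, x0=1.0, steps=50):
    """Iterate x <- x - h * curvature * x and count sign changes of x.

    Hypothetical toy dynamics: 'curvature' and the flip count are
    illustrative assumptions, not quantities defined by the paper.
    """
    x, flips = x0, 0
    for _ in range(steps):
        x_next = x - h * curvature * x
        if x_next * x < 0:  # sign changed between consecutive iterates
            flips += 1
        x = x_next
    return flips


def audit_active(h, curvature=5.0):
    """Phase-aware rule for the toy: audit only past the boundary h * k > 1."""
    return h * curvature > 1.0


if __name__ == "__main__":
    for h in (0.05, 0.1, 0.2, 0.3, 0.35):
        print(f"h={h:.2f} flips={count_sign_flips(h)} audit={audit_active(h)}")
```

Below the boundary the iterate shrinks monotonically and the audit stays off. Above it every step overshoots, which is the regime where a conditional audit would switch on.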

Abstract

Status: NeurIPS 2026 submission under double-blind review. Author identity anonymized.

Self-improving AI agents lack runtime safeguards that prevent evaluation drift, fragile outcome acceptance, and unbounded parameter updates from compounding into catastrophic policy degradation. We study cognitive policy oscillation (strategy degradation caused by hallucinated feedback) and map an oscillation phase diagram for self-improving agents (384 synthetic + 32 LLM conditions). A sharp instability boundary emerges at moderate step sizes (h ≈ 0.2), yielding a phase-aware deployment rule.

WhyLab is a conditional causal audit framework that activates only in the unstable regime. It has three components:

C1: Information-theoretic drift index
C2: Sensitivity filter combining E-values and partial R² bounds
C3: Lyapunov-bounded damping controller

Our contribution is boundary delineation: identifying when intervention is warranted, not universal improvement. In controlled unstable regimes, the audit reduces oscillation by 76%. On adversarial LLM tasks, fixed C2 reduces regressions by 44% on Gemini 2.0 Flash (p=0.014, Bonferroni-adjusted p=0.042). In the stable regime (SWE-bench Lite, 10,500 episodes), the audit remains inactive, as predicted. Docker evaluations on Gemini 2.0/2.5 Flash show zero observed C2-caused regressions.

Change log (v2 vs v1):

Abstract condensed to boundary-delineation framing (honest null-result acknowledgement).
C2-targeted selective follow-up on SWE-bench reported transparently (no net gain vs fixed C2).
Full Docker evaluation on Gemini 2.5 Flash added.
Phase-aware deployment rule formalized.
References and deployment checklist expanded.
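
The abstract does not spell out how the C2 sensitivity filter is computed. As a rough sketch of the E-value half only, the snippet below uses the standard E-value formula for a risk ratio, E = RR + sqrt(RR * (RR - 1)), and accepts an update only when the E-value clears a threshold. The function names, the use of a risk ratio as the effect measure, and the threshold of 2.0 are assumptions for illustration. The partial R² bound is omitted.

```python
import math


def e_value(risk_ratio):
    """Standard E-value for a risk ratio: RR + sqrt(RR * (RR - 1)).

    Ratios below 1 are inverted first, as is conventional.
    """
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    rr = risk_ratio if risk_ratio >= 1.0 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))


def accept_update(risk_ratio, min_e_value=2.0):
    """Hypothetical filter: keep an update only if its estimated effect
    would need strong unmeasured confounding to be explained away.
    The 2.0 threshold is an assumed value, not taken from the paper."""
    return e_value(risk_ratio) >= min_e_value


if __name__ == "__main__":
    for rr in (1.0, 1.2, 2.0, 0.5):
        print(f"RR={rr}: E-value={e_value(rr):.3f}, accept={accept_update(rr)}")
```

A fragile improvement (risk ratio near 1) yields an E-value near 1 and is rejected, matching the abstract's concern about fragile outcome acceptance.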

Cite this study

Anonymous Author (Sun,) studied this question.

www.synapsesocial.com/papers/69e71423cb99343efc98d8f9 — DOI: https://doi.org/10.5281/zenodo.19063714
