Self-improving AI agents lack runtime safeguards that prevent evaluation drift, fragile outcome acceptance, and unbounded parameter updates from compounding into catastrophic policy degradation. WhyLab introduces a causal audit framework comprising three complementary defenses: C1: Information-theoretic drift detection across evaluation streams C2: E-value × Robustness Value dual-threshold filter for fragile outcomes C3: Lyapunov-bounded adaptive damping with observable energy proxy Experiments on synthetic environments demonstrate that C1 improves within-horizon detection reliability, C2 substantially reduces fragile acceptance rates, and C3 achieves the lowest violation frequency with strong proxy–state alignment. Code: https://github.com/neogenesislab/WhyLab-NeurIPS2026
Building similarity graph...
Analyzing shared references across papers
Loading...
Anonymous (Wed,) studied this question.
www.synapsesocial.com/papers/69b3ac2b02a1e69014ccda8e — DOI: https://doi.org/10.5281/zenodo.18948929
Anonymous
Building similarity graph...
Analyzing shared references across papers
Loading...