v0.9: calibration revision. Removed "breakthrough" language throughout (abstract, §9.1, §10, §13 conclusion) in favor of more measured phrasing ("most promising candidate signal in this work"). The conclusion now explicitly flags the W/S ≈ 0.41 ratio match between Qwen and Llama as the most suggestive pattern, one that merits independent replication on additional models before being treated as a discovery. It also explicitly notes the N=20 sample size on the Qwen primary dataset and the false-positive (FP) cost implications at production scale. The mid-layer ablation claim is now hedged between two interpretations: edge-layer specificity and a layer-count effect. Same 6-detector arc, same numeric results, same scope (multi-turn code-execution agent tasks with rich action spaces). The original v1 PDF remains accessible at doi.org/10.5281/zenodo.20368601.

Paper summary: We identify a 34% blind spot in probe-based LLM agent failure monitoring on Qwen3.6-27B SWE-bench Pro: the WANDERING sub-class, where the probe says "success" but the agent never emits finish_tool. We test six detector designs across three signal channels (text, residual cross-layer, action entropy). The most promising candidate is tool-use entropy collapse: WANDERING agents collapse onto a small set of repeated tool calls (W/S median ratio ≈ 0.41 in Qwen and Llama, ≈ 0.71 in GPT-5), enabling a Tier-3 autonomous-termination detector with 70% recall at a 5% false-positive rate on the primary dataset. Cross-architecture validation: Llama-70b (n=2,315, p<10⁻¹⁵, ratio ≈ 0.41) and the GPT-5 router (n=1,419, p=8.9×10⁻³⁵, ratio ≈ 0.71) confirm the direction. Cross-task validation on METR MALT (15+ task families) is NULL (p=0.81), scoping the claim to multi-turn code-execution agent tasks with rich action spaces. Reproducibility: all code, per-trajectory output JSONs, and figure-generation scripts are on GitHub under Apache-2.0.
OpenInterp Phase 6 dataset (99 trajectories × per-turn residuals at L11/L23/L31/L43/L55 in bf16 safetensors) will be released at HuggingFace upon paper acceptance.
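
To make the entropy-collapse signal concrete, here is a minimal sketch of how a W/S median ratio could be computed. It assumes the signal is the Shannon entropy of the empirical tool-call distribution over a whole trajectory, which this summary does not confirm; the paper's actual estimator (windowing, smoothing, normalization) may differ. The function names and the tool names in the usage example are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def tool_entropy(tool_calls):
    """Shannon entropy (in bits) of the tool names used in one trajectory.

    Assumption: entropy is taken over the whole trajectory at once.
    """
    if not tool_calls:
        return 0.0
    total = len(tool_calls)
    counts = Counter(tool_calls)
    # Sum p * log2(1/p) over the observed tools.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def median_entropy_ratio(wandering, success):
    """Median entropy of WANDERING trajectories over that of SUCCESS ones."""
    w = median(tool_entropy(t) for t in wandering)
    s = median(tool_entropy(t) for t in success)
    if s == 0:
        raise ValueError("median SUCCESS entropy is zero; ratio is undefined")
    return w / s


if __name__ == "__main__":
    # Hypothetical toy trajectories, not data from the paper.
    wandering = [["bash"] * 7 + ["read_file"], ["bash"] * 3 + ["read_file"]]
    success = [["bash", "read_file", "edit", "run_tests"]] * 2
    print(median_entropy_ratio(wandering, success))
```

A ratio below 1 means the WANDERING trajectories concentrate on fewer distinct tools than the SUCCESS trajectories, which is the direction reported above.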
Caio Vicentino studied this question.