A benchmark for the failure mode nobody measures: confident AI action taken before the evidence justifies it. We introduce Structural Admissibility as a measurable cognitive property: the capacity to determine whether the current epistemic state justifies committing to an action. Grounded in the Expanding Uncertainty Threshold (EUT) framework, the benchmark evaluates four metacognitive behaviors in frontier language models: instability detection, false resolution resistance, clarification quality, and action gating. We evaluate 7 frontier models (Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, Claude 3 Haiku, Gemini 2.0 Flash, Llama 3.1 70B, Mistral Large 2) on 150 benchmark items spanning 5 task families, against a human baseline of 42 graduate-level participants. The best model, Claude 3.5 Sonnet, reaches a CAS of 0.843, compared with 0.887 for the human baseline. Submitted to: Kaggle, Measuring Progress Toward AGI, Cognitive Abilities Hackathon (Metacognition track), March 2026.
Ivan Andrescov
Ivan Andrescov studied this question.
www.synapsesocial.com/papers/69df2b85e4eeef8a2a6b0844 — DOI: https://doi.org/10.5281/zenodo.19556747