May 21, 2026 | Open Access

Formalizing Consistency in AI Agent Evaluation: Suite v0

Key points

The study aims to improve AI agent evaluation by formalizing consistency across multiple dimensions instead of relying solely on pass rate.
It introduces Suite v0, a 50-task benchmark for AI agents built on a five-axis reproducibility hierarchy.
An AI agent (Zeus) was evaluated using four consistency axes and a cross-family critic constraint.
The evaluation compared pass rates against artifact stability and scorer agreement.
62% of tasks pass yet produce structurally different artifacts on each rerun.
Two scorers disagreed on identical artifacts, which the authors call evaluator pathology.
Prompt augmentations that raised pass rate from 88% to 96% concurrently degraded held-out generalization.
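
The comparison of pass rate against artifact stability and scorer agreement can be illustrated with a minimal sketch. It is not the Suite v0 implementation: the `Run` record, the use of a SHA-256 digest as the stability check, and the function names are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One rerun of one task (hypothetical record layout, not Suite v0's)."""
    task_id: str
    passed: bool
    artifact: str


def artifact_digest(artifact: str) -> str:
    # Byte-level fingerprint; two reruns are "stable" only if digests match.
    return hashlib.sha256(artifact.encode("utf-8")).hexdigest()


def summarize(runs: list[Run]) -> dict[str, float]:
    """Report pass rate next to the share of tasks that pass yet vary."""
    by_task: dict[str, list[Run]] = {}
    for run in runs:
        by_task.setdefault(run.task_id, []).append(run)
    if not by_task:
        raise ValueError("no runs supplied")

    # A task counts as passing only if every rerun passed.
    passing = [t for t, rs in by_task.items() if all(r.passed for r in rs)]
    # Among passing tasks, flag those whose artifacts differ across reruns.
    unstable = [
        t for t in passing
        if len({artifact_digest(r.artifact) for r in by_task[t]}) > 1
    ]
    total = len(by_task)
    return {
        "pass_rate": len(passing) / total,
        "pass_but_unstable": len(unstable) / total,
    }


def scorer_disagreement(scorer_a: list[bool], scorer_b: list[bool]) -> float:
    """Fraction of identical artifacts on which two scorers give different verdicts."""
    if len(scorer_a) != len(scorer_b) or not scorer_a:
        raise ValueError("verdict lists must be non-empty and equally long")
    return sum(a != b for a, b in zip(scorer_a, scorer_b)) / len(scorer_a)
```

Reporting `pass_but_unstable` beside `pass_rate` makes visible the case the key points describe: a task that passes on every rerun while emitting a different artifact each time.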

Abstract

Public AI agent benchmarks report a single scalar: pass rate. We argue this is a lossy projection of five orthogonal axes onto one dimension, and that the missing axes have first-order consequences for production deployments. We formalize consistency as a five-axis hierarchy of reproducibility measures: exact, syntactic, lexical, behavioral, and decisional. We prove a monotonicity relation and an optimization tradeoff theorem showing that raising pass rate provides no guarantee of non-degradation in orthogonal axes. We demonstrate this formalization with Suite v0: a 50-task, $0/night, MIT-licensed benchmark instrumented with four consistency axes and a cross-family critic constraint. On a real AI agent (Zeus), a single evaluation run surfaces: (1) 62% of tasks pass yet produce structurally different artifacts on each rerun; (2) two scorers disagree on the same artifact (evaluator pathology); (3) prompt augmentations raising pass rate from 88% to 96% concurrently degrade held-out generalization; and (4) client-side determinism (temperature=0, RNG seeding) is insufficient: rerun instability worsens slightly while pass rate improves. Suite v0 is reproducible in five commands on commodity hardware at zero cost.
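
To make the idea of a reproducibility hierarchy concrete, the sketch below computes pairwise rerun agreement at three text-level strictness levels. The abstract does not define the axes, so the checks here are illustrative assumptions: "exact" is string equality, "syntactic" is equality after whitespace normalization, and "lexical" is equality of lowercase word-token counts. The behavioral and decisional axes are omitted because they would need task-specific execution and decision extraction. The sketch also assumes one natural reading of monotonicity, namely that agreement at a stricter level implies agreement at every looser level; the paper's own statement may differ.

```python
import re
from collections import Counter
from itertools import combinations

# Ordered from strictest to loosest (illustrative subset of the five axes).
LEVELS = ("exact", "syntactic", "lexical")


def agrees(level: str, a: str, b: str) -> bool:
    """Return True if two artifacts agree at the given (assumed) level."""
    if level == "exact":
        return a == b
    if level == "syntactic":
        # Collapse whitespace runs so formatting-only differences are ignored.
        return " ".join(a.split()) == " ".join(b.split())
    if level == "lexical":
        # Compare case-insensitive word-token counts, ignoring order.
        tokens_a = Counter(re.findall(r"\w+", a.lower()))
        tokens_b = Counter(re.findall(r"\w+", b.lower()))
        return tokens_a == tokens_b
    raise ValueError(f"unknown level: {level}")


def consistency_rates(reruns: list[str]) -> dict[str, float]:
    """Fraction of rerun pairs that agree at each level."""
    pairs = list(combinations(reruns, 2))
    if not pairs:
        raise ValueError("need at least two reruns")
    return {
        level: sum(agrees(level, a, b) for a, b in pairs) / len(pairs)
        for level in LEVELS
    }


if __name__ == "__main__":
    reruns = ["x = 1", "x = 1", "x  =  1", "1 = x", "y = 2"]
    print(consistency_rates(reruns))
```

Because each check here only discards information that the stricter check kept, the resulting rates cannot decrease when moving from "exact" to "lexical". A single pass rate carries none of this ordering.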

Authors

Atakan Akbaba
