Public AI agent benchmarks report a single scalar: pass rate. We argue this is a lossy projection of five orthogonal axes onto one dimension, and that the missing axes have first-order consequences for production deployments. We formalize consistency as a five-axis hierarchy of reproducibility measures: exact, syntactic, lexical, behavioral, and decisional. We prove a monotonicity relation and an optimization tradeoff theorem showing that raising pass rate provides no guarantee of non-degradation in orthogonal axes. We demonstrate this formalization with Suite v0: a 50-task, $0/night, MIT-licensed benchmark instrumented with four consistency axes and a cross-family critic constraint. On a real AI agent (Zeus), a single evaluation run surfaces four findings: (1) 62% of tasks pass yet produce structurally different artifacts on each rerun; (2) two scorers disagree on the same artifact (evaluator pathology); (3) prompt augmentations that raise pass rate from 88% to 96% concurrently degrade held-out generalization; and (4) client-side determinism (temperature=0, RNG seeding) is insufficient: rerun instability worsens slightly while pass rate improves. Suite v0 is reproducible in five commands on commodity hardware at zero cost.
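To make the gap between pass rate and rerun consistency concrete, the sketch below scores a set of reruns of one task on three of the named axes. It is a minimal illustration under stated assumptions, not the Suite v0 implementation: the function names, the pairwise-agreement definitions, and the whitespace-token Jaccard measure used for the lexical axis are all hypothetical choices made here, and the abstract does not specify how each axis is computed.

```python
from itertools import combinations


def _mean_over_pairs(items, score):
    """Average a pairwise score over all unordered pairs of reruns."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(score(a, b) for a, b in pairs) / len(pairs)


def pass_rate(verdicts):
    """Fraction of reruns that passed (the usual single scalar)."""
    return sum(verdicts) / len(verdicts)


def exact_consistency(outputs):
    """Assumed definition: fraction of rerun pairs with identical output."""
    return _mean_over_pairs(outputs, lambda a, b: float(a == b))


def lexical_consistency(outputs):
    """Assumed definition: mean Jaccard overlap of whitespace-token sets."""
    def jaccard(a, b):
        sa, sb = set(a.split()), set(b.split())
        if not sa and not sb:
            return 1.0
        return len(sa & sb) / len(sa | sb)
    return _mean_over_pairs(outputs, jaccard)


def decisional_consistency(verdicts):
    """Assumed definition: fraction of rerun pairs with the same verdict."""
    return _mean_over_pairs(verdicts, lambda a, b: float(a == b))


# Hypothetical reruns of one task: every run passes, but the artifacts differ.
outputs = [
    "def f(x): return x + 1",
    "def f(y): return y + 1",
    "def f(x): return x + 1",
]
verdicts = [True, True, True]

print("pass rate:", pass_rate(verdicts))
print("exact:", exact_consistency(outputs))
print("lexical:", lexical_consistency(outputs))
print("decisional:", decisional_consistency(verdicts))
```

In this toy data every rerun passes and the verdicts agree, yet only one of the three rerun pairs is identical, which is the kind of divergence a pass rate alone cannot show.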
Atakan Akbaba
Atakan Akbaba studied this question.
synapsesocial.com/papers/6a0ea196be05d6e3efb6077f — DOI: https://doi.org/10.5281/zenodo.20285100