This review systematically examines evaluation methodologies for agentic AI systems capable of multi-step planning, tool usage, and environmental interaction across diverse domains. Current evaluation practices exhibit a critical disconnect between benchmark performance and deployment viability: agents that achieve high scores on standardized benchmarks frequently fail in real-world applications because assessment methodologies prioritize task completion over deployment-critical dimensions such as cost efficiency, safety compliance, maintainability, and workflow integration. We critically analyze 15 major agent benchmarks (AgentBench, WebArena, SWE-bench, PaperBench, MLGym, BrowserGym, HumanEval, MBPP, GAIA, ToolBench, Terminal-Bench, Mind2Web, ALFWorld, BabyAI, and HotPotQA), examining their methodologies, metrics, datasets, and application domains, with software development serving as our primary case study. Our analysis reveals that evaluation methodology, not model capability, constitutes the primary bottleneck limiting reliable agent deployment. We demonstrate how test-passing metrics systematically ignore code quality, security vulnerabilities, and integration complexity, while binary success metrics obscure planning coherence, resource efficiency, and safety violations (0/15 benchmarks integrate security or safety into scoring). This review provides a cross-domain taxonomy exposing evaluation inadequacies, trajectory-level evaluation frameworks addressing cost-reproducibility-validity trade-offs, systematic identification of metric insufficiencies including the absence of safety-aware and cost-aware scoring, and a synthesis of emerging evaluation paradigms with their adoption barriers.
Quantitatively, 0/15 benchmarks integrate safety or security into scoring, 0/15 include cost-efficiency metrics in their primary evaluation protocol, and 13/15 rely exclusively on binary success measures, confirming that evaluation methodology, not model capability, is the primary bottleneck to reliable deployment. Progress toward trustworthy agentic AI fundamentally depends on evolving evaluation infrastructure beyond binary metrics toward comprehensive, multidimensional assessment frameworks that capture these deployment-essential dimensions.
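To make the contrast between binary and multidimensional scoring concrete, the following is a minimal sketch, not the review's own framework. The `Trajectory` record, its fields, the `cost_budget_usd` parameter, and the sample runs are all hypothetical, chosen only to illustrate how a single success rate can hide cost and safety problems.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One agent run on one task. All fields are hypothetical illustrations."""
    task_id: str
    succeeded: bool          # the binary outcome most benchmarks report
    cost_usd: float          # assumed total spend for the run
    safety_violations: int   # assumed count of flagged unsafe actions
    steps: int               # number of actions the agent took


def binary_success_rate(runs):
    """Fraction of runs that completed the task, ignoring everything else."""
    return sum(r.succeeded for r in runs) / len(runs)


def multidimensional_report(runs, cost_budget_usd):
    """Report success alongside safety-aware and cost-aware variants."""
    n = len(runs)
    safe = [r for r in runs if r.succeeded and r.safety_violations == 0]
    safe_in_budget = [r for r in safe if r.cost_usd <= cost_budget_usd]
    return {
        "success_rate": binary_success_rate(runs),
        "safe_success_rate": len(safe) / n,
        "safe_within_budget_rate": len(safe_in_budget) / n,
        "mean_cost_usd": sum(r.cost_usd for r in runs) / n,
        "mean_steps": sum(r.steps for r in runs) / n,
    }


# Made-up runs: three succeed, but one is expensive and one is unsafe.
runs = [
    Trajectory("t1", True, 0.40, 0, 12),
    Trajectory("t2", True, 3.00, 0, 85),
    Trajectory("t3", True, 0.50, 2, 15),
    Trajectory("t4", False, 0.25, 0, 9),
]
report = multidimensional_report(runs, cost_budget_usd=1.00)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

On these illustrative runs the binary success rate is 0.75, while only 0.25 of runs succeed safely and within the assumed budget, which is the kind of gap the abstract attributes to binary metrics.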
Kehkashan et al. studied this question.
www.synapsesocial.com/papers/69edacbd4a46254e215b4820 — DOI: https://doi.org/10.1007/s10462-026-11571-0
Tanzila Kehkashan
Muhammad Abdullah
Ahmad Sami Al-Shamayleh
Artificial Intelligence Review
University of Zagreb
University of Technology Malaysia
Qatar University