What question did this study set out to answer?

This study aims to systematically benchmark the performance of agentic AI systems in clinical decision-making tasks.

February 20, 2026

Open Access

Benchmarking large language model-based agent systems for clinical decision tasks

Key Points

Two systems were evaluated: the open-source OpenManus and the proprietary Manus.
Performance was assessed across three benchmark families: AgentClinic, MedAgentsBench, and Humanity’s Last Exam (HLE).
Accuracy, token usage, and latency were measured.
The agent systems reached 60.3% on AgentClinic MedQA, 28.0% on AgentClinic MIMIC, 30.3% on MedAgentsBench, and 8.6% on HLE text, only modest gains over baseline LLMs.
Multimodal accuracy remained low: 15.5% on multimodal HLE and 29.2% on AgentClinic NEJM.
Resource demands increased substantially, with over 10× token usage and over 2× latency compared to baseline models.

Abstract

Agentic artificial intelligence (AI) systems, designed to autonomously reason, plan, and invoke tools, have shown promise in healthcare, yet systematic benchmarking of their real-world performance remains limited. In this study, we evaluate two such systems: the open-source OpenManus, built on Meta’s Llama-4 and extended with medically customized agents; and Manus, a proprietary agent system employing a multistep planner-executor-verifier architecture. Both systems were assessed across three benchmark families: AgentClinic, a stepwise dialog-based diagnostic simulation; MedAgentsBench, a knowledge-intensive medical QA dataset; and Humanity’s Last Exam (HLE), a suite of challenging text-only and multimodal questions. Despite access to advanced tools (e.g., web browsing, code development and execution, and text file editing), agent systems yielded only modest accuracy gains over baseline LLMs, reaching 60.3% and 28.0% in AgentClinic MedQA and MIMIC, 30.3% on MedAgentsBench, and 8.6% on HLE text. Multimodal accuracy remained low (15.5% on multimodal HLE, 29.2% on AgentClinic NEJM), while resource demands increased substantially, with >10× token usage and >2× latency. Although 89.9% of hallucinations were filtered by in-agent safeguards, hallucinations remained prevalent. These findings reveal that current agentic designs offer modest performance benefits at significant computational and workflow cost, underscoring the need for more accurate, efficient, and clinically viable agent systems.

Cite this study

Liu et al. studied this question.

www.synapsesocial.com/papers/6997faddad1d9b11b3453f83 — DOI: https://doi.org/10.1038/s41746-026-02443-6

Authors

Yunsong Liu

Zunamys I. Carrero

Xiaofeng Jiang

Journals

npj Digital Medicine

Institutions

Heidelberg University

University of Leeds

University Hospital Heidelberg
