What type of study is this?

This is a Systematic Review study (also classified as: Findings Consistent with Meta-Analysis).

August 26, 2025 | Open Access

AI Agents in Clinical Medicine: A Systematic Review

Key Points

AI agents improved performance on clinical tasks, with some gains exceeding 60 percentage points over baseline large language models (LLMs).
The review included twenty studies; in all of them, AI agent systems were more accurate than their baseline LLMs.
AI agents were particularly effective for discrete tasks such as medication dosing and evidence retrieval, and multi-agent systems were most beneficial on highly complex tasks.
The authors call for prospective, multi-center trials using real-world data to establish the safety and cost-effectiveness of these systems.

Abstract

Background: AI agents built on large language models (LLMs) can plan tasks, use external tools, and coordinate with other agents. Unlike standard LLMs, agents can execute multi-step processes, access real-time clinical information, and integrate multiple data sources. There has been interest in using such agents for clinical and administrative tasks; however, little is known about their performance or about whether multi-agent systems outperform a single agent on healthcare tasks.

Purpose: To evaluate the performance of AI agents in healthcare, compare AI agent systems with standard LLMs, and catalog the tools used for task completion.

Data Sources: PubMed, Web of Science, and Scopus from October 1, 2022, through August 5, 2025.

Study Selection: Peer-reviewed studies implementing AI agents for clinical tasks with quantitative performance comparisons.

Data Extraction: Two reviewers (A.G., M.O.) independently extracted data on architectures, performance metrics, and clinical applications. Discrepancies were resolved by discussion, with a third reviewer (E.K.) consulted when consensus could not be reached.

Data Synthesis: Twenty studies met inclusion criteria. In every study, the agent system outperformed its baseline LLM in accuracy. Improvements ranged from small gains to increases of over 60 percentage points, with a median improvement of 53 percentage points in single-agent tool-calling studies. These systems were particularly effective for discrete tasks such as medication dosing and evidence retrieval. Multi-agent systems showed optimal performance with up to 5 agents, and their advantage was most pronounced on highly complex tasks. The largest performance gains occurred when the complexity of the AI agent framework matched that of the task.

Limitations: Heterogeneous outcomes precluded quantitative meta-analysis. Several studies relied on synthetic data, limiting generalizability.

Conclusions: AI agents consistently improve the clinical task performance of base LLMs when the architecture matches task complexity. Our analysis indicates a step-change over base LLMs, with AI agents opening previously inaccessible domains. Future efforts should be based on prospective, multi-center trials using real-world data to determine safety, task matching, and cost-effectiveness.

Authors

Alon Gorenshtein

Mahmud Omar

Benjamin S. Glicksberg
