In this cross-sectional study of 21 LLMs, frontier LLMs achieved high accuracy on final diagnoses but performed poorly in generating differential diagnoses and navigating uncertainty relative to other reasoning stages. The PrIME-LLM framework provided greater separation than raw accuracy, revealing critical reasoning gaps obscured by traditional benchmarks. Thus, despite version-based improvements and advantages in reasoning-optimized models, off-the-shelf LLMs have not yet achieved the intelligence required for safe deployment and remain limited in demonstrating advanced clinical reasoning.
Arya S. Rao
Kaiz P. Esmail
Richard S. Lee
JAMA Network Open
Harvard University
Brigham and Women's Hospital
Massachusetts General Hospital
Rao et al. studied this question.
www.synapsesocial.com/papers/69df2b85e4eeef8a2a6b070b — DOI: https://doi.org/10.1001/jamanetworkopen.2026.4003