May 15, 2026
Open Access

Performance of multimodal large language models on image‐based surgical anatomy, anatomical pathology, and radiology questions

Abstract

Multimodal large language models (LLMs) are now deeply integrated into medical education and widely used by medical students, yet it remains unclear whether current models possess the accuracy and reliability needed to support image-based learning. We evaluated four state-of-the-art multimodal LLMs (ChatGPT-5.1, Gemini-2.5, Grok-4, Claude Sonnet-4.5) on 208 image-based examination questions from a Doctor of Medicine program, spanning anatomical pathology (histopathology; 47.6%), radiology (31.7%), and surgical anatomy (20.7%). To isolate visual reasoning, all items were presented in image-only form with contextual information removed. Items covered seven organ systems, included both constructed-response and selected-response formats, and were categorized as recognition-only or recognition-plus-reasoning. ChatGPT-5.1 achieved the highest accuracy (75.5%; 95% CI 69.2-80.8), followed by Gemini-2.5 (59.6%; 95% CI 52.8-66.1), Claude Sonnet-4.5 (41.8%; 95% CI 35.3-48.6), and Grok-4 (34.6%; 95% CI 28.5-41.3). Overall model performance differed significantly, and the same ordering (ChatGPT > Gemini > Claude ≈ Grok) was seen across different categories. Accuracy was uniformly higher for recognition-only and selected-response items. Even the best-performing model, ChatGPT-5.1, answered approximately one in four questions incorrectly. These findings suggest that current multimodal LLMs cannot yet replace expert teaching in image-based learning. Their use in medical education should therefore remain supervised and critically appraised, serving as adjuncts rather than authoritative sources.
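
The accuracies and 95% confidence intervals in the abstract can be sanity-checked with a short calculation. The sketch below is illustrative only and rests on two assumptions not stated in the abstract: that the intervals are Wilson score intervals, and that the correct-answer counts are the integers out of 208 that round to the reported percentages (157, 124, 87, and 72). The function and variable names are hypothetical.

```python
from math import sqrt


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


N_ITEMS = 208  # number of questions, from the abstract

# ASSUMPTION: counts inferred from the reported percentages, not given in the source.
ASSUMED_CORRECT = {
    "ChatGPT-5.1": 157,        # 75.5%
    "Gemini-2.5": 124,         # 59.6%
    "Claude Sonnet-4.5": 87,   # 41.8%
    "Grok-4": 72,              # 34.6%
}

for model, k in ASSUMED_CORRECT.items():
    lo, hi = wilson_interval(k, N_ITEMS)
    print(f"{model}: {k / N_ITEMS:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```

Under these assumptions the computed bounds fall within about 0.1 percentage points of the reported intervals, which is consistent with, but does not confirm, a Wilson-type method having been used.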

Authors

Ming Lu

Josiah Cheng

Vinod Gopalan

Journals

Anatomical Sciences Education

Institutions

Griffith University

Gold Coast Hospital

Logan Hospital
