Multimodal large language models (LLMs) are now deeply integrated into medical education and widely used by medical students, yet it remains unclear whether current models possess the accuracy and reliability needed to support image-based learning. We evaluated four state-of-the-art multimodal LLMs (ChatGPT-5.1, Gemini-2.5, Grok-4, Claude Sonnet-4.5) on 208 image-based examination questions from a Doctor of Medicine program, spanning anatomical pathology (histopathology; 47.6%), radiology (31.7%), and surgical anatomy (20.7%). To isolate visual reasoning, all items were presented in image-only form with contextual information removed. Items covered seven organ systems, included both constructed-response and selected-response formats, and were categorized as recognition-only or recognition-plus-reasoning. ChatGPT-5.1 achieved the highest accuracy (75.5%; 95% CI 69.2-80.8), followed by Gemini-2.5 (59.6%; 95% CI 52.8-66.1), Claude Sonnet-4.5 (41.8%; 95% CI 35.3-48.6), and Grok-4 (34.6%; 95% CI 28.5-41.3). Overall model performance differed significantly, and the ranking (ChatGPT > Gemini > Claude ≈ Grok) was observed across different categories. Accuracy was uniformly higher for recognition-only and selected-response items. Even the best-performing model, ChatGPT-5.1, answered approximately one in four questions incorrectly. These findings suggest that current multimodal LLMs cannot yet replace expert teaching in image-based learning. Their use in medical education should therefore remain supervised and critically appraised, serving as adjuncts rather than authoritative sources.
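The reported accuracies and 95% confidence intervals can be sanity-checked with a short calculation. The sketch below is illustrative only: the abstract does not say which interval method the authors used, so the Wilson score interval is an assumption here, and the per-model correct-answer counts are inferred from the reported percentages of 208 items rather than taken from the paper. Under those assumptions, the computed intervals agree with the reported ones to within rounding.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct, total, confidence=0.95):
    """Wilson score interval for a binomial proportion (assumed method)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


TOTAL_ITEMS = 208  # number of questions stated in the abstract

# Assumption: counts back-calculated from the reported accuracies
# (e.g. 157 / 208 = 75.5%); they are not given in the abstract.
inferred_correct = {
    "ChatGPT-5.1": 157,
    "Gemini-2.5": 124,
    "Claude Sonnet-4.5": 87,
    "Grok-4": 72,
}

for model, correct in inferred_correct.items():
    low, high = wilson_interval(correct, TOTAL_ITEMS)
    accuracy = 100 * correct / TOTAL_ITEMS
    print(f"{model}: {accuracy:.1f}% (95% CI {100 * low:.1f}-{100 * high:.1f})")
```

The Wilson interval is used here instead of the simpler normal approximation because it stays within 0-100% and behaves better for proportions far from 50%; other methods would give slightly different bounds.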
Ming Lu
Josiah Cheng
Vinod Gopalan
Anatomical Sciences Education
Griffith University
Gold Coast Hospital
Logan Hospital
Lu et al. studied this question.
www.synapsesocial.com/papers/6a06b914e7dec685947ab91a — DOI: https://doi.org/10.1002/ase.70256