April 10, 2026 · Open Access

Digital guides in eye care: Comparing AI model accuracy and reliability

Key Points

The study aims to evaluate the accuracy and reliability of four large language models used in ophthalmology for patient education.
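The study was a cross-sectional evaluation of responses to 50 frequently asked patient questions.
The questions covered five ophthalmic subspecialties: strabismus/pediatric ophthalmology, oculoplastics, cataract and refractive surgery, retina, and dry eye.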
Responses were generated by four large language models: ChatGPT, Gemini, Claude, and LLaMA.
Five blinded ophthalmologists evaluated the responses using a 10-point scale.
Unsafe content was identified and categorized based on a structured error taxonomy.
Significant performance differences among the models were observed.
Mean scores on the 10-point scale were 3.44 for Gemini, 2.99 for ChatGPT, 2.48 for Claude, and 1.09 for LLaMA.
Gemini generally outperformed the other models across most subspecialties.
In the retina subspecialty, ChatGPT and Claude generated comparatively stronger responses.
Of the 200 evaluated responses, 19 (9.5%) contained potentially unsafe content, with the lowest proportion for Gemini and the highest for LLaMA.

Abstract

Objectives: The aim of this study was to comparatively evaluate four large language models (LLMs) used for patient education in ophthalmology in terms of accuracy, reliability, and patient safety across different ophthalmic subspecialties.

Methods: In this cross-sectional evaluation, a total of 50 frequently asked patient questions covering five ophthalmic subspecialties (strabismus/pediatric ophthalmology, oculoplastics, cataract and refractive surgery, retina, and dry eye) were included. All questions were submitted in a text-only format to ChatGPT o3 Mini High, Gemini 2.0 Pro, Claude-Sonnet 3.7, and LLaMA 3.1 405B. The generated responses were independently evaluated by five blinded ophthalmologists using a 10-point scale assessing accuracy, currency, informativeness/clarity, and patient safety. Potentially unsafe content was identified and categorized using a predefined structured error taxonomy.

Results: Marked differences in performance were observed among the models. Mean scores were 3.44 for Gemini, 2.99 for ChatGPT, 2.48 for Claude, and 1.09 for LLaMA. Gemini demonstrated higher performance across most subspecialties, whereas in the retina subspecialty, ChatGPT and Claude generated comparatively stronger responses. Of the 200 evaluated responses, 19 (9.5%) contained potentially unsafe content, with the lowest proportion observed for Gemini and the highest for LLaMA.

Conclusions: LLMs can generate useful responses for patient education in ophthalmology, but performance varies by model and subspecialty. Within this 50-question, text-only expert-rating framework, Gemini 2.0 Pro and ChatGPT o3 Mini High provided relatively higher accuracy and reliability in most areas, whereas LLaMA 3.1 405B lagged. Larger and clinically integrated evaluations, including direct assessment of patient understanding and behavior, are needed to define their safe use in practice.
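
The counts quoted in the abstract can be cross-checked with a few lines of Python. This is only an illustrative sketch: it uses the figures reported above, assumes one response per question per model (consistent with the 200 evaluated responses), and the variable names are hypothetical rather than anything taken from the study.

```python
# Figures quoted in the abstract; variable names are illustrative only.
questions = 50
mean_scores = {"Gemini": 3.44, "ChatGPT": 2.99, "Claude": 2.48, "LLaMA": 1.09}
unsafe_responses = 19

# Assumption: each model answered every question exactly once.
total_responses = questions * len(mean_scores)

# Share of all responses flagged as potentially unsafe, in percent.
unsafe_pct = 100 * unsafe_responses / total_responses

# Order the models from highest to lowest reported mean score.
ranking = sorted(mean_scores, key=mean_scores.get, reverse=True)

print(total_responses)
print(f"{unsafe_pct:.1f}%")
print(ranking)
```

Under these assumptions the totals match the abstract: 4 models times 50 questions gives the 200 responses, and 19 of 200 is 9.5%.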

Cite this study

Savaş et al. studied this question.

www.synapsesocial.com/papers/69d895d86c1944d70ce06ffc — DOI: https://doi.org/10.1177/20552076261433657

Authors

Hakan Veli Savaş

Osman Altay

Journals

Digital Health

Institutions

Manisa Celal Bayar University

University of Kara
