What type of study is this?

This is a Comparative Evaluation study.

September 10, 2025 (Open Access)

Transforming cataract care through artificial intelligence: an evaluation of large language models’ performance in addressing cataract-related queries

Key Points

ChatGPT-4o achieved the highest accuracy and completeness scores among the evaluated models.
In the evaluation, all large language models outperformed human responses in several metrics, demonstrating promising capabilities.
The study utilized both qualitative and quantitative assessments to benchmark model outputs against human-generated responses.
Despite high performance, clinicians and patients must consider the limitations of artificial intelligence in clinical practice.

Abstract

Purpose: To evaluate the performance of five popular large language models (LLMs) in addressing cataract-related queries.

Methods: This comparative evaluation study was conducted at the Eye and ENT Hospital of Fudan University. We performed both qualitative and quantitative assessments of responses from five LLMs: ChatGPT-4, ChatGPT-4o, Gemini, Copilot, and the open-source Llama 3.5. Model outputs were benchmarked against human-generated responses using seven key metrics: accuracy, completeness, conciseness, harmlessness, readability, stability, and self-correction capability. Additional inter-model comparisons were performed across question subgroups categorized by clinical topic type.

Results: In the information quality assessment, ChatGPT-4o demonstrated the best performance across most metrics, including accuracy score (6.70 ± 0.63), completeness score (4.63 ± 0.63), and harmlessness score (3.97 ± 0.17). Gemini achieved the highest conciseness score (4.00 ± 0.14). Further subgroup analysis showed that all LLMs performed comparably to or better than humans, regardless of the type of question posed. The readability assessment revealed that ChatGPT-4o had the lowest readability score (26.02 ± 10.78), indicating the highest level of reading difficulty. Although Copilot recorded a higher readability score (40.26 ± 14.58) than the other LLMs, its score was still lower than that of humans (51.54 ± 13.71). Copilot also exhibited the best stability in the reproducibility and stability assessment. All LLMs demonstrated strong self-correction capability when prompted.

Conclusion: Our study suggested that LLMs exhibited considerable potential in providing accurate and comprehensive responses to common cataract-related clinical issues. Notably, ChatGPT-4o achieved the best scores in accuracy, completeness, and harmlessness. Despite these promising results, clinicians and patients should be aware of the limitations of artificial intelligence (AI) to ensure critical evaluation in clinical practice.
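The abstract reports readability scores in which a lower value means harder text, but it does not name the formula used. As an illustration only, the sketch below assumes the Flesch Reading Ease formula, one common readability measure that behaves this way. The function names and the vowel-group syllable heuristic are hypothetical choices for this example, not the authors' method.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text.

    Assumed here for illustration; the abstract does not state which
    readability formula the study applied.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    plain = "The cat sat. The dog ran."
    dense = ("Phacoemulsification necessitates comprehensive "
             "preoperative ophthalmological evaluation.")
    print(round(flesch_reading_ease(plain), 2))
    print(round(flesch_reading_ease(dense), 2))
```

Under this assumed formula, short sentences made of short words score high, while long technical terms pull the score down, which matches the direction of the comparison reported in the abstract (human responses scoring higher than the LLM responses).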

Authors

Xinyue Wang

Yan Liu

Linghao Song

Journals

Frontiers in Artificial Intelligence
