Purpose To evaluate the performance of five popular large language models (LLMs) in addressing cataract-related queries. Methods This comparative evaluation study was conducted at the Eye and ENT Hospital of Fudan University. We performed both qualitative and quantitative assessments of responses from five LLMs: ChatGPT-4, ChatGPT-4o, Gemini, Copilot, and the open-source Llama 3.5. Model outputs were benchmarked against human-generated responses using seven key metrics: accuracy, completeness, conciseness, harmlessness, readability, stability, and self-correction capability. Additional inter-model comparisons were performed across question subgroups categorized by clinical topic type. Results In the information quality assessment, ChatGPT-4o demonstrated the best performance across most metrics, including accuracy score (6.70 ± 0.63), completeness score (4.63 ± 0.63), and harmlessness score (3.97 ± 0.17). Gemini achieved the highest conciseness score (4.00 ± 0.14). Further subgroup analysis showed that all LLMs performed comparably to or better than humans, regardless of the type of question posed. The readability assessment revealed that ChatGPT-4o had the lowest readability score (26.02 ± 10.78), indicating the highest level of reading difficulty. While Copilot recorded a higher readability score (40.26 ± 14.58) than the other LLMs, it still remained lower than that of humans (51.54 ± 13.71). Copilot also exhibited the best stability in reproducibility and stability assessment. All LLMs demonstrated strong self-correction capability when prompted. Conclusion Our study suggested that LLMs exhibited considerable potential in providing accurate and comprehensive responses to common cataract-related clinical issues. Notably, ChatGPT-4o achieved the best scores in accuracy, completeness, and harmlessness. Despite these promising results, clinicians and patients should be aware of the limitations of artificial intelligence (AI) to ensure critical evaluation in clinical practice.
Building similarity graph...
Analyzing shared references across papers
Loading...
Xinyue Wang
Yan Liu
Linghao Song
Frontiers in Artificial Intelligence
Building similarity graph...
Analyzing shared references across papers
Loading...
Wang et al. (Fri,) studied this question.
www.synapsesocial.com/papers/68c1924e9b7b07f3a0616950 — DOI: https://doi.org/10.3389/frai.2025.1639221