Background: To evaluate and compare the performance of four large language models (LLMs), namely ChatGPT, DeepSeek, Gemini, and Copilot, in answering frequently asked patient questions on laser refractive surgery.

Methods: This cross-sectional, non-clinical study evaluated 25 patient-centered refractive surgery questions posed to four LLMs. Two ophthalmologists independently rated response accuracy and completeness using Likert scales. Information quality was assessed using the DISCERN instrument, and readability using the Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL). Statistical analysis included the Friedman test with Wilcoxon signed-rank post-hoc comparisons using Bonferroni correction. Cohen’s kappa assessed inter-rater reliability.

Results: Inter-rater agreement was substantial for accuracy (κ = 0.650, p < 0.001) and moderate for completeness (κ = 0.533, p < 0.001). ChatGPT and DeepSeek achieved the highest accuracy and completeness scores, with no significant difference between them. Copilot performed significantly worse than both (p = 0.003 and p = 0.031, respectively), while Gemini showed intermediate performance. DISCERN scores placed all models in the good range (54–58/75). When prompted to provide references, DeepSeek showed the greatest improvement (+7 points), reaching the outstanding category. All models produced responses in the “difficult” readability range; DeepSeek generated the most accessible text (FRE = 45.5; FKGL = 9.1), whereas Gemini required the highest reading level (FRE = 35.2; FKGL = 12.7).

Conclusion: Large language models can provide reasonably accurate responses to refractive surgery–related patient questions. However, variability in information quality and readability highlights the importance of clinician oversight when using these tools for patient education.
Arıbaş et al. (Mon,) studied this question.
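
The readability and agreement metrics named in the Methods can be illustrated with a minimal Python sketch. It uses the standard published FRE and FKGL formulas and an unweighted Cohen's kappa. The function names and the toy inputs are hypothetical, and the abstract does not say which software, syllable counter, or kappa variant (weighted or unweighted) the authors used, so this is an illustration of the metrics rather than a reproduction of the study's analysis.

```python
from collections import Counter


def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level formula (approximate US school grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items.

    Assumption: the study may have used a weighted variant for its
    Likert ratings; the abstract does not specify.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given identical scores.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal score frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    if expected == 1:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical counts for one response: 100 words, 5 sentences, 150 syllables.
    print(round(flesch_reading_ease(100, 5, 150), 3))
    print(round(flesch_kincaid_grade(100, 5, 150), 2))
    # Hypothetical Likert ratings from two raters on four responses.
    print(cohens_kappa([1, 1, 2, 2], [1, 1, 2, 1]))
```

Both readability functions take precomputed word, sentence, and syllable counts, which keeps the sketch independent of any particular tokenizer or syllable-counting heuristic.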