What question did this study set out to answer?

This study aims to compare the accuracy and readability of responses from several large language models addressing patient questions on refractive surgery.

June 3, 2026

Performance of Deepseek vs. Established Large Language Models in Answering Frequently Asked Questions About Refractive Surgery

Key Points

The study evaluated 25 patient-centered questions using four large language models: ChatGPT, DeepSeek, Gemini, and Copilot.
Two ophthalmologists independently rated the responses for accuracy and completeness on Likert scales.
Statistical analyses used the Friedman test with Wilcoxon signed-rank post-hoc comparisons and Bonferroni correction.
DeepSeek and ChatGPT achieved the highest scores for accuracy and completeness, with substantial inter-rater agreement for accuracy (κ = 0.650, p < 0.001).
Copilot performed significantly worse than ChatGPT and DeepSeek (p = 0.003 and p = 0.031).
DeepSeek generated the most readable text (FRE = 45.5; FKGL = 9.1), while Gemini required the highest reading level (FRE = 35.2; FKGL = 12.7).
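The two readability metrics named above follow standard published formulas, sketched here in Python. This is only an illustration: the summary does not say which tool the authors used to compute the scores, and the word, sentence, and syllable counts in the usage lines are hypothetical. Syllable counting is left out because it depends on the heuristic or dictionary chosen.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade required."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for one chatbot response (not taken from the study).
fre = flesch_reading_ease(words=100, sentences=5, syllables=150)
fkgl = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
```

On the conventional Flesch scale, scores between 30 and 50 are labelled "difficult", which is the band the reported FRE values of 35.2 and 45.5 fall into.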

Abstract

Background: To evaluate and compare the performance of four large language models (LLMs), namely ChatGPT, DeepSeek, Gemini, and Copilot, in answering frequently asked patient questions on laser refractive surgery.

Methods: This cross-sectional, non-clinical study evaluated 25 patient-centered refractive surgery questions posed to four LLMs. Two ophthalmologists independently rated response accuracy and completeness using Likert scales. Information quality was assessed using the DISCERN instrument, and readability using the Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL). Statistical analysis included the Friedman test with Wilcoxon signed-rank post-hoc comparisons using Bonferroni correction. Cohen’s kappa assessed inter-rater reliability.

Results: Inter-rater agreement was substantial for accuracy (κ = 0.650, p < 0.001) and moderate for completeness (κ = 0.533, p < 0.001). ChatGPT and DeepSeek achieved the highest accuracy and completeness scores, with no significant difference between them. Copilot performed significantly worse than both (p = 0.003 and p = 0.031, respectively), while Gemini showed intermediate performance. DISCERN scores placed all models in the good range (54–58/75). When prompted to provide references, DeepSeek showed the greatest improvement (+7 points), reaching the outstanding category. All models produced responses in the “difficult” readability range; DeepSeek generated the most accessible text (FRE = 45.5; FKGL = 9.1), whereas Gemini required the highest reading level (FRE = 35.2; FKGL = 12.7).

Conclusion: Large language models can provide reasonably accurate responses to refractive surgery–related patient questions. However, variability in information quality and readability highlights the importance of clinician oversight when using these tools for patient education.
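Two of the statistical steps named in the Methods can be illustrated with a short Python sketch: Cohen's kappa for two raters and a Bonferroni adjustment of pairwise p-values. It is a minimal sketch under stated assumptions, not the authors' analysis. The kappa here is the unweighted form (the abstract does not say whether a weighted variant was used), the ratings and p-values are hypothetical, and the abstract does not state whether its reported p-values are already adjusted.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given identical scores.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal score frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[score] * freq_b[score] for score in freq_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1 - p_chance)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical Likert ratings from two raters (not data from the study).
kappa = cohen_kappa([5, 5, 4, 4, 3, 3], [5, 5, 4, 3, 3, 3])

# Four models give six pairwise comparisons; these raw p-values are made up.
adjusted = bonferroni([0.0005, 0.005, 0.01, 0.04, 0.2, 0.5])
```

With four models there are six pairwise comparisons, so under this adjustment a raw p-value must fall below 0.05 / 6 ≈ 0.0083 to remain significant at the 0.05 level.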
