March 3, 2026
Open Access

Accuracy and Reproducibility of Different Artificial Intelligence Chatbots Responses to Patient-Based Vitreoretinal Questions: A Comparative Study

Key Points

ChatGPT-5.o achieved 94% accuracy and 96.3% reproducibility in responses.
DeepSeek R1 showed the highest reproducibility at 98.5% with 92.6% accuracy.
Meta AI had 91% accuracy and 94% reproducibility, while Grok 3.0 was the least accurate at 49.6%.
Significant variability in chatbot performance suggests careful clinical adoption is necessary.

Abstract

Motasem Al-latayfeh,1,2 Abdelwahab Aleshawi,3 Omar S El-Mulki,4 Mohammed Baker,5 Zaina Qaddoumi,6 Dalia Attar,7 Lina Alma'aitah,8 Elaf Z Jarrah,5 Zainah Abu Khalil,5 Walaa Awad,3 Mo'men Raed Dayeh,3 Seren Al Beiruti,3 Rami Al-Dwairi3

1Department of Special Surgery, Faculty of Medicine, the Hashemite University, Zarqa, Jordan
2Department of Ophthalmology, Prince Hamza Hospital, Amman, Jordan
3Ophthalmology Division, Department of Special Surgery, Faculty of Medicine, Jordan University of Science and Technology, Irbid, Jordan
4Department of Ophthalmology, Bascom Palmer Eye Institute, University of Miami Miller School of Medicine, Miami, FL, USA
5Faculty of Medicine, Jordan University of Science and Technology, Irbid, Jordan
6Faculty of Science, University of Jordan, Amman, Jordan
7Faculty of Medicine, Hashemite University, Zarqa, Jordan
8Faculty of Pharmacy, Hashemite University, Zarqa, Jordan

Correspondence: Motasem Al-latayfeh, Department of Special Surgery, Faculty of Medicine, the Hashemite University, Zarqa, Jordan, Email Motasem974@gmail.com
Rami Al-Dwairi, Ophthalmology Division, Department of Special Surgery, Faculty of Medicine, Jordan University of Science and Technology, Irbid, Jordan, Email ramialdwairi@yahoo.com

Background: Generative artificial intelligence (AI) chatbots are increasingly used by patients, but their reliability in complex ophthalmic conditions remains uncertain. This study aimed to compare the accuracy, comprehensiveness, and reproducibility of five AI chatbots (ChatGPT-5.o, DeepSeek R1, Meta AI, Grok 3.0, and Google Gemini 2.5 Pro) in responding to patient-centered vitreoretinal questions.

Methods: A total of 135 questions covering diabetic retinopathy, floaters/flashes, age-related macular degeneration, retinal tear/detachment, and vitrectomy were sourced from the American Academy of Ophthalmology "Ask an Ophthalmologist" database. Each question was submitted twice to each chatbot under standardized instructions. Two board-certified vitreoretinal ophthalmologists independently graded responses for accuracy and reproducibility. Accuracy was calculated as the proportion of responses graded "Correct and Comprehensive" or "Accurate but incomplete"; reproducibility was defined as agreement between the two responses.

Results: ChatGPT-5.o achieved the highest overall accuracy (94%, n=127/135, 95% CI: 89.9%–98.1%) with a reproducibility rate of 96.3% (n=130/135, 95% CI: 93.1%–99.5%). DeepSeek R1 demonstrated the greatest reproducibility (98.5%, n=133/135, 95% CI: 96.5%–100.0%) and high accuracy (92.6%, n=125/135, 95% CI: 88.1%–97.1%). Meta AI showed 91% accuracy (95% CI: 86.1%–95.9%) and 94% reproducibility (95% CI: 89.9%–98.1%), whereas Grok 3.0 yielded the lowest accuracy (49.6%, n=67/135, 95% CI: 41.2%–58.0%) despite moderate reproducibility (88.1%, n=119/135, 95% CI: 82.7%–93.5%). Google Gemini 2.5 Pro recorded 72.6% accuracy (95% CI: 65.1%–80.1%) and the lowest reproducibility (77%, 95% CI: 69.9%–84.1%). By category, "Vitrectomy" scored the highest across all chatbots (94%, 95% CI: 87.2%–100.0%), followed by "Macular degeneration" (90%, 95% CI: 85.0%–95.0%). "Diabetic retinopathy" had the lowest accuracy rate (64.7%, 95% CI: 52.1%–77.3%).

Conclusion: ChatGPT-5.o and DeepSeek R1 achieved high accuracy and reproducibility, approaching clinical standards, which indicates their potential as patient-education tools in vitreoretinal care. However, variability across models and disease categories highlights the need for cautious clinical adoption and continued optimization to ensure safe, reliable information delivery.

Keywords: vitreoretinal surgery, diabetic retinopathy, artificial intelligence, large language models
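The accuracy and reproducibility rates reported in the abstract are proportions out of 135 questions, each with a 95% confidence interval. The abstract does not state which interval method was used, so the following is only a minimal sketch that assumes a normal-approximation (Wald) interval clipped to the range 0 to 1. The function name `proportion_with_wald_ci` and the dictionary `reported_counts` are illustrative, not taken from the study; the counts themselves are the ones given in the Results. Under this assumption the computed bounds land within a few tenths of a percentage point of the published ones, but they are not guaranteed to match exactly.

```python
import math


def proportion_with_wald_ci(successes, total, z=1.96):
    """Return (proportion, lower, upper) using a normal-approximation interval.

    Assumption: a Wald interval with z = 1.96, clipped to [0, 1].
    The source does not say which method the authors actually used.
    """
    p = successes / total
    margin = z * math.sqrt(p * (1 - p) / total)
    lower = max(0.0, p - margin)
    upper = min(1.0, p + margin)
    return p, lower, upper


# Counts as reported in the Results (responses meeting the criterion / 135).
reported_counts = {
    "ChatGPT-5.o accuracy": (127, 135),
    "ChatGPT-5.o reproducibility": (130, 135),
    "DeepSeek R1 accuracy": (125, 135),
    "DeepSeek R1 reproducibility": (133, 135),
    "Grok 3.0 accuracy": (67, 135),
    "Grok 3.0 reproducibility": (119, 135),
}

for label, (k, n) in reported_counts.items():
    p, lo, hi = proportion_with_wald_ci(k, n)
    print(f"{label}: {100 * p:.1f}% (95% CI: {100 * lo:.1f}% to {100 * hi:.1f}%)")
```

For proportions close to 100%, such as the DeepSeek R1 reproducibility rate, the unclipped Wald upper bound exceeds 100%, which is why the sketch clips it. A Wilson or exact interval would be a common alternative in that situation.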
