What does this research mean for the field?

Gemini generated more accurate responses to patient questions about ulnar collateral ligament repair than ChatGPT or Grok, but all three models produced content above the reading level recommended for patient education materials. Novelty: confirmatory. Consensus alignment: neutral.

What question did this study set out to answer?

The aim is to assess the accuracy and readability of artificial intelligence responses to patient questions regarding ulnar collateral ligament repair.

February 28, 2026

Open Access

Evaluating large language model responses to patient questions on ulnar collateral ligament repair

Key Points

The aim is to assess the accuracy and readability of artificial intelligence responses to patient questions regarding ulnar collateral ligament repair.
Submitted 20 patient questions to ChatGPT, Gemini, and Grok.
Used the ChatGPT Response Rating System (CRRS) and AI Response Metric (AIRM) for accuracy ratings.
Evaluated readability using Flesch-Kincaid Reading Ease and Grade Level metrics.
More than minimal clarification was required for 15% of ChatGPT's answers, 5% of Gemini's, and 40% of Grok's.
Gemini was the most accurate model, with lower CRRS and AIRM scores than ChatGPT and Grok (lower scores indicate better accuracy).
All models provided responses exceeding the 6th grade reading level recommended for patient materials.

Abstract

Background: The incidence of ulnar collateral ligament (UCL) repair continues to increase, so evaluating the accuracy and readability of information about this procedure that is produced by artificial intelligence (AI) models is important. This study assesses AI-generated responses to common patient questions about UCL repair.

Methods: Twenty patient questions frequently encountered in clinical practice were submitted to ChatGPT, Gemini, and Grok. Three fellowship-trained orthopedic surgeons independently rated answer accuracy using the ChatGPT Response Rating System (CRRS) and AI Response Metric (AIRM), which assign scores from 1–5, with lower scores indicating better accuracy. Responses with CRRS >2 were classified as requiring more than minimal clarification. Readability was evaluated using the Flesch-Kincaid Reading Ease (FKRE) and Grade Level (FKGL) metrics. Responses with an FKGL >6 exceeded the American Medical Association (AMA) and National Institutes of Health (NIH) recommended 6th grade reading level for patient education materials.

Results: More than minimal clarification was required for 15% (3/20) of ChatGPT, 5% (1/20) of Gemini, and 40% (8/20) of Grok responses. Gemini (CRRS, 1.5±0.5; AIRM, 1.6±0.5) demonstrated significantly better accuracy than ChatGPT (CRRS, 2.0±0.4; P=0.0002; AIRM, 2.2±0.5; P=0.0001) and Grok (CRRS, 2.1±0.7; P=0.005; AIRM, 2.4±0.8; P=0.002). All responses exceeded the AMA/NIH 6th grade reading level threshold (FKGL >6). Gemini produced the highest FKGL (16.2±2.2), significantly higher than ChatGPT (14.4±1.6, P=0.005) and Grok (14.6±1.7, P=0.017). FKRE did not differ significantly among models (P=0.14).

Conclusions: AI models generated generally accurate information about UCL repair but at reading levels far above the AMA/NIH recommendations. In this study, Gemini was the most accurate model and produced the least readable content.

Level of evidence: III.
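The two readability metrics and the two cutoffs named in the abstract (FKGL >6 and CRRS >2) can be illustrated with a minimal Python sketch. It uses the standard published Flesch and Flesch-Kincaid formulas. The abstract does not say which tool the authors used, how words, sentences, and syllables were counted, or how the three raters' scores were combined per response, so the function names and inputs below are hypothetical and the counts are assumed to be supplied by the caller.

```python
# Illustrative sketch only; not the authors' analysis code.
# Function names are hypothetical. Counts of words, sentences, and
# syllables are assumed to come from some external text tokenizer.
from typing import Sequence


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula (higher means easier to read)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula (approximate US grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def exceeds_recommended_level(fkgl: float, threshold: float = 6.0) -> bool:
    """True when FKGL is above the 6th grade level cited in the abstract."""
    return fkgl > threshold


def share_needing_clarification(crrs_scores: Sequence[float],
                                cutoff: float = 2.0) -> float:
    """Fraction of responses whose CRRS score is above the cutoff.

    Assumes one score per response; how the three raters' scores were
    aggregated is not stated in the abstract.
    """
    flagged = sum(1 for score in crrs_scores if score > cutoff)
    return flagged / len(crrs_scores)


if __name__ == "__main__":
    # Hypothetical counts for one response, chosen only for illustration.
    words, sentences, syllables = 100, 5, 170
    grade = flesch_kincaid_grade(words, sentences, syllables)
    ease = flesch_reading_ease(words, sentences, syllables)
    print(f"FKGL={grade:.2f} FKRE={ease:.2f} "
          f"above 6th grade: {exceeds_recommended_level(grade)}")
```

Taking raw counts as inputs keeps the sketch independent of any particular syllable-counting heuristic, which is where readability tools most often disagree.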

Cite This Study

King et al. studied this question.

synapsesocial.com/papers/69a288170a974eb0d3c040d5 https://doi.org/10.5397/cise.2025.01214
