Background: The incidence of ulnar collateral ligament (UCL) repair continues to increase, so evaluating the accuracy and readability of information about this procedure that is produced by artificial intelligence (AI) models is important. This study assesses AI-generated responses to common patient questions about UCL repair.

Methods: Twenty patient questions frequently encountered in clinical practice were submitted to ChatGPT, Gemini, and Grok. Three fellowship-trained orthopedic surgeons independently rated answer accuracy using the ChatGPT Response Rating System (CRRS) and AI Response Metric (AIRM), which assign scores from 1–5, with lower scores indicating better accuracy. Responses with CRRS >2 were classified as requiring more than minimal clarification. Readability was evaluated using the Flesch-Kincaid Reading Ease (FKRE) and Grade Level (FKGL) metrics. Responses with an FKGL >6 exceeded the American Medical Association (AMA) and National Institutes of Health (NIH) recommended 6th grade reading level for patient education materials.

Results: More than minimal clarification was required for 15% (3/20) of ChatGPT, 5% (1/20) of Gemini, and 40% (8/20) of Grok responses. Gemini (CRRS, 1.5±0.5; AIRM, 1.6±0.5) demonstrated significantly better accuracy than ChatGPT (CRRS, 2.0±0.4; P=0.0002; AIRM, 2.2±0.5; P=0.0001) and Grok (CRRS, 2.1±0.7; P=0.005; AIRM, 2.4±0.8; P=0.002). All responses exceeded the AMA/NIH 6th grade reading level threshold (FKGL >6). Gemini produced the highest FKGL (16.2±2.2), significantly higher than ChatGPT (14.4±1.6, P=0.005) and Grok (14.6±1.7, P=0.017). FKRE did not differ significantly among models (P=0.14).

Conclusions: AI models generated generally accurate information about UCL repair but at reading levels far above the AMA/NIH recommendations. In this study, Gemini was the most accurate model and produced the least readable content.

Level of evidence: III.
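For readers unfamiliar with the readability metrics named in the Methods, the sketch below shows how a grade level and a reading ease score can be computed from sentence, word, and syllable counts. It uses the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. The abstract does not state which tool the authors used, so the function names, the tokenization, and the vowel-group syllable heuristic are illustrative assumptions and not the study's method. Scores from this sketch may differ from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic (an assumption; real tools use better rules)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent ("make"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_counts(text: str) -> tuple[int, int, int]:
    """Return (sentences, words, syllables) using simple regex tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def grade_level(sentences: int, words: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level (standard published coefficients)."""
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59


def reading_ease(sentences: int, words: int, syllables: int) -> float:
    """Flesch Reading Ease (standard published coefficients); higher is easier."""
    return 206.835 - 1.015 * words / sentences - 84.6 * syllables / words


def exceeds_sixth_grade(fkgl: float) -> bool:
    """Apply the FKGL > 6 threshold described in the Methods."""
    return fkgl > 6


if __name__ == "__main__":
    # Hypothetical sample sentence, not taken from the study's responses.
    sample = (
        "Ulnar collateral ligament repair restores stability to the elbow. "
        "Rehabilitation typically progresses through several supervised phases."
    )
    counts = text_counts(sample)
    fkgl = grade_level(*counts)
    print(f"FKGL = {fkgl:.1f}, exceeds 6th grade: {exceeds_sixth_grade(fkgl)}")
    print(f"Reading ease = {reading_ease(*counts):.1f}")
```

Both formulas depend only on average sentence length and average syllables per word, which is why long sentences with polysyllabic medical terms push the grade level well above the 6th grade threshold.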
King et al. (Fri,) studied this question.