What question did this study set out to answer?

This analysis aims to compare the performance of large language models with that of postgraduate year five (PGY-5) residents on the Orthopaedic In-Training Examination.

April 10, 2026

Large Language Models Outperform PGY-5 Residents on the Orthopaedic In-Training Examination: A Comparative Analysis of Six Cutting-Edge Large Language Models

Key Points

Compared performance of ChatGPT and other large language models against PGY-5 resident scores on the orthopaedic examination.
Evaluated accuracy and reasoning quality across multiple metrics.
ChatGPT scored highest in accuracy, matching PGY-5 resident level.
Large language models outperformed resident averages across all metrics.

Abstract

ChatGPT consistently scored the highest in accuracy across all metrics while also maintaining reasoning quality. Compared with resident averages, ChatGPT performed at a postgraduate year five (PGY-5) level, which indicates its potential for integration into orthopaedic clinics, electronic medical records, and surgical planning. Further model development could improve performance on difficult questions, and creating orthopaedic-focused models could enhance these results.

Cite This Study

Dave et al. studied this question.

synapsesocial.com/papers/69d895a86c1944d70ce06c1b https://doi.org/10.5435/jaaos-d-25-01242
