Background: Periprosthetic joint infection (PJI) remains a devastating complication following arthroplasty. Systematic reviews of PJI provide essential evidence to inform clinical practice; however, the screening process remains labor-intensive. Recent advancements in large language models (LLMs) offer potential for automating literature screening, though evaluation of current-generation models is needed. Methods: This validation study evaluated GPT-5, GPT-5 Pro, and Gemini 2.5 Pro in replicating the title/abstract and full-text screening stages of a published systematic review on intraosseous versus intravenous antibiotic prophylaxis in total joint arthroplasty. Title/abstract screening was performed on 165 articles, followed by a full-text eligibility assessment of 26 articles. Accuracy, sensitivity, specificity, and Cohen’s kappa (κ) were calculated against human screening decisions as the gold standard. Results: In title/abstract screening, GPT-5 Pro achieved the highest accuracy (92.1–92.7%) and specificity (98.6–99.3%), while GPT-5 demonstrated the highest sensitivity (84.6–96.1%). In full-text screening, Gemini 2.5 Pro showed the most consistent performance across repeated evaluations (κ = 0.839 in both trials), whereas GPT-5 Pro exhibited marked intra-model variability (κ = 0.399 to 0.920). Conclusions: Current-generation LLMs achieve near-human accuracy in systematic review screening for PJI research, though substantial intra-model variability underscores the continued need for human oversight in systematic review workflows.
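The four agreement measures named in the Methods can be computed from paired include/exclude decisions. The following is a minimal sketch under stated assumptions, not the authors' analysis code: the function name `screening_metrics`, the `ScreeningMetrics` container, and the ten toy decisions are all hypothetical and are not data from the study. Human decisions are treated as the reference standard, as in the abstract.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScreeningMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    kappa: float


def screening_metrics(human: Sequence[bool], model: Sequence[bool]) -> ScreeningMetrics:
    """Compare model decisions with human decisions (True = include)."""
    if len(human) != len(model) or len(human) == 0:
        raise ValueError("need two non-empty decision lists of equal length")

    # Confusion-matrix counts, with the human decision as the reference.
    tp = sum(1 for h, m in zip(human, model) if h and m)
    tn = sum(1 for h, m in zip(human, model) if not h and not m)
    fp = sum(1 for h, m in zip(human, model) if not h and m)
    fn = sum(1 for h, m in zip(human, model) if h and not m)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_include = ((tp + fn) / n) * ((tp + fp) / n)
    p_exclude = ((tn + fp) / n) * ((tn + fn) / n)
    p_chance = p_include + p_exclude
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else float("nan")

    return ScreeningMetrics(accuracy, sensitivity, specificity, kappa)


# Toy decisions for illustration only (hypothetical, not study data).
toy_human = [True, True, True, True, False, False, False, False, False, False]
toy_model = [True, True, True, False, True, False, False, False, False, False]
toy_result = screening_metrics(toy_human, toy_model)
```

Because kappa subtracts the agreement expected by chance, it can differ sharply between runs even when raw accuracy looks similar, which is one reason to report it alongside accuracy when comparing repeated evaluations of the same model.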
Woojin Shin
Jaeyoung Hong
Sunwoo Lee
Journal of Clinical Medicine
Chosun University
Chosun University Hospital
Shin et al. studied this question.
www.synapsesocial.com/papers/69d896566c1944d70ce07a4f — DOI: https://doi.org/10.3390/jcm15082830