Background: Periprosthetic joint infection (PJI) remains a devastating complication following arthroplasty. Systematic reviews of PJI provide essential evidence to inform clinical practice; however, the screening process remains labor-intensive. Recent advancements in large language models (LLMs) offer potential for automating literature screening, though evaluation of current-generation models is needed.

Methods: This validation study evaluated GPT-5, GPT-5 Pro, and Gemini 2.5 Pro in replicating the title/abstract and full-text screening stages of a published systematic review on intraosseous versus intravenous antibiotic prophylaxis in total joint arthroplasty. Title/abstract screening was performed on 165 articles, followed by a full-text eligibility assessment of 26 articles. Accuracy, sensitivity, specificity, and Cohen’s kappa (κ) were calculated against human screening decisions as the gold standard.

Results: In title/abstract screening, GPT-5 Pro achieved the highest accuracy (92.1–92.7%) and specificity (98.6–99.3%), while GPT-5 demonstrated the highest sensitivity (84.6–96.1%). In full-text screening, Gemini 2.5 Pro showed the most consistent performance across repeated evaluations (κ = 0.839 in both trials), whereas GPT-5 Pro exhibited marked intra-model variability (κ = 0.399 to 0.920).

Conclusions: Current-generation LLMs achieve near-human accuracy in systematic review screening for PJI research, though substantial intra-model variability underscores the continued need for human oversight in systematic review workflows.
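
The four agreement measures named in the Methods (accuracy, sensitivity, specificity, and Cohen's κ) can all be derived from a 2×2 table of model decisions against human decisions. The following is a minimal sketch of that calculation, not the authors' code: the function name `screening_metrics`, the boolean encoding (True = include), and the toy decision lists are assumptions made for illustration and do not reproduce the study's data.

```python
import math


def screening_metrics(human, model):
    """Compare model include/exclude decisions against human decisions.

    Hypothetical helper: `human` and `model` are equal-length lists of
    booleans, where True means "include" and the human list is treated
    as the gold standard.
    """
    if len(human) != len(model) or not human:
        raise ValueError("decision lists must be non-empty and of equal length")

    # Cells of the 2x2 confusion table.
    tp = sum(1 for h, m in zip(human, model) if h and m)
    tn = sum(1 for h, m in zip(human, model) if not h and not m)
    fp = sum(1 for h, m in zip(human, model) if not h and m)
    fn = sum(1 for h, m in zip(human, model) if h and not m)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    specificity = tn / (tn + fp) if (tn + fp) else math.nan

    # Cohen's kappa: observed agreement corrected for chance agreement,
    # where chance agreement comes from the marginal include/exclude rates.
    p_observed = accuracy
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (
        (p_observed - p_expected) / (1 - p_expected)
        if p_expected != 1
        else math.nan
    )

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


# Toy example (made-up decisions for ten articles, not study data).
human_decisions = [True] * 4 + [False] * 6
model_decisions = [True, True, True, False,
                   False, False, False, False, False, True]
print(screening_metrics(human_decisions, model_decisions))
```

Running the same function on two repeated trials of one model and comparing the resulting κ values is one simple way to look at the intra-model variability described in the Results.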
Shin et al. studied this question.
www.synapsesocial.com/papers/69d896566c1944d70ce07a4f — DOI: https://doi.org/10.3390/jcm15082830
Woojin Shin
Jaeyoung Hong
Sunwoo Lee
Journal of Clinical Medicine
Chosun University
Chosun University Hospital