What question did this study set out to answer?

This study evaluates how well large language models (GPT-5, GPT-5 Pro, and Gemini 2.5 Pro) replicate human title/abstract and full-text screening decisions for a systematic review in periprosthetic joint infection research.

April 10, 2026

Open Access

Efficacy of Large Language Models for Screening of Systematic Reviews on Periprosthetic Joint Infection

Key Points

Evaluated GPT-5, GPT-5 Pro, and Gemini 2.5 Pro for title/abstract and full-text screening.
Performed title/abstract screening on 165 articles and full-text screening on 26 articles.
Calculated accuracy, sensitivity, specificity, and Cohen’s kappa against human decisions.
In title/abstract screening, GPT-5 Pro achieved the highest accuracy (92.1–92.7%) and specificity (98.6–99.3%).
In title/abstract screening, GPT-5 demonstrated the highest sensitivity (84.6–96.1%).
In full-text screening, Gemini 2.5 Pro was the most consistent model across repeated evaluations (κ = 0.839 in both trials).
In full-text screening, GPT-5 Pro showed marked intra-model variability (κ = 0.399 to 0.920).

Abstract

Background: Periprosthetic joint infection (PJI) remains a devastating complication following arthroplasty. Systematic reviews of PJI provide essential evidence to inform clinical practice; however, the screening process remains labor-intensive. Recent advancements in large language models (LLMs) offer potential for automating literature screening, though evaluation of current generation models is needed.

Methods: This validation study evaluated GPT-5, GPT-5 Pro, and Gemini 2.5 Pro in replicating the title/abstract and full-text screening stages of a published systematic review on intraosseous versus intravenous antibiotic prophylaxis in total joint arthroplasty. Title/abstract screening was performed on 165 articles, followed by a full-text eligibility assessment of 26 articles. Accuracy, sensitivity, specificity, and Cohen’s kappa (κ) were calculated against human screening decisions as the gold standard.

Results: In title/abstract screening, GPT-5 Pro achieved the highest accuracy (92.1–92.7%) and specificity (98.6–99.3%), while GPT-5 demonstrated the highest sensitivity (84.6–96.1%). In full-text screening, Gemini 2.5 Pro showed the most consistent performance across repeated evaluations (κ = 0.839 in both trials), whereas GPT-5 Pro exhibited marked intra-model variability (κ = 0.399 to 0.920).

Conclusions: Current-generation LLMs achieve near-human accuracy in systematic review screening for PJI research, though substantial intra-model variability underscores the continued need for human oversight in systematic review workflows.
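For readers unfamiliar with the four reported metrics, the following is a minimal sketch of how accuracy, sensitivity, specificity, and Cohen’s kappa could be computed from paired include/exclude decisions. It is not the authors’ code. The function name `screening_metrics`, the `ScreeningMetrics` container, and the toy decision lists are all hypothetical, and the toy data are invented for illustration and are not taken from the study.

```python
import math
from dataclasses import dataclass


@dataclass
class ScreeningMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    kappa: float


def screening_metrics(reference, predicted):
    """Compare two equal-length lists of booleans (True = include).

    `reference` plays the role of the human (gold-standard) decisions,
    `predicted` the role of the model decisions. Hypothetical helper.
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)          # both include
    tn = sum(1 for r, p in pairs if not r and not p)  # both exclude
    fp = sum(1 for r, p in pairs if not r and p)      # model includes only
    fn = sum(1 for r, p in pairs if r and not p)      # model excludes only
    n = len(pairs)

    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    specificity = tn / (tn + fp) if (tn + fp) else math.nan

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (
        (p_observed - p_expected) / (1 - p_expected)
        if p_expected != 1
        else math.nan
    )
    return ScreeningMetrics(accuracy, sensitivity, specificity, kappa)


# Toy data only (NOT from the study): 10 articles, True = include.
human = [True, True, True, True, False, False, False, False, False, False]
model = [True, True, True, False, True, False, False, False, False, False]

metrics = screening_metrics(human, model)
print(metrics)
```

Because Cohen’s kappa is symmetric in its two raters, the same function could also be applied to two repeated runs of one model to quantify intra-model agreement; in that case only the kappa and accuracy fields are meaningful, since sensitivity and specificity assume one list is the reference.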

Authors

Woojin Shin

Jaeyoung Hong

Sunwoo Lee

Journals

Journal of Clinical Medicine

Institutions

Chosun University

Chosun University Hospital
