Objectives This study compares the reliability and accuracy of three large language models (LLMs), Claude, Gemini and Generative Pre-trained Transformer (GPT), in assessing the risk of bias of nonrandomised studies using the Risk Of Bias In Non-randomized Studies of Interventions (ROBINS-I) tool.

Methods and analysis We conducted a secondary analysis of 171 nonrandomised studies previously assessed with the ROBINS-I tool by two independent human review teams. Only studies with concordant human domain-level ratings were included. Each study was independently assessed twice by Claude, Gemini and GPT using agent-based structured implementations of the ROBINS-I tool. Reliability (agreement between two runs of the same LLM) was evaluated using percent agreement and Gwet’s AC1. Accuracy (agreement with human reviewers) was assessed using the same metrics, only for studies on which the two runs of an LLM gave consistent ratings.

Results Claude demonstrated high reliability across all domains (79.5–98.0% agreement, AC1=0.729–0.975). Gemini showed moderate-to-high reliability (76.7–100% agreement, AC1=0.680–1.0). GPT exhibited lower reliability overall, with domain-level agreement ranging from 70.9% to 95.6% (AC1=0.596–0.944). In terms of accuracy, Claude showed overall poor agreement with human reviewers (14.4–68.5% agreement; low AC1 values). Gemini demonstrated moderate-to-high accuracy in several domains, including deviations from intended interventions (79.6%, AC1=0.848) and measurement of outcomes (73.9%, AC1=0.702), with the highest overall agreement (40.0%, AC1=0.672). GPT showed variable accuracy, with the highest values in measurement of outcomes (62.8%, AC1=0.571) and classification of interventions (57.8%, AC1=0.498), but poor performance in selection (14.3%, AC1=−0.041) and overall agreement (23.0%, AC1=0.267).

Conclusions Claude was internally consistent but poorly aligned with human reviewers. Gemini achieved both high reliability and moderate-to-high accuracy, whereas GPT had lower reliability and mixed accuracy. Current off-the-shelf LLMs cannot reliably perform ROBINS-I risk of bias assessments.
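
For readers unfamiliar with the two agreement metrics named in the methods, the following is a minimal sketch of how percent agreement and Gwet's AC1 can be computed for two sets of ratings. It is not the authors' code. It assumes the standard unweighted two-rater form of AC1 (the abstract does not say whether a weighted variant was used), and the function names, category labels and example ratings are all hypothetical.

```python
from collections import Counter

# Assumed ROBINS-I judgement labels, for illustration only.
CATEGORIES = ["low", "moderate", "serious", "critical"]


def percent_agreement(run1, run2):
    """Share of items on which the two sets of ratings are identical."""
    if len(run1) != len(run2) or not run1:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(run1, run2)) / len(run1)


def gwet_ac1(run1, run2, categories=CATEGORIES):
    """Unweighted Gwet's AC1 for two raters (or two runs of one rater)."""
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")
    if any(r not in categories for r in list(run1) + list(run2)):
        raise ValueError("rating outside the declared categories")

    p_a = percent_agreement(run1, run2)  # observed agreement
    n = len(run1)

    # pi_k: average share of ratings in category k across both raters.
    counts = Counter(run1) + Counter(run2)
    pi = [counts[c] / (2 * n) for c in categories]

    # Chance agreement under Gwet's model.
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


# Hypothetical ratings for six studies from two runs of the same model.
first_run = ["low", "moderate", "serious", "serious", "critical", "moderate"]
second_run = ["low", "moderate", "serious", "moderate", "critical", "moderate"]

agreement = percent_agreement(first_run, second_run)  # 5 of 6 items match
ac1 = gwet_ac1(first_run, second_run)                 # 43/55, about 0.782
```

In this sketch the same function serves both uses described above: passing two runs of one model gives a reliability estimate, and passing a model's ratings alongside the human ratings gives an accuracy estimate.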
Wang et al. studied this question.
www.synapsesocial.com/papers/69f6e6ab8071d4f1bdfc772f — DOI: https://doi.org/10.1136/bmjdh-2026-000034
Zhen Wang
M Hassan Murad
Tamim Rajjo
SHILAP Revista de lepidopterología
Mayo Clinic
Mayo Clinic in Arizona
Mayo Clinic in Florida