What question did this study set out to answer?

The aim is to compare the reliability and accuracy of three language models in assessing the risk of bias using the ROBINS-I tool.

May 3, 2026 · Open Access

Comparison of three large language models' ability to assess the risk of bias using ROBINS-I tool

Key Points

The aim is to compare the reliability and accuracy of three language models in assessing the risk of bias using the ROBINS-I tool.
The authors conducted a secondary analysis of 171 nonrandomised studies previously assessed using the ROBINS-I tool.
Only studies with concordant human ratings were included; each was independently assessed twice by three language models (Claude, Gemini, GPT).
Reliability and accuracy were evaluated using percent agreement and Gwet's AC1 metrics.
Claude showed high reliability (79.5–98.0% agreement, AC1=0.729–0.975) but poor agreement with humans (14.4–68.5%).
Gemini exhibited moderate-to-high reliability (76.7–100% agreement, AC1=0.680–1.0) and moderate-to-high accuracy in several domains (up to 79.6% agreement, for deviations from intended interventions).
GPT demonstrated lower reliability (70.9–95.6%) and mixed accuracy, performing best in measurement of outcomes (62.8%).

Abstract

Objectives This study aims to compare the reliability and accuracy of three large language models (LLMs) (Claude, Gemini and GPT) in assessing the risk of bias of nonrandomised studies using the ROBINS-I tool.

Methods and analysis We conducted a secondary analysis of 171 nonrandomised studies previously assessed with the Risk Of Bias In Non-randomized Studies of Interventions (ROBINS-I) tool by two independent human review teams. Only studies with concordant human domain-level ratings were included. Each study was independently assessed twice by Claude, Gemini and Generative Pre-trained Transformer (GPT) using agent-based structured implementations of the ROBINS-I tool. Reliability (agreement between two runs of the same LLM) was evaluated using percent agreement and Gwet’s AC1. Accuracy (agreement with human reviewers) was assessed only for studies with consistent LLM ratings, using the same metrics.

Results Claude demonstrated high reliability across all domains (79.5–98.0% agreement, AC1=0.729–0.975). Gemini showed moderate-to-high reliability (agreement 76.7–100%, AC1=0.680–1.0). GPT exhibited lower reliability overall, though domain-level agreement ranged from 70.9–95.6% (AC1=0.596–0.944). In terms of accuracy, Claude showed overall poor agreement with human reviewers (14.4–68.5% agreement; low AC1 values). Gemini demonstrated moderate-to-high accuracy in several domains, including deviations from intended interventions (79.6%, AC1=0.848) and measurement of outcomes (73.9%, AC1=0.702), with the highest overall agreement (40.0%, AC1=0.672). GPT showed variable accuracy, with the highest in measurement of outcomes (62.8%, AC1=0.571) and classification of interventions (57.8%, AC1=0.498), but poor performance in selection (14.3%, AC1=−0.041) and overall agreement (23.0%, AC1=0.267).

Conclusions Claude was internally consistent but poorly aligned with human reviewers. Gemini achieved both high reliability and moderate-to-high accuracy, whereas GPT had lower reliability and mixed accuracy. Current off-the-shelf LLMs cannot reliably perform ROBINS-I risk of bias assessments.
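As an illustration of the two metrics named in the Methods, the following is a minimal Python sketch of percent agreement and the unweighted, two-rater form of Gwet's AC1. It is not the authors' code. The abstract does not state whether an unweighted or weighted variant was used, so values from this sketch should not be expected to match the reported ones. The function names, the four-level rating scale and the example ratings are all assumptions made for illustration.

```python
from collections import Counter

# Assumed rating scale for illustration; the study's exact category set is not given in the abstract.
CATEGORIES = ["low", "moderate", "serious", "critical"]

def percent_agreement(a, b):
    """Share of items on which two rating lists give the same category."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def gwet_ac1(a, b, categories=CATEGORIES):
    """Unweighted Gwet's AC1 for two raters (or two runs of one model)."""
    n, q = len(a), len(categories)
    pa = percent_agreement(a, b)
    # pi_k: average share of ratings falling in category k across both raters.
    counts = Counter(a) + Counter(b)
    pi = [counts[k] / (2 * n) for k in categories]
    # Chance agreement under Gwet's model.
    pe = sum(p * (1 - p) for p in pi) / (q - 1)
    return (pa - pe) / (1 - pe)

def consistent_only(run1, run2, human):
    """Keep items where both LLM runs agree, paired with the human rating.

    Mirrors the abstract's rule that accuracy is assessed only for
    studies with consistent LLM ratings.
    """
    return [(r1, h) for r1, r2, h in zip(run1, run2, human) if r1 == r2]

# Hypothetical ratings for four studies in one domain.
run1 = ["low", "low", "moderate", "serious"]
run2 = ["low", "moderate", "moderate", "serious"]
human = ["low", "moderate", "serious", "serious"]

reliability_pa = percent_agreement(run1, run2)
reliability_ac1 = gwet_ac1(run1, run2)
pairs = consistent_only(run1, run2, human)
accuracy_pa = percent_agreement([p[0] for p in pairs], [p[1] for p in pairs])
```

In this toy example the two runs agree on three of four studies, and accuracy is then computed on those three only. This shows how a model can be reliable between runs and still disagree with human reviewers.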

Cite this study

Wang et al. (2026) studied this question.

www.synapsesocial.com/papers/69f6e6ab8071d4f1bdfc772f — DOI: https://doi.org/10.1136/bmjdh-2026-000034

Authors

Zhen Wang

M Hassan Murad

Tamim Rajjo

Journals

SHILAP Revista de lepidopterología

Institutions

Mayo Clinic

Mayo Clinic in Arizona

Mayo Clinic in Florida
