What question did this study set out to answer?

This study aims to evaluate the reliability and consistency of large language models in assessing academic abstracts compared to human reviewers.

May 8, 2026 | Open Access

Evaluating large language models for abstract evaluation tasks: an empirical study

Key points

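The study compared three LLMs (ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5) with human reviewers on how reliably and consistently they score conference abstracts.
The three LLMs each graded all 160 conference abstracts independently, while 14 human reviewers each graded a subset using the same eight-criterion rubric (1–5 scale).
Scoring patterns were compared with boxplots, inter-rater reliability was measured with intraclass correlation coefficients (ICCs), and systematic bias was examined with Bland-Altman plots.
Inter-rater reliability was calculated both among the LLMs and between the LLMs and the human reviewers.
The LLMs showed good-to-excellent agreement with one another across all criteria (ICCs: 0.59–0.87).
ChatGPT and Claude showed moderate agreement with human reviewers on composite score, impression, clarity, objective, and results (ICCs: 0.45–0.60), but only fair agreement on the subjective criteria of impact, engagement, and applicability (ICCs: 0.23–0.38).
Gemini agreed less well with human reviewers, with fair agreement on half the criteria and poor reliability on impact and applicability. Systematic bias relative to human mean ratings was acceptable or negligible for all three models (mean differences: 0.24 for ChatGPT, 0.42 for Gemini, −0.02 for Claude).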
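
The ICC values quoted above can be illustrated with a small calculation. The sketch below assumes a two-way random-effects, absolute-agreement, single-rater ICC, often written ICC(2,1); the page does not state which ICC form the authors used, so this choice is an assumption. It also assumes a complete table in which every rater scored every abstract, whereas in the study each human reviewer graded only a subset. The function name `icc2_1` and the example scores are hypothetical.

```python
from typing import Sequence


def icc2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a complete table: one row per abstract, one column per rater.
    Assumption: this ICC form is chosen for illustration only.
    """
    n = len(ratings)      # number of abstracts (rows)
    k = len(ratings[0])   # number of raters (columns)
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 rows and >= 2 columns")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between abstracts
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_error = ss_total - ss_rows - ss_cols                  # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


if __name__ == "__main__":
    # Made-up 1-5 rubric scores: 5 abstracts (rows) x 3 raters (columns).
    scores = [
        [4, 4, 5],
        [2, 3, 2],
        [5, 4, 5],
        [3, 3, 4],
        [1, 2, 2],
    ]
    print(f"ICC(2,1) = {icc2_1(scores):.2f}")
```

Because this form measures absolute agreement, a rater who is consistently more lenient than the others lowers the ICC even when the ranking of abstracts is identical.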

Abstract

Introduction: Large language models (LLMs) show great promise as tools for assisting scientific peer review, but their agreement with human experts in quantitative assessment of academic content needs further investigation. This study examined the consistency and reliability of ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5 in evaluating conference abstracts, compared with one another and with human reviewers.

Methods: Three LLMs independently graded 160 abstracts from a regional conference, while 14 human reviewers each assessed a subset using an identical rubric with eight criteria scored on a 1–5 scale. We compared AI and human scoring patterns using boxplots, calculated intraclass correlation coefficients (ICCs) for inter-rater reliability both among LLMs and between humans and LLMs, and examined Bland-Altman plots to identify agreement patterns and systematic bias.

Results: The three LLMs demonstrated high internal consistency, with narrow interquartile ranges and few outliers in composite scores, while human reviewers exhibited greater scoring variability. The LLMs also achieved good-to-excellent agreement with each other across all criteria (ICCs: 0.59–0.87). ChatGPT and Claude reached moderate agreement with human reviewers on overall quality and content-specific criteria (ICCs: 0.45–0.60 for composite score, impression, clarity, objective, and results). Their concordance with humans was only fair on subjective dimensions, with ICCs ranging from 0.23 to 0.38 for impact, engagement, and applicability. Gemini performed notably worse, showing fair agreement on half the criteria and poor reliability on impact and applicability. Bland-Altman analysis revealed acceptable or negligible systematic bias, with mean differences of 0.24 (ChatGPT), 0.42 (Gemini), and −0.02 (Claude) from human mean ratings.

Discussion: With appropriate model selection, LLMs could reach moderate agreement with human experts on overall abstract quality and objective criteria, supporting their potential use for pre-screening low-quality submissions or serving as additional reviewers. Their ability to apply rubrics consistently across large volumes of abstracts offers advantages in efficiency and standardization beyond what human reviewers can feasibly provide. However, LLMs' reduced performance on subjective dimensions indicates that they should complement rather than replace human judgment in abstract evaluation, with expert review remaining essential for comprehensive assessment.
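
The mean differences reported from the Bland-Altman analysis can be read as the average gap between an LLM's score and the human score for the same abstract. The sketch below computes that bias and the conventional 95% limits of agreement (bias plus or minus 1.96 sample standard deviations of the differences). It assumes differences are taken as LLM minus human; the page does not state the sign convention, and the function name `bland_altman` and the example scores are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def bland_altman(
    llm: Sequence[float], human: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) for paired scores.

    Assumption: differences are computed as LLM minus human, so a positive
    bias means the LLM scored higher on average.
    """
    if len(llm) != len(human) or len(llm) < 2:
        raise ValueError("need two equally long score lists with >= 2 pairs")

    diffs = [a - b for a, b in zip(llm, human)]
    bias = mean(diffs)   # systematic difference between the two raters
    sd = stdev(diffs)    # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


if __name__ == "__main__":
    # Made-up composite scores for six abstracts.
    llm_scores = [3.9, 4.1, 2.8, 3.5, 4.4, 3.0]
    human_scores = [3.5, 4.0, 3.0, 3.1, 4.2, 2.6]
    bias, lower, upper = bland_altman(llm_scores, human_scores)
    print(f"bias = {bias:.2f}, limits of agreement = [{lower:.2f}, {upper:.2f}]")
```

A bias near zero, such as the −0.02 reported for Claude, indicates little systematic over- or under-scoring, but the width of the limits of agreement still matters for judging agreement on individual abstracts.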

Cite this study

Liu et al. (2026) studied this question.

www.synapsesocial.com/papers/69fd7cd4bfa21ec5bbf05b19 — DOI: https://doi.org/10.3389/frma.2026.1807672

Authors

Yinuo Liu

Emre Sezgin

Eric A. Youngstrom

Journals

Frontiers in Research Metrics and Analytics

Institutions

The Ohio State University

Nationwide Children's Hospital
