What question did this study set out to answer?

The research aims to establish a governance framework for large language models to manage human oversight in literature reviews effectively.

February 2, 2026 | Open Access

Stage-Aware Governance of Large Language Models: Managing Uncertainty and Human Oversight in AI-Assisted Literature Review Systems

Key Points

Evaluated three large language models in a controlled two-stage literature review workflow.
Conducted title-and-abstract screening and eligibility assessment using fixed inclusion criteria.
Outputs were benchmarked against expert consensus with reproducible conditions and standardized prompts.
LLMs matched expert decisions during screening with high precision (0.83–0.91) and F1 (up to 0.89).
Performance dropped during eligibility assessment (F1 0.58–0.65), indicating higher uncertainty.
Disagreements were more prevalent in borderline cases, underscoring the need for structured oversight.
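
The agreement metrics reported above (precision and F1, plus the Cohen's κ quoted in the abstract) can be computed from paired include/exclude decisions. The following is a minimal sketch, assuming binary labels where 1 means "include" and 0 means "exclude". The function names and the toy labels are illustrative assumptions, not the study's code or data.

```python
from typing import Sequence, Tuple


def precision_recall_f1(expert: Sequence[int], model: Sequence[int]) -> Tuple[float, float, float]:
    """Score model decisions against expert consensus (1 = include, 0 = exclude)."""
    tp = sum(1 for e, m in zip(expert, model) if e == 1 and m == 1)
    fp = sum(1 for e, m in zip(expert, model) if e == 0 and m == 1)
    fn = sum(1 for e, m in zip(expert, model) if e == 1 and m == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two raters giving binary labels to the same items."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    p_a = sum(a) / n  # share of "include" decisions by rater a
    p_b = sum(b) / n  # share of "include" decisions by rater b
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1.0:
        return 1.0  # both raters always give the same single label
    return (observed - expected) / (1 - expected)


# Toy labels invented for illustration only (not the study's data).
expert_labels = [1, 1, 1, 0, 0, 0, 1, 0]
model_labels = [1, 1, 0, 0, 0, 1, 1, 0]

precision, recall, f1 = precision_recall_f1(expert_labels, model_labels)
kappa = cohen_kappa(expert_labels, model_labels)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f} kappa={kappa:.2f}")
```

Because κ subtracts the agreement expected by chance, it is usually lower than raw agreement.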

Abstract

This study proposes a stage-aware governance framework for large language models (LLMs) that structures human oversight and accountability across the decision stages of AI-assisted literature review systems. LLMs are increasingly embedded in systematic review workflows, yet how oversight and accountability should be structured at each decision stage remains unclear. The study evaluates three LLMs in a controlled two-stage literature review workflow (title-and-abstract screening followed by eligibility assessment) using identical evidence inputs and fixed inclusion criteria, with outputs benchmarked against expert consensus under fully reproducible conditions with standardized prompts and comprehensive logging. While LLMs closely matched expert decisions during screening (precision 0.83–0.91; F1 up to 0.89; Cohen’s κ 0.65–0.85), performance degraded substantially at the eligibility stage (F1 0.58–0.65; κ 0.52–0.62), indicating increased epistemic uncertainty when fine-grained criteria must be inferred from abstract-level information. Importantly, disagreements clustered in borderline cases rather than reflecting random error. This supports a stage-aware governance approach in which LLMs automate high-throughput screening, while inter-model disagreement is operationalized as an actionable uncertainty signal that triggers human oversight in more consequential decision stages. These findings highlight the need for explicit oversight thresholds, responsibility allocation, and auditability in the responsible deployment of AI-assisted decision systems for evidence synthesis.
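
The abstract's idea of treating inter-model disagreement as an uncertainty signal that triggers human oversight could be sketched as a simple routing rule. This is a hypothetical illustration, not the authors' implementation. The per-stage agreement thresholds, the `route` function, and the label strings are all assumptions, chosen only to show the stricter rule applying at the eligibility stage.

```python
from collections import Counter
from fractions import Fraction
from typing import Sequence

# Hypothetical thresholds: minimum share of models that must agree before a
# decision is automated. The source calls for explicit oversight thresholds
# but does not state these values.
MIN_AGREEMENT = {
    "screening": Fraction(2, 3),   # assumed: a majority of three models suffices
    "eligibility": Fraction(1, 1), # assumed: unanimity required
}


def route(stage: str, votes: Sequence[str]) -> str:
    """Return 'auto_<label>' when models agree enough, else 'human_review'."""
    if not votes:
        raise ValueError("at least one model vote is required")
    label, count = Counter(votes).most_common(1)[0]
    agreement = Fraction(count, len(votes))
    if agreement >= MIN_AGREEMENT[stage]:
        return f"auto_{label}"
    return "human_review"


# Example: the same split vote is automated at screening but escalated to a
# human reviewer at the eligibility stage.
split_vote = ["include", "include", "exclude"]
print(route("screening", split_vote))
print(route("eligibility", split_vote))
```

Using exact fractions avoids floating-point edge cases when a vote share sits exactly on a threshold. In practice each routed decision would also be logged to support the auditability the abstract calls for.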

Cite this study

Kim et al. studied this question.

www.synapsesocial.com/papers/6980ffb4c1c9540dea81272a — DOI: https://doi.org/10.3390/systems14020153

Authors

Junic Kim

Haeyong Shin
