What type of study is this?

This is a quantitative study.

October 13, 2025 (Open Access)

When an LLM is apprehensive about its answers -- and when its uncertainty is justified

Key Points

Token-wise entropy serves as an effective predictor of model errors in knowledge-dependent domains.
In experiments with three LLMs of varying sizes, how well entropy predicted model errors varied by domain, reaching a ROC AUC of 0.73 for biology.
Model-as-judge performs similarly to random predictors, highlighting the need for refinement in this uncertainty estimation method.
For fair LLM assessment, data-uncertainty-related entropy should be integrated into uncertainty estimation frameworks, and benchmark samples should balance the amount of reasoning required across subdomains.
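As a minimal sketch of the entropy signal named in the key points: this page does not give the study's exact formula, so the code below assumes the entropy is the Shannon entropy of the model's probability distribution over the answer options at the answer position. The function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer_entropy(option_logits):
    """Shannon entropy (in nats) of the distribution over answer options.

    Assumption: option_logits holds one logit per answer option
    (e.g. the tokens "A", "B", "C", "D") at the answer position.
    """
    probs = softmax(option_logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits: one confident answer, one maximally unsure answer.
confident = answer_entropy([9.0, 0.5, 0.2, 0.1])
unsure = answer_entropy([1.0, 1.0, 1.0, 1.0])
```

Under this assumption, a uniform distribution over four options gives the maximum entropy, ln 4, and a sharply peaked distribution gives a value close to zero.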

Abstract

Uncertainty estimation is crucial for evaluating Large Language Models (LLMs), particularly in high-stakes domains where incorrect answers have significant consequences. Numerous approaches address this problem but focus on one specific type of uncertainty and ignore others. We investigate which estimates, specifically token-wise entropy and model-as-judge (MASJ), work for multiple-choice question-answering tasks across different question topics. Our experiments consider three LLMs, Phi-4, Mistral, and Qwen, in sizes from 1.5B to 72B, and 14 topics. While MASJ performs similarly to a random error predictor, the response entropy predicts model error in knowledge-dependent domains and serves as an effective indicator of question difficulty: for biology, ROC AUC is 0.73. This correlation vanishes for the reasoning-dependent domain: for math questions, ROC AUC is 0.55. More fundamentally, we found that the entropy measure depends on the amount of reasoning a question requires. Thus, data-uncertainty-related entropy should be integrated into uncertainty estimation frameworks, while MASJ requires refinement. Moreover, existing MMLU-Pro samples are biased and should balance the amount of reasoning required across subdomains to provide a fairer assessment of LLM performance.
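The ROC AUC figures in the abstract measure how well an uncertainty score ranks wrong answers above right ones. The sketch below illustrates that metric only, using the standard pairwise (Mann-Whitney) definition. It is not the authors' evaluation code, and the entropies and error labels are made-up toy values.

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that a randomly chosen positive
    (label 1) has a higher score than a randomly chosen negative
    (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): per-question entropy and whether the model erred.
entropies = [0.10, 0.20, 0.90, 1.20, 0.30, 1.00]
is_error = [0, 0, 1, 1, 1, 0]
auc = roc_auc(entropies, is_error)
```

A value of 0.5 corresponds to a random predictor, which is the baseline that the MASJ result and the 0.55 reported for math questions are compared against. A value of 1.0 would mean every wrong answer had higher entropy than every right one.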


Cite this study

Sychev et al. studied this question.

www.synapsesocial.com/papers/68ece2abd1bb2827d1297344 — DOI: https://doi.org/10.48550/arxiv.2503.01688

Authors

Petr Sychev

Andrey Goncharov

Daniil Vyazhev
