What question did this study set out to answer?

The study asks whether teacher-forced confidence, the token-level probability a causal language model assigns to its own training text, can serve as a sensor for false beliefs encoded in that text.

February 21, 2026 | Open Access

Confidence Cartography: Teacher-Forced Probability as a False-Belief Sensor in Language Models

Key Points

Ran a scaling study of seven Pythia models ranging from 160M to 12B parameters.
Correlated model confidence ratios on Mandela Effect items with human false-belief prevalence.
Used a truth-detection benchmark to compare accuracy across model sizes.
Evaluated targeted resampling of low-confidence tokens as a way to improve accuracy.
Confidence ratios on Mandela Effect items correlate significantly with human false-belief prevalence (Spearman rho = 0.718, p = 0.006, n = 13 items at 1B).
The signal generalizes out-of-domain to medical misconceptions, with 88% binary classification accuracy at 6.9B (p = 0.01).
Truth-detection accuracy scales monotonically with model size, from 71% at 160M to 92% at 12B.
The signal emerges stably by training step 256 across checkpoints.
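
The rank correlation reported in the key points can be illustrated with a minimal sketch. It assumes one confidence ratio and one human false-belief prevalence value per item. The function names and the toy numbers are hypothetical and are not taken from the paper, and the code computes only rho, not the p-value.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Made-up illustrative values, one pair per item (NOT data from the study).
confidence_ratios = [0.2, 0.9, 0.5, 1.4, 0.7]
false_belief_prevalence = [0.10, 0.55, 0.35, 0.80, 0.30]
print(spearman_rho(confidence_ratios, false_belief_prevalence))
```

Ranks are used instead of raw values, so the statistic measures only whether the two quantities rise and fall together, not whether the relation is linear.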

Abstract

We show that the token-level probabilities a causal language model assigns to its own training text – what we call teacher-forced confidence – function as a practical sensor for false beliefs encoded in that text. Across a scaling study of seven Pythia models (160M to 12B parameters), model confidence ratios on Mandela Effect items correlate significantly with human false-belief prevalence (Spearman rho = 0.718, p = 0.006, n = 13 items at 1B; rho = 0.652, p = 0.016 at 410M and 6.9B). The signal generalizes out-of-domain to medical misconceptions (88% binary classification accuracy at 6.9B, p = 0.01), scales monotonically with model size (71% at 160M to 92% at 12B on a truth-detection benchmark), and emerges stably by training step 256 across checkpoints. We interpret this as evidence that teacher-forced confidence tracks the transmissibility of beliefs in training corpora rather than their factual truth. As a practical application, we show that targeted resampling at low-confidence token positions, rather than uniform best-of-N regeneration, achieves comparable accuracy improvements at 3-5x lower compute cost. These results suggest that internal model probabilities, without any fine-tuning or probing, carry exploitable structure about the epistemic status of encoded claims.
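
As a rough illustration of the quantities in the abstract, the sketch below computes teacher-forced log-probabilities from per-position logits and flags low-confidence positions as candidates for targeted resampling. It assumes the logits come from a causal language model that has been fed the reference text. The function names, the geometric-mean aggregate, and the probability threshold are assumptions made for this example; the abstract does not specify how the paper aggregates confidence or selects positions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def teacher_forced_logprobs(step_logits, token_ids):
    """Log-probability of each actual next token under teacher forcing.

    step_logits[t] holds the logits predicting token_ids[t + 1] given
    token_ids[0..t], so there is one fewer logit vector than tokens.
    """
    if len(step_logits) != len(token_ids) - 1:
        raise ValueError("expected len(token_ids) - 1 logit vectors")
    return [
        log_softmax(step_logits[t])[token_ids[t + 1]]
        for t in range(len(step_logits))
    ]


def mean_confidence(logprobs):
    """Geometric-mean token probability (an assumed aggregate, not the paper's)."""
    return math.exp(sum(logprobs) / len(logprobs))


def low_confidence_positions(logprobs, threshold):
    """Indices into token_ids whose probability falls below an assumed threshold."""
    return [t + 1 for t, lp in enumerate(logprobs) if math.exp(lp) < threshold]


# Toy example with a 3-token vocabulary; the logits are made up.
toy_logits = [[0.0, 0.0, 0.0], [math.log(8.0), 0.0, 0.0]]
toy_tokens = [0, 1, 0]
toy_logprobs = teacher_forced_logprobs(toy_logits, toy_tokens)
print(mean_confidence(toy_logprobs))
print(low_confidence_positions(toy_logprobs, threshold=0.5))
```

Restricting regeneration to the flagged positions, instead of regenerating whole outputs N times, is the idea behind the compute savings the abstract describes. The resampling step itself is not shown here.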

Cite This Study

Bryan Sanchez studied this question.

synapsesocial.com/papers/69994cd2873532290d021a8b https://doi.org/10.5281/zenodo.18703505
