We show that the token-level probabilities a causal language model assigns to its own training text, which we call teacher-forced confidence, function as a practical sensor for false beliefs encoded in that text. In a scaling study of seven Pythia models (160M to 12B parameters), model confidence ratios on Mandela Effect items correlate significantly with human false-belief prevalence (Spearman rho = 0.718, p = 0.006, n = 13 items at 1B; rho = 0.652, p = 0.016 at 410M and 6.9B). The signal generalizes out-of-domain to medical misconceptions (88% binary classification accuracy at 6.9B, p = 0.01), improves monotonically with model size (from 71% at 160M to 92% at 12B on a truth-detection benchmark), and emerges stably by training step 256 across training checkpoints. We interpret this as evidence that teacher-forced confidence tracks the transmissibility of beliefs in training corpora rather than their factual truth. As a practical application, we show that targeted resampling at low-confidence token positions, rather than uniform best-of-N regeneration, achieves comparable accuracy improvements at 3-5x lower compute cost. These results suggest that internal model probabilities, without any fine-tuning or probing, carry exploitable structure about the epistemic status of encoded claims.
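To make the quantities above concrete, the following is a minimal sketch, not the authors' implementation, of how teacher-forced token probabilities could be read off a model's logits and used to flag low-confidence positions. It assumes the logits are already aligned so that row `t` scores the token at position `t`. The function names, the `threshold` parameter, and the definition of the confidence ratio as a ratio of geometric-mean token probabilities are all hypothetical, since the abstract does not specify them.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def teacher_forced_logprobs(logits, token_ids):
    """Log-probability of each observed token under teacher forcing.

    Assumption: logits has shape (T, V) and logits[t] is the model's
    prediction for token_ids[t] given the true preceding tokens.
    """
    logp = log_softmax(np.asarray(logits, dtype=float))
    return logp[np.arange(len(token_ids)), np.asarray(token_ids)]


def confidence_ratio(logp_a, logp_b):
    """Hypothetical confidence ratio between two phrasings of a claim:
    the ratio of their geometric-mean token probabilities."""
    return float(np.exp(np.mean(logp_a) - np.mean(logp_b)))


def low_confidence_positions(logp, threshold):
    """Indices whose token probability falls below `threshold`.

    These are the candidate positions for targeted resampling;
    the threshold value is an illustrative assumption.
    """
    return np.flatnonzero(np.exp(logp) < threshold)


# Toy example: two positions, vocabulary of two tokens.
toy_logits = np.array([[0.0, 0.0], [np.log(3.0), 0.0]])
variant_a = teacher_forced_logprobs(toy_logits, [0, 0])
variant_b = teacher_forced_logprobs(toy_logits, [1, 1])
ratio = confidence_ratio(variant_a, variant_b)
flagged = low_confidence_positions(variant_a, threshold=0.6)
```

In practice the logits would come from a single forward pass of the language model over the text, shifted by one position so that each row predicts the token it is paired with. Under this sketch, only the flagged positions would be resampled, in contrast to regenerating the whole sequence N times.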
Bryan Sanchez (Thu,) studied this question.