When a language model generates multiple candidate answers, how should we pick the best one? The default strategy - majority voting - treats the model as a black box, discarding everything except final answer strings. We show that the model's internal computations already contain a usable signal for answer quality, and that a remarkably simple method can extract it. We propose trajectory probes: lightweight linear classifiers trained on hidden state features aggregated across the generation process. From each candidate answer, we extract mean, standard deviation, and final-token activations at eight evenly spaced layers, projected to 256 dimensions - a 6,144-dimensional trajectory fingerprint. A logistic regression probe trained with a pairwise ranking objective (RankNet) learns to prefer correct answers over incorrect ones from the same question.
On TriviaQA (Llama-3.1-8B-Instruct, K=4, T=0.3; mean ± std over 3 seeds), the probe reaches 56.4% ± 3.9 versus 51.3% ± 3.9 for majority voting, recovering 58.4% ± 3.0% of the gap to the oracle upper bound, with a selection precision (PickAcc) of 91.2% ± 1.7% on questions where at least one sampled answer is correct. On MATH, gains are smaller and strongly K-dependent: at K_eval=2 the probe improves over majority voting by +2.1 points (3/3 seeds positive), while at the canonical K_eval=4 the improvement narrows to +0.6 ± 1.0 points and is not statistically significant.
Two findings surprised us. First, the choice of training objective can matter more than standalone classifier quality: a binary classifier with higher cross-validated AUC can underperform a pairwise probe with lower AUC, because ranking among candidates is a different task from classifying correctness in isolation. Second, the per-layer signal distribution acts as a domain fingerprint - factual recall spreads information across layers while mathematical reasoning concentrates it in the final third - yet a single probe trained on mixed-domain data can match domain-specific specialists. Our results suggest that the "verifier" for best-of-K selection need not be a separate model or an additional LLM call. It can be a linear function of what the model already computes.
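The pipeline described above (fingerprint, pairwise probe, best-of-K pick) can be illustrated with a minimal, standard-library Python sketch. It is not the author's implementation. Several details are assumptions made only for illustration: the projection is a plain matrix applied after aggregating over tokens, the eight layers are chosen by rounding evenly spaced indices, the probe is fit by full-batch gradient descent on the RankNet logistic loss, and the data in `demo` is synthetic. All function names (`trajectory_fingerprint`, `train_pairwise_probe`, `pick_best`, `demo`) are hypothetical.

```python
import math
import random


def trajectory_fingerprint(hidden, proj, n_layers=8):
    """hidden[l][t] is a d-dim list (layer l, token t); proj holds p rows of length d.

    Returns n_layers * 3 * p features: projected mean, std and final-token
    activations at evenly spaced layers (8 * 3 * 256 = 6144 when p = 256).
    """
    num_layers = len(hidden)
    # Assumed spacing rule: round evenly spaced positions to layer indices.
    picks = [round(i * (num_layers - 1) / (n_layers - 1)) for i in range(n_layers)]
    out = []
    for layer in picks:
        tokens = hidden[layer]
        cols = list(zip(*tokens))  # values of each hidden dimension across tokens
        mean = [sum(c) / len(c) for c in cols]
        std = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
               for c, m in zip(cols, mean)]
        for stat in (mean, std, list(tokens[-1])):
            out.extend(sum(r * s for r, s in zip(row, stat)) for row in proj)
    return out


def train_pairwise_probe(X, y, qid, lr=0.1, epochs=100):
    """Fit a linear scorer w so correct answers outscore incorrect ones per question."""
    pairs = [(i, j) for i in range(len(X)) for j in range(len(X))
             if qid[i] == qid[j] and y[i] == 1 and y[j] == 0]
    if not pairs:
        raise ValueError("need at least one question with mixed outcomes")
    diffs = [[a - b for a, b in zip(X[i], X[j])] for i, j in pairs]
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for d in diffs:
            margin = sum(wk * dk for wk, dk in zip(w, d))
            # Derivative of log(1 + exp(-margin)) w.r.t. margin, in a stable form.
            coef = -0.5 * (1.0 - math.tanh(margin / 2.0))
            for k, dk in enumerate(d):
                grad[k] += coef * dk
        w = [wk - lr * gk / len(diffs) for wk, gk in zip(w, grad)]
    return w


def pick_best(X, qid, w):
    """Return {question id: index of the highest-scoring candidate}."""
    best = {}
    for i, x in enumerate(X):
        score = sum(wk * xk for wk, xk in zip(w, x))
        if qid[i] not in best or score > best[qid[i]][0]:
            best[qid[i]] = (score, i)
    return {q: i for q, (_, i) in best.items()}


def demo(seed=0, n_questions=200, k=4, dim=8):
    """Synthetic check: correctness shifts feature 0; returns held-out PickAcc."""
    rng = random.Random(seed)
    X, y, qid = [], [], []
    for q in range(n_questions):
        for _ in range(k):
            label = int(rng.random() < 0.5)
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            x[0] += 3.0 * label  # made-up "correctness" direction
            X.append(x)
            y.append(label)
            qid.append(q)
    split = 150 * k  # first 150 questions train, the rest evaluate
    w = train_pairwise_probe(X[:split], y[:split], qid[:split])
    X_te, y_te, qid_te = X[split:], y[split:], qid[split:]
    chosen = pick_best(X_te, qid_te, w)
    # PickAcc: among questions with at least one correct candidate,
    # the fraction where the selected candidate is correct.
    solvable = {q for q, label in zip(qid_te, y_te) if label == 1}
    return sum(y_te[chosen[q]] for q in solvable) / len(solvable)


if __name__ == "__main__":
    print(f"synthetic PickAcc: {demo():.3f}")
```

Training on within-question score differences is what separates this from a binary correctness classifier: the loss only looks at whether a correct candidate outranks an incorrect one for the same question, which is the quantity best-of-K selection depends on.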
Nikolay Yudin
Nikolay Yudin (Sun,) studied this question.
www.synapsesocial.com/papers/6994058c4e9c9e835dfd67fd — DOI: https://doi.org/10.5281/zenodo.18649682