February 17, 2026 | Open Access

Beyond Majority Voting: Selecting LLM Answers via Hidden State Trajectory Probes

Key Points

The study aims to improve answer selection from language models by utilizing internal hidden state computations rather than relying on majority voting.
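Developed trajectory probes: lightweight linear classifiers trained on hidden state features aggregated across the generation of each candidate answer.
Extracted mean, standard deviation, and final-token activations at eight evenly spaced layers, projected to 256 dimensions, giving a 6,144-dimensional feature vector per answer.
Trained a logistic regression probe with a pairwise ranking objective (RankNet) to prefer correct answers over incorrect ones for the same question.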
On TriviaQA, trajectory probes reached 56.4% accuracy versus 51.3% for majority voting.
Selection precision (PickAcc) was 91.2% on questions where at least one sampled answer was correct.
The choice of training objective can matter more than standalone classifier quality: a pairwise probe with lower AUC can outperform a binary classifier with higher AUC.
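
As a rough illustration of the feature construction summarized in the key points, the sketch below builds one fingerprint from stand-in activations. Several details are assumptions, not taken from the study: the projection is a fixed random matrix shared across layers, it is applied before the per-token statistics are computed, "evenly spaced" is read as linearly spaced layer indices, and the model width is shrunk to a toy `D_MODEL` to keep the demo light. The name `trajectory_fingerprint` is hypothetical.

```python
import numpy as np

NUM_LAYERS, D_MODEL = 32, 512   # toy sizes (assumption); D_MODEL is shrunk for the demo
NUM_PROBED, PROJ_DIM = 8, 256   # eight layers and 256 dimensions, as stated in the summary

rng = np.random.default_rng(0)
# Assumption: one fixed random projection shared by all probed layers.
proj = rng.normal(size=(D_MODEL, PROJ_DIM)) / np.sqrt(D_MODEL)
# Assumption: "evenly spaced" means linearly spaced indices from first to last layer.
layers = np.linspace(0, NUM_LAYERS - 1, NUM_PROBED).round().astype(int)

def trajectory_fingerprint(hidden_states):
    """hidden_states: (NUM_LAYERS, num_tokens, D_MODEL) activations for one answer."""
    parts = []
    for layer in layers:
        h = hidden_states[layer] @ proj                   # (num_tokens, PROJ_DIM)
        parts += [h.mean(axis=0), h.std(axis=0), h[-1]]   # mean, std, final token
    return np.concatenate(parts)

# Random numbers stand in for the hidden states of a 20-token answer.
hidden_states = rng.normal(size=(NUM_LAYERS, 20, D_MODEL))
fp = trajectory_fingerprint(hidden_states)
print(fp.shape)
```

The resulting vector has 8 layers x 3 statistics x 256 dimensions = 6,144 entries, which matches the fingerprint size quoted in the abstract.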

Abstract

When a language model generates multiple candidate answers, how should we pick the best one? The default strategy - majority voting - treats the model as a black box, discarding everything except final answer strings. We show that the model's internal computations already contain a usable signal for answer quality, and that a remarkably simple method can extract it. We propose trajectory probes: lightweight linear classifiers trained on hidden state features aggregated across the generation process. From each candidate answer, we extract mean, standard deviation, and final-token activations at eight evenly spaced layers, projected to 256 dimensions - a 6,144-dimensional trajectory fingerprint. A logistic regression probe trained with a pairwise ranking objective (RankNet) learns to prefer correct answers over incorrect ones from the same question. On TriviaQA (Llama-3.1-8B-Instruct, K=4, T=0.3; mean ± std over 3 seeds), the probe reaches 56.4% ± 3.9 versus 51.3% ± 3.9 for majority voting, recovering 58.4% ± 3.0% of the gap to the oracle upper bound, with a selection precision (PickAcc) of 91.2% ± 1.7% on questions where at least one sampled answer is correct. On MATH, gains are smaller and strongly K-dependent: at K_eval=2 the probe improves over majority voting by +2.1 points (3/3 seeds positive), while at the canonical K_eval=4 the improvement narrows to +0.6 ± 1.0 points and is not statistically significant. Two findings surprised us. First, the choice of training objective can matter more than standalone classifier quality: a binary classifier with higher cross-validated AUC can underperform a pairwise probe with lower AUC, because ranking among candidates is a different task than classifying correctness in isolation.
Second, the per-layer signal distribution acts as a domain fingerprint - factual recall spreads information across layers while mathematical reasoning concentrates it in the final third - yet a single probe trained on mixed-domain data can match domain-specific specialists. Our results suggest that the "verifier" for best-of-K selection need not be a separate model or an additional LLM call. It can be a linear function of what the model already computes.
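
To make the pairwise objective concrete, here is a minimal sketch of a linear probe trained with the RankNet loss, log(1 + exp(-(s_correct - s_incorrect))), on purely synthetic data. For a linear scorer this reduces to logistic regression on feature differences between a correct and an incorrect candidate from the same question. Everything about the data is an assumption made for illustration: the toy feature size, the planted "correctness" direction, the per-question offset, and the plain gradient-descent optimizer. The function names are hypothetical, and the numbers this prints say nothing about the study's results.

```python
import numpy as np

rng = np.random.default_rng(0)
K, DIM = 4, 16  # K candidates per question; toy feature size (the abstract's is 6,144)

def make_questions(n):
    """Synthetic stand-in data: features (n, K, DIM) and correctness labels (n, K)."""
    signal = np.zeros(DIM)
    signal[0] = 4.0                                    # hypothetical correctness direction
    correct = rng.integers(0, 2, size=(n, K))
    offset = rng.normal(scale=3.0, size=(n, 1, DIM))   # shift shared within a question
    feats = rng.normal(size=(n, K, DIM)) + offset + correct[..., None] * signal
    return feats, correct

def pairwise_diffs(feats, correct):
    """Feature differences (correct minus incorrect) within each question."""
    diffs = []
    for x, y in zip(feats, correct):
        for i in np.flatnonzero(y == 1):
            for j in np.flatnonzero(y == 0):
                diffs.append(x[i] - x[j])
    return np.array(diffs)

def train_pairwise_probe(diffs, lr=0.1, l2=1e-3, steps=300):
    """Gradient descent on mean(log(1 + exp(-w.d))) plus an L2 penalty."""
    w = np.zeros(diffs.shape[1])
    for _ in range(steps):
        margins = diffs @ w
        p_wrong = 0.5 * (1.0 - np.tanh(margins / 2.0))  # sigmoid(-margin), stable form
        grad = -(diffs * p_wrong[:, None]).mean(axis=0) + l2 * w
        w -= lr * grad
    return w

train_x, train_y = make_questions(200)
test_x, test_y = make_questions(100)
w = train_pairwise_probe(pairwise_diffs(train_x, train_y))

picked = (test_x @ w).argmax(axis=1)                   # best-scoring candidate per question
picked_correct = test_y[np.arange(len(test_y)), picked]
answerable = test_y.max(axis=1) == 1                   # at least one correct candidate
pick_acc = picked_correct[answerable].mean()
print(f"PickAcc on answerable toy questions: {pick_acc:.3f}")
```

The per-question offset cancels in every difference vector, so the pairwise probe never has to model it. That is one way to picture why ranking candidates within a question can be a different task from classifying each answer in isolation.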

Authors

Nikolay Yudin
