What question did this study set out to answer?

The study assesses how MMLU-Pro behaves under admissible answer-interface perturbations, and it identifies and corrects a methodological caveat that affects the comparability of predictions across conditions.

May 21, 2026 | Open Access

MMLU-Pro Under Admissible Interface Perturbations: A Three-Family Stress Test with Prediction-Space Canonicalization

Key Points

Assessed how MMLU-Pro behaves under admissible answer-interface perturbations and corrected a prediction-comparability caveat.
Conducted a controlled three-family stress test on MMLU-Pro using 140 items across 14 categories.
Implemented full prediction-space canonicalization with exact decoder recovery.
Evaluated changes in family-level perturbation effects and prediction stability.
Family-level perturbation signatures remained unchanged after canonicalization.
Raw prediction-level instability was reduced but not eliminated.
MMLU-Pro showed local usability but limited global neutrality under tested perturbations.
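
To make the canonicalization idea concrete, here is a minimal sketch. It assumes that choice_shuffle reorders the option texts while keeping the labels in place, and that label_remap keeps the option order but reassigns the labels; the paper may define these perturbations differently. All names here (Presented, decoder, canonicalize) are hypothetical and are not taken from the paper's code. The point is only that each perturbed presentation carries an exact decoder, so a predicted label can be mapped back to the original option index before predictions are compared across conditions.

```python
import random
from dataclasses import dataclass

LABELS = "ABCDEFGHIJ"  # assumption: at most ten options per item


@dataclass(frozen=True)
class Presented:
    options: list   # option texts in the order shown to the model
    labels: list    # label shown next to each position
    decoder: dict   # shown label -> index of the option in the ORIGINAL item


def baseline(options):
    labels = list(LABELS[:len(options)])
    return Presented(list(options), labels,
                     {lab: i for i, lab in enumerate(labels)})


def choice_shuffle(options, rng):
    # Assumed semantics: permute the option texts, keep labels in order.
    order = list(range(len(options)))
    rng.shuffle(order)
    labels = list(LABELS[:len(options)])
    shown = [options[i] for i in order]
    return Presented(shown, labels,
                     {lab: order[pos] for pos, lab in enumerate(labels)})


def label_remap(options, rng):
    # Assumed semantics: keep option order, permute which label each one gets.
    labels = list(LABELS[:len(options)])
    rng.shuffle(labels)
    return Presented(list(options), labels,
                     {lab: i for i, lab in enumerate(labels)})


def canonicalize(presented, predicted_label):
    # Map a raw predicted label to the original option index.
    # Returns None when the label is not one that was shown.
    return presented.decoder.get(predicted_label)
```

With this layout, two predictions from different conditions agree exactly when their canonical indices agree, regardless of which label or position the model saw. Comparing raw labels instead would count a consistent model as unstable under choice_shuffle, which is the kind of comparability issue the canonicalization step is meant to remove.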

Abstract

This technical note presents a controlled three-family stress test on MMLU-Pro under admissible answer-interface perturbations (baseline, choice_shuffle, label_remap). Three model families are evaluated on a locked subset of 140 items spanning 14 categories. A methodological caveat affecting prediction comparability is explicitly identified and corrected through full prediction-space canonicalization with exact decoder recovery. The family-level perturbation signatures remain unchanged after canonicalization, while part of the raw prediction-level instability is reduced but not eliminated. The result is diagnostic and local in scope: under the tested setup, MMLU-Pro remains locally usable but exhibits interface-sensitive evaluative closure and limited global neutrality.

Version update: supplementary verification package added. This version adds a supplementary verification package supporting deterministic reconstruction of the reported three-family condition-level counts and accuracies from the canonical long-format results file. The package includes the canonical long-format CSV, a verification script, a verification notebook, deterministic count/accuracy reconstruction outputs, family-level profiles, item-level correctness summaries, auxiliary bootstrap summaries, and a SHA256 manifest. This supplementary package does not replace the original reproducibility material and does not rerun model inference from scratch. Bootstrap-related outputs are auxiliary consistency checks and are not required for reproducing the primary reported accuracy counts. No scientific claim, title, authorship, or main manuscript conclusion is changed in this version.
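
As a rough illustration of what deterministic reconstruction from a long-format file can look like, the sketch below groups rows by family and condition and recomputes counts and accuracy, and it hashes file bytes the way a SHA256 manifest check would. The column names (family, condition, item_id, correct) and the toy rows are assumptions made for illustration only; they are not the package's actual schema or the paper's results.

```python
import csv
import hashlib
import io
from collections import defaultdict

# Toy rows for illustration only; NOT data from the paper.
SAMPLE = """family,condition,item_id,correct
family_a,baseline,q1,1
family_a,baseline,q2,0
family_a,choice_shuffle,q1,1
family_a,choice_shuffle,q2,1
"""


def condition_accuracy(csv_text):
    # Returns {(family, condition): (n_correct, n_items, accuracy)}.
    totals = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["family"], row["condition"])
        totals[key][0] += int(row["correct"])
        totals[key][1] += 1
    return {key: (c, n, c / n) for key, (c, n) in sorted(totals.items())}


def sha256_hex(data: bytes) -> str:
    # Hex digest to compare against the matching manifest entry.
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    for key, (c, n, acc) in condition_accuracy(SAMPLE).items():
        print(key, f"{c}/{n}", f"{acc:.3f}")
```

Because the computation is a pure function of the file contents, anyone holding a file whose digest matches the manifest should obtain identical counts, without rerunning model inference.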

Authors

Danilo Tavella
