This technical note presents a controlled three-family stress test on MMLU-Pro under admissible answer-interface perturbations (baseline, choice_shuffle, label_remap). Three model families are evaluated on a locked subset of 140 items spanning 14 categories. A methodological caveat affecting prediction comparability is explicitly identified and corrected through full prediction-space canonicalization with exact decoder recovery. The family-level perturbation signatures remain unchanged after canonicalization, while the raw prediction-level instability is partly reduced but not eliminated. The result is diagnostic and local in scope: under the tested setup, MMLU-Pro remains locally usable but exhibits interface-sensitive evaluative closure and limited global neutrality.

Version update: supplementary verification package added. The package supports deterministic reconstruction of the reported three-family condition-level counts and accuracies from the canonical long-format results file. It includes the canonical long-format CSV, a verification script, a verification notebook, deterministic count/accuracy reconstruction outputs, family-level profiles, item-level correctness summaries, auxiliary bootstrap summaries, and a SHA256 manifest. The package does not replace the original reproducibility material and does not rerun model inference from scratch. Bootstrap-related outputs are auxiliary consistency checks and are not required for reproducing the primary reported accuracy counts. No scientific claim, title, authorship, or main manuscript conclusion is changed in this version.
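To make the canonicalization idea concrete, the following is a minimal sketch, not the note's actual verification script. It assumes each perturbed prompt records which original option sits at each displayed position (`perm`) and which labels were shown (`shown_labels`); those names, the row keys (`family`, `condition`, `canonical_pred`, `gold_index`), and the example labels are all hypothetical. It maps a raw predicted label back to the original option index, then recomputes per-condition counts and accuracies from long-format rows.

```python
from collections import defaultdict


def canonicalize(pred_label, shown_labels, perm):
    """Map a raw predicted label back to the original option index.

    shown_labels: labels displayed to the model, in display order
                  (hypothetical, e.g. ["A", "B", "C", "D"] or remapped ["P", "Q", "R", "S"]).
    perm:         perm[i] is the original option index shown at display position i.
    Returns None when the prediction is not one of the displayed labels.
    """
    if pred_label not in shown_labels:
        return None
    return perm[shown_labels.index(pred_label)]


def condition_accuracy(rows):
    """Recompute (correct, total, accuracy) per (family, condition) from long-format rows."""
    counts = defaultdict(lambda: [0, 0])  # (family, condition) -> [correct, total]
    for row in rows:
        key = (row["family"], row["condition"])
        counts[key][0] += int(row["canonical_pred"] == row["gold_index"])
        counts[key][1] += 1
    return {key: (c, n, c / n) for key, (c, n) in counts.items()}


# One illustrative item whose correct option has original index 2.
identity = [0, 1, 2, 3]
shuffled = [2, 0, 3, 1]  # original option 2 is displayed first

rows = [
    {"family": "family_1", "condition": "baseline", "gold_index": 2,
     "canonical_pred": canonicalize("C", list("ABCD"), identity)},
    {"family": "family_1", "condition": "choice_shuffle", "gold_index": 2,
     "canonical_pred": canonicalize("A", list("ABCD"), shuffled)},
    {"family": "family_1", "condition": "label_remap", "gold_index": 2,
     "canonical_pred": canonicalize("Z", list("PQRS"), identity)},  # unparseable label
]

summary = condition_accuracy(rows)
for key in sorted(summary):
    print(key, summary[key])
```

In this toy case the raw labels "C" and "A" differ across conditions but refer to the same underlying option, which is why comparing raw labels across perturbations can overstate prediction-level instability. Comparing canonical indices removes that artifact while leaving genuine disagreements, such as the unparseable third prediction, in place.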
Danilo Tavella
Danilo Tavella (Tue,) studied this question.
www.synapsesocial.com/papers/6a0ea17cbe05d6e3efb60285 — DOI: https://doi.org/10.5281/zenodo.20289697