v3: Substantive correction of the v2 (May 2026) release.

CHANGE SUMMARY
Inference outputs are unchanged; only the scoring of HumanEval was affected. A parsing oversight in evaluating raw Python outputs failed to strip stop tokens (e.g., ``), causing functional execution (`SyntaxError`) failures for correctly generated code. This artificially suppressed the no-CoT scores, especially for Qwen-32B (where it dropped to 15.9%). With proper token stripping (released as an updated `results/score_humaneval.py`), the dramatic +68.9 pp CoT boost for Qwen-32B is reduced to +23.2 pp. The core finding of a model-size-dependent transition on L-complexity tasks is preserved, but the effect magnitudes are much smaller.

KEY NUMBER CHANGES
- HumanEval CoT delta (Qwen-32B): was +68.9 pp (v2) -> +23.2 pp (v3). Baseline no-CoT accuracy corrected from 15.9% to 62.2%.
- HumanEval CoT delta (Qwen-7B): was -27.4 pp (v2) -> -28.7 pp (v3).
- HumanEval CoT delta (Llama-8B): was +15.9 pp (v2) -> +9.1 pp (v3).
- Pre-registered McNemar tests significant after Bonferroni: was 10/15 (v2) -> 9/15 (v3). The Llama-8B HumanEval cell is no longer significant.
- GSM8K, MATH, MMLU, ARC-Challenge deltas: unchanged from v2.

WHAT v3 ARGUES
The core thesis remains the same as v2:
- The math-side prediction of the Hdp framework (CoT recovers single-pass bandwidth) is strongly supported across all three models on GSM8K and MATH.
- The negative TC⁰ prediction (CoT actively hurts low-depth tasks) is not supported: CoT is approximately neutral on MMLU and ARC.
- HumanEval continues to show the predicted model-size-dependent transition (+23.2 pp for Qwen-32B, +9.1 pp for Llama-8B, -28.7 pp for Qwen-7B), confirming that CoT hurts smaller models but helps larger models on intermediate-complexity tasks.

PROVENANCE
A parser artefact caused the models to receive abnormally low no-CoT scores on HumanEval because special tokens (e.g., ``) were not stripped prior to functional execution, resulting in tracebacks.
The `score_humaneval.py` script was updated to strip these tags using regex. This correction has been integrated into the provided replication datasets and the SQLite database.
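
To illustrate the failure mode described in this note, here is a minimal sketch of stripping special tokens from a raw completion before it is executed. It is not the code in `score_humaneval.py`: the `<|...|>` token shape, the `<|hypothetical_stop|>` marker, and the helper names `strip_special_tokens` and `compiles` are all assumptions made for illustration, since the actual tokens and pattern are not shown here.

```python
import re

# Assumption: special tokens have the shape <|name|>. The real token list
# used by the scoring script is not given in this note.
SPECIAL_TOKEN = re.compile(r"<\|[^|<>]*\|>")


def strip_special_tokens(completion: str) -> str:
    """Remove special-token markers and trailing whitespace from a completion."""
    return SPECIAL_TOKEN.sub("", completion).rstrip()


def compiles(source: str) -> bool:
    """Return True if the source parses as Python, False on SyntaxError."""
    try:
        compile(source, "<completion>", "exec")
        return True
    except SyntaxError:
        return False


# A correct solution followed by a hypothetical stop token.
raw = "def add(a, b):\n    return a + b\n<|hypothetical_stop|>"
cleaned = strip_special_tokens(raw)

print(compiles(raw))      # the trailing token makes the file unparsable
print(compiles(cleaned))  # the same code parses once the token is removed
```

A pattern this broad would also rewrite string literals that happen to contain `<|...|>`, so a scorer may prefer to match an explicit list of known tokens instead.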
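
The significance counts above (10/15 in v2, 9/15 in v3) come from McNemar tests with a Bonferroni correction. The sketch below shows how one such cell could be checked. It assumes the exact (binomial) two-sided form of the test and a family-wise alpha of 0.05; neither is stated in this note. The discordant counts are hypothetical and are not taken from the results.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: problems solved without CoT only; c: problems solved with CoT only.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least this lopsided under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


ALPHA = 0.05    # assumed family-wise error rate
N_TESTS = 15    # number of pre-registered tests, from the note above
threshold = ALPHA / N_TESTS

# Hypothetical discordant counts for a single model/benchmark cell.
p = mcnemar_exact(b=5, c=20)
print(f"p = {p:.5f}, Bonferroni threshold = {threshold:.5f}")
print("significant after correction:", p < threshold)
```

With these made-up counts the cell would pass an uncorrected 0.05 threshold but fail the corrected one, which is the kind of borderline case where a modest change in scoring can flip a cell's significance.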
Tughanbulut Kurtulush
Tula University
Vistula University
Tughanbulut Kurtulush (Sat,) studied this question.
synapsesocial.com/papers/6a1d22bb02fbce91306385ff — DOI: https://doi.org/10.5281/zenodo.20463677