What question did this study set out to answer?

This research examines how chain-of-thought (CoT) strategies affect performance of large language models (LLMs) on various coding benchmarks.

June 1, 2026Open Access

When Chain-of-Thought Helps and When It Hurts: A Communication-Complexity Account of LLM Benchmark Behaviour via the Hdp Bandwidth Bound

Read Full Paperexternally

Key Points

This research examines how chain-of-thought (CoT) strategies affect performance of large language models (LLMs) on various coding benchmarks.
Implementation of updated scoring for HumanEval with token stripping for accurate output evaluation.
Comparison across models: Qwen-32B, Qwen-7B, and Llama-8B.
Application of McNemar tests post-correction for statistical significance.
The performance of Qwen-32B improved by +23.2 pp with CoT after correction, down from +68.9 pp.
Baseline no-CoT accuracy for Qwen-32B adjusted from 15.9% to 62.2%.
CoT strategy negatively impacts smaller models like Qwen-7B, showing a decline of -28.7 pp.

Abstract

v3 — Substantive correction of the v2 (May 2026) release. CHANGE SUMMARY Inference outputs are unchanged; only the scoring of HumanEval was affected. A parsing oversight in evaluating raw Python outputs failed to strip stop tokens (e. g. , ``), causing functional execution (`SyntaxError`) failures for correctly generated code. This artificially suppressed the no-CoT scores, especially for Qwen-32B (where it dropped to 15. 9%). With proper token stripping (released as an updated results/scoreₕumaneval. py), the dramatic +68. 9 pp CoT boost for Qwen-32B is reduced to +23. 2 pp. The core finding of a model-size-dependent transition on L-complexity tasks is preserved, but the effect magnitudes are much smaller. KEY NUMBER CHANGES - HumanEval CoT delta (Qwen-32B): was +68. 9 pp (v2) -> +23. 2 pp (v3). Baseline no-CoT accuracy corrected from 15. 9% to 62. 2%. - HumanEval CoT delta (Qwen-7B): was -27. 4 pp (v2) -> -28. 7 pp (v3). - HumanEval CoT delta (Llama-8B): was +15. 9 pp (v2) -> +9. 1 pp (v3). - Pre-registered McNemar tests significant after Bonferroni: was 10/15 (v2) -> 9/15 (v3). The Llama-8B HumanEval cell is no longer significant. - GSM8K, MATH, MMLU, ARC-Challenge deltas: unchanged from v2. WHAT v3 ARGUES The core thesis remains the same as v2: The math-side prediction of the Hdp framework (CoT recovers single-pass bandwidth) is strongly supported across all three models on GSM8K and MATH. The negative TC⁰ prediction (CoT actively hurts low-depth tasks) is not supported: CoT is approximately neutral on MMLU and ARC. HumanEval continues to show the predicted model-size-dependent transition (+23. 2 pp for Qwen-32B, +9. 1 pp for Llama-8B, -28. 7 pp for Qwen-7B), confirming that CoT hurts smaller models but helps larger models on intermediate-complexity tasks. PROVENANCE A parser artefact caused the models to receive abnormally low no-CoT scores on HumanEval because special tokens (e. g. ``) were not stripped prior to functional execution, resulting in tracebacks. The `scoreₕumaneval. py` script was updated to strip these tags using regex. This correction has been integrated into the provided replication datasets and the SQLite database.

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Discussion

Authors

Tughanbulut Kurtulush

Actions

Institutions

Tula University

Vistula University

References and Citations

Connected Papers

Building similarity graph...

Analyzing shared references across papers

When Chain-of-Thought Helps and When It Hurts: A Communication-Complexity Account of LLM Benchmark Behaviour via the Hdp Bandwidth Bound

Key Points

Abstract

Citation Network

Connected Papers

Discussion

Authors

Actions

Institutions

References and Citations

Citation Network

Connected Papers

Discussion

Cite this study