Abstract
Recent reasoning-focused large language models (LLMs), such as GPT-o1 and DeepSeek-R1, combine chain-of-thought prompting with reinforcement learning to generate explicit, multistep derivations. However, their practical advantages over standard instruction-tuned models remain underexplored. In this paper, we present a comprehensive empirical study to address this gap. We compare reasoning-focused and non-reasoning LLMs along four dimensions: (1) their performance on mathematical and general reasoning benchmarks; (2) their generalization across 30 sub-disciplines spanning six scientific domains; (3) the qualitative and quantitative differences in their underlying solution processes; and (4) the impact of model compression via distillation on core reasoning capabilities. To make the process analysis reliable, we address the issue of LLM-as-a-Judge reliability by proposing a multi-level evaluation framework and validating it against human experts. Our experimental results demonstrate that reasoning-focused models significantly outperform non-reasoning counterparts on logic-intensive tasks such as mathematics and physics, while the latter remain competitive on factual or observational problems (e.g., biology, geography). Our validated process analysis also reveals distinct styles. DeepSeek-R1 generates more elaborate reasoning processes, which correlates with greater effectiveness on harder tasks despite potential redundancy. In contrast, GPT-o1 is more concise and frequently presents solutions with superior clarity and efficiency, particularly on simpler problems, but struggles with complex scenarios. Additionally, reasoning capabilities can be partially preserved through distillation, with distilled reasoning models consistently outperforming their non-reasoning counterparts across all domains.
These findings provide new insights into when and how explicit reasoning contributes to LLM performance, offering practical guidance for model development and deployment.
Lu Xiang
Yang Zhao
Yaping Zhang
Computational Linguistics
Institute of Automation
Shandong Institute of Automation
Xiang et al. studied this question.
www.synapsesocial.com/papers/69d0af83659487ece0fa5859 — DOI: https://doi.org/10.1162/coli.a.619