What question did this study set out to answer?

The study aims to compare the reasoning capabilities of reasoning-focused and non-reasoning large language models across various tasks and domains.

April 4, 2026 · Open Access

An Empirical Study on Reasoning and Generalization in Large Language Models

Key Points

Conducted a comparative analysis of reasoning-focused and non-reasoning large language models.
Evaluated performance on mathematical and general reasoning benchmarks, and generalization across 30 sub-disciplines spanning six scientific domains.
Assessed qualitative and quantitative differences in solution processes.
Proposed a multi-level evaluation framework to address LLM-as-a-Judge reliability and validated it against human experts.
Reasoning-focused models significantly outperform non-reasoning models on logic-intensive tasks like mathematics and physics.
DeepSeek-R1 shows more elaborate reasoning processes, correlating with better performance on harder tasks.
GPT-o1 is more concise and clear on simpler problems but struggles with complex scenarios.
Distillation partially preserves reasoning abilities, with distilled models consistently outperforming non-reasoning models.

Abstract

Recent reasoning-focused large language models (LLMs) (e.g., GPT-o1 and DeepSeek-R1) combine chain-of-thought prompting with reinforcement learning to generate explicit, multistep derivations. However, their practical advantages over standard instruction-tuned models remain underexplored. In this paper, we present a comprehensive empirical study to address this gap. We compare reasoning-focused and non-reasoning LLMs across several critical dimensions: (1) their performance on mathematical and general reasoning benchmarks; (2) their generalization capabilities across 30 sub-disciplines spanning six scientific domains; (3) the qualitative and quantitative differences in their underlying solution processes; and (4) the impact of model compression via distillation on core reasoning capabilities. Crucially, to reliably conduct our in-depth process analysis, we address the critical issue of LLM-as-a-Judge reliability by proposing and validating a multi-level evaluation framework against human experts. Our experimental results demonstrate that reasoning-focused models significantly outperform non-reasoning counterparts on logic-intensive tasks such as mathematics and physics, while the latter remain competitive on factual or observational problems (e.g., biology, geography). Specifically, our validated process analysis reveals distinct styles: DeepSeek-R1 generates more elaborate reasoning processes, correlating with greater effectiveness on harder tasks, despite potential redundancy. In contrast, GPT-o1 exhibits greater conciseness and frequently presents solutions with superior clarity and efficiency, particularly on simpler problems, while struggling with complex scenarios. Additionally, reasoning capabilities can be partially preserved through distillation, with distilled reasoning models consistently outperforming their non-reasoning counterparts across all domains. These findings provide new insights into when and how explicit reasoning contributes to LLM performance, offering practical guidance for model development and deployment.

Authors

Lu Xiang

Yang Zhao

Yaping Zhang

Journals

Computational Linguistics

Institutions

Institute of Automation

Shandong Institute of Automation
