October 1, 2025 | Open Access

SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM

Key Points

SRPO surpasses DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks.
It does so with only about 1/10 of the training steps required by DeepSeek-R1-Zero-32B, starting from the same base model (Qwen2.5-32B).
The method builds on GRPO and adds a two-stage cross-domain training paradigm, to balance mathematical reasoning and coding proficiency, and History Resampling (HR), to address ineffective samples.
Comprehensive experiments support SRPO's effectiveness in scaling LLM reasoning capabilities across diverse tasks.

Abstract

Recent advances in reasoning models, exemplified by OpenAI's o1 and DeepSeek's R1, highlight the significant potential of Reinforcement Learning (RL) to enhance the reasoning capabilities of Large Language Models (LLMs). However, replicating these advancements across diverse domains remains challenging due to limited methodological transparency. In this work, we present two-Staged history-Resampling Policy Optimization (SRPO), which surpasses the performance of DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks. SRPO achieves this with the same base model as DeepSeek (i.e., Qwen2.5-32B) and only about 1/10 of the training steps required by DeepSeek-R1-Zero-32B, demonstrating superior efficiency. Building upon Group Relative Policy Optimization (GRPO), we introduce two key methodological innovations: (1) a two-stage cross-domain training paradigm designed to balance the development of mathematical reasoning and coding proficiency, and (2) History Resampling (HR), a technique to address ineffective samples. Our comprehensive experiments validate the effectiveness of our approach, offering valuable insights into scaling LLM reasoning capabilities across diverse tasks.
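The abstract does not say what makes a sample "ineffective". One plausible reading, sketched below as an illustration and not as the authors' implementation, is that in GRPO a prompt whose rollouts all receive the same reward yields zero group-relative advantage and therefore no learning signal. The function names, the 0/1 reward convention, and the specific filtering rule (drop prompts whose previous rollouts were all correct) are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def resample_history(history):
    """Hypothetical filter (an assumption, not taken from the abstract):
    keep prompts whose previous rollouts were not all correct.

    `history` maps a prompt id to the list of 0/1 rewards from its last rollouts.
    """
    return [pid for pid, rewards in history.items()
            if not all(r == 1 for r in rewards)]


# Toy rollout history for three prompts, four rollouts each.
history = {
    "easy": [1, 1, 1, 1],   # identical rewards: every advantage is zero
    "mixed": [1, 0, 0, 1],  # varied rewards: non-zero advantages
    "hard": [0, 0, 0, 0],   # also zero advantage now, kept under this rule
}

for pid, rewards in history.items():
    print(pid, group_relative_advantages(rewards))

print(resample_history(history))
```

Under this assumed rule, the "easy" prompt is dropped for the next epoch, while "hard" is kept on the guess that it may become solvable as the policy improves.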

Cite this study

Zhang et al. studied this question.

www.synapsesocial.com/papers/68dd91c7fe798ba2fc4982cc — DOI: https://doi.org/10.48550/arxiv.2504.14286

Authors

Xiaojiang Zhang

Jinghui Wang

Zifei Cheng
