What type of study is this?

This is a quantitative study.

October 20, 2025

Open Access

TreeRPO: Tree Relative Policy Optimization

Key Points

TreeRPO gives LLMs fine-grained, dense reward signals for intermediate reasoning steps rather than a single reward for the whole trajectory.
The algorithm raises the average Pass@1 accuracy of Qwen-2.5-Math on test benchmarks from 19.0% to 35.5%.
TreeRPO estimates step-level rewards directly through tree sampling instead of relying on a separate step reward model.
Compared to GRPO, TreeRPO improves performance by 2.9% while reducing the average response length by 18.1%.

Abstract

Large Language Models (LLMs) have shown remarkable reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR) methods. However, a key limitation of existing approaches is that rewards defined at the full trajectory level provide insufficient guidance for optimizing the intermediate steps of a reasoning process. To address this, we introduce TreeRPO, a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Unlike prior methods that rely on a separate step reward model, TreeRPO directly estimates these rewards through this sampling process. Building on the group-relative reward training mechanism of GRPO, TreeRPO computes rewards based on step-level groups generated during tree sampling. This allows TreeRPO to produce fine-grained and dense reward signals, significantly enhancing the learning process and overall performance of LLMs. Experimental results demonstrate that our algorithm substantially improves the average Pass@1 accuracy of Qwen-2.5-Math on test benchmarks, increasing it from 19.0% to 35.5%. Furthermore, TreeRPO outperforms GRPO by 2.9% in performance while simultaneously reducing the average response length by 18.1%, showcasing its effectiveness and efficiency. Our code will be available at https://github.com/yangzhch6/TreeRPO.
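The idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` class, the function names, and the GRPO-style mean/standard-deviation normalization inside each sibling group are assumptions made for illustration, and the paper's exact reward formula may differ. The sketch estimates each step's expected reward by averaging over the subtree sampled beneath it, then scores each step relative to the other steps sampled from the same parent.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import List, Optional


@dataclass
class Node:
    """One reasoning step in a sampled tree (hypothetical structure)."""
    children: List["Node"] = field(default_factory=list)
    leaf_reward: Optional[float] = None  # verifier outcome, set on leaves only
    value: float = 0.0                   # estimated expected reward of this step
    advantage: float = 0.0               # reward relative to sibling steps


def estimate_values(node: Node) -> float:
    """Estimate a step's expected reward as the average of its children's values."""
    if not node.children:
        if node.leaf_reward is None:
            raise ValueError("leaf node is missing a verifier reward")
        node.value = float(node.leaf_reward)
    else:
        node.value = mean([estimate_values(child) for child in node.children])
    return node.value


def assign_group_advantages(node: Node, eps: float = 1e-6) -> None:
    """Normalize each child's value within its sibling group (assumed GRPO-style)."""
    if not node.children:
        return
    values = [child.value for child in node.children]
    mu, sigma = mean(values), pstdev(values)
    for child in node.children:
        child.advantage = (child.value - mu) / (sigma + eps)
        assign_group_advantages(child, eps)


# Toy tree: two first steps, each continued twice; leaves hold 0/1 verifier rewards.
a = Node(children=[Node(leaf_reward=1.0), Node(leaf_reward=1.0)])
b = Node(children=[Node(leaf_reward=1.0), Node(leaf_reward=0.0)])
root = Node(children=[a, b])

estimate_values(root)
assign_group_advantages(root)
print(root.value, a.advantage, b.advantage)
```

In this toy tree, the first step `a` always leads to a correct answer and so receives a positive advantage over its sibling `b`, while the two continuations under `a` are indistinguishable and receive zero advantage. No separate step reward model is involved; the step-level signal comes only from the final verifier rewards at the leaves.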

Authors

Zhicheng Yang

Zhijiang Guo

Yinya Huang
