What type of study is this?

This is an experimental study.

October 20, 2025 | Open Access

PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation

Key Points

PipelineRL achieves approximately 2x faster learning than conventional RL baselines.
Experiments on long-form reasoning tasks with 128 H100 GPUs show that the training data stays highly on-policy.
Concurrent asynchronous data generation and model training, combined with in-flight weight updates, improve both accelerator utilization and data freshness.
A scalable, modular open-source implementation of PipelineRL is released alongside the paper.

Abstract

Reinforcement Learning (RL) is increasingly utilized to enhance the reasoning capabilities of Large Language Models (LLMs). However, effectively scaling these RL methods presents significant challenges, primarily due to the difficulty in maintaining high AI accelerator utilization without generating stale, off-policy data that harms common RL algorithms. This paper introduces PipelineRL, an approach designed to achieve a superior trade-off between hardware efficiency and data on-policyness for LLM training. PipelineRL employs concurrent asynchronous data generation and model training, distinguished by the novel in-flight weight updates. This mechanism allows the LLM generation engine to receive updated model weights with minimal interruption during the generation of token sequences, thereby maximizing both the accelerator utilization and the freshness of training data. Experiments conducted on long-form reasoning tasks using 128 H100 GPUs demonstrate that PipelineRL achieves approximately 2x faster learning compared to conventional RL baselines while maintaining highly on-policy training data. A scalable and modular open-source implementation of PipelineRL is also released as a key contribution.
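
The in-flight weight update idea described in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the released PipelineRL implementation: `ToyEngine`, `load_weights`, `next_token`, `update_every`, and `mean_lag` are hypothetical names invented for this example, and the "policy" is a trivial arithmetic rule rather than an LLM. It shows a generation loop that swaps in newer weights partway through a sequence and records which weight version produced each token.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToyEngine:
    """Hypothetical stand-in for an LLM generation engine (not a real API)."""

    weight_version: int = 0

    def load_weights(self, version: int) -> None:
        # In-flight update: swap the weights but keep the partial sequence.
        self.weight_version = version

    def next_token(self, prefix: list[int]) -> int:
        # Toy "policy": the token depends on prefix length and current weights.
        return (len(prefix) + self.weight_version) % 100


def generate_with_inflight_updates(
    engine: ToyEngine, seq_len: int, update_every: int
) -> tuple[list[int], list[int]]:
    """Generate one sequence while a simulated trainer publishes new weights.

    Assumption: the trainer finishes one optimizer step every `update_every`
    generated tokens. Returns the tokens and the weight version behind each.
    """
    tokens: list[int] = []
    versions: list[int] = []
    trainer_version = engine.weight_version
    for step in range(seq_len):
        if step > 0 and step % update_every == 0:
            trainer_version += 1
            engine.load_weights(trainer_version)  # no restart of the sequence
        tokens.append(engine.next_token(tokens))
        versions.append(engine.weight_version)
    return tokens, versions


def mean_lag(versions: list[int], final_version: int) -> float:
    """Average number of trainer steps each token lags behind the final weights."""
    return sum(final_version - v for v in versions) / len(versions)


if __name__ == "__main__":
    engine = ToyEngine()
    tokens, versions = generate_with_inflight_updates(engine, seq_len=8, update_every=4)
    print("per-token weight versions:", versions)
    print("mean lag with in-flight updates:", mean_lag(versions, engine.weight_version))
    # Baseline for comparison: weights frozen at version 0 for the whole sequence.
    print("mean lag with frozen weights:", mean_lag([0] * 8, engine.weight_version))
```

In this toy setting, tokens generated after an update carry the newer weight version, so the average lag behind the trainer is lower than when the weights are frozen for the whole sequence. That is the sense in which the abstract describes the training data as staying fresher.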

Cite this study

Piché et al. (2025) studied this question.

www.synapsesocial.com/papers/68f6196ee0bbbc94fac362fd — DOI: https://doi.org/10.48550/arxiv.2509.19128

Authors

Alexandre Piché

Ehsan Kamalloo

Rafael Pardinas
