What type of study is this?

This is a quantitative study.

October 5, 2025 · Open Access

VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators

Key Points

VLA-RFT improves robustness under perturbed conditions and lowers sample requirements significantly.
With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines in task execution.
The proposed framework employs a world model to provide trajectory-level rewards, enhancing learning efficiency.
VLA-RFT establishes a practical post-training paradigm for improving generalization in vision-language-action models.

Abstract

Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.
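The rollout-and-reward idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the world model here is a hypothetical stand-in that simply adds the action to the observation, observations are short lists of floats rather than images, and the reward (negative mean squared error against reference frames) is an assumed example of a trajectory-level signal derived from a goal-achieving reference. All function names are illustrative.

```python
from typing import Callable, List, Sequence

Obs = List[float]     # stand-in for a visual observation (e.g. a flattened frame)
Action = List[float]  # stand-in for a robot action


def toy_world_model(obs: Obs, action: Action) -> Obs:
    """Hypothetical simulator: predicts the next observation given an action.

    A real data-driven world model would be a learned network; here the
    "dynamics" are just element-wise addition so the example stays runnable.
    """
    return [o + a for o, a in zip(obs, action)]


def rollout(
    world_model: Callable[[Obs, Action], Obs],
    start: Obs,
    actions: Sequence[Action],
) -> List[Obs]:
    """Roll a sequence of actions through the world model, collecting predicted frames."""
    frames: List[Obs] = []
    obs = start
    for action in actions:
        obs = world_model(obs, action)
        frames.append(obs)
    return frames


def trajectory_reward(predicted: Sequence[Obs], reference: Sequence[Obs]) -> float:
    """Assumed reward: negative mean squared error over every frame of the trajectory."""
    total = 0.0
    count = 0
    for pred_frame, ref_frame in zip(predicted, reference):
        for p, r in zip(pred_frame, ref_frame):
            total += (p - r) ** 2
            count += 1
    return -total / count


start = [0.0, 0.0]

# Frames produced by a goal-achieving reference action sequence.
reference_actions = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
reference_frames = rollout(toy_world_model, start, reference_actions)

# Two candidate policy outputs: one matches the reference, one drifts away.
good_actions = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
poor_actions = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

good_reward = trajectory_reward(rollout(toy_world_model, start, good_actions), reference_frames)
poor_reward = trajectory_reward(rollout(toy_world_model, start, poor_actions), reference_frames)
```

In this sketch the candidate that tracks the reference receives the higher reward, which is the kind of signal a reinforcement fine-tuning step could use to prefer one sampled action sequence over another without touching a real robot.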

Authors

Hanyang Li

Pengxiang Ding

Runze Suo
