What type of study is this?

This is a quantitative study.

October 5, 2025 · Open Access

VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators

Key Points

VLA-RFT improves robustness under perturbed conditions and lowers sample requirements significantly.
With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines in task execution.
The proposed framework employs a world model to provide trajectory-level rewards, enhancing learning efficiency.
VLA-RFT establishes a practical post-training paradigm for improving generalization in vision-language-action models.

Abstract

Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.
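To make the idea of a trajectory-level reward concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy stand-in for the world model (`toy_world_model`), flattened list-of-float "frames", and a mean absolute difference as the similarity measure. All of these names and choices are hypothetical; the abstract does not specify the reward function or the model interface.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # hypothetical: a flattened toy "image"
WorldModel = Callable[[Frame, float], Frame]


def rollout(world_model: WorldModel, first_frame: Frame,
            actions: Sequence[float]) -> List[Frame]:
    """Roll the world model forward, one predicted frame per action."""
    frames = []
    frame = first_frame
    for action in actions:
        frame = world_model(frame, action)
        frames.append(frame)
    return frames


def trajectory_reward(predicted: Sequence[Frame],
                      reference: Sequence[Frame]) -> float:
    """Negative mean per-frame L1 distance to a goal-achieving reference.

    Assumption: the paper's actual reward may use a different metric.
    """
    if len(predicted) != len(reference):
        raise ValueError("trajectories must have the same length")
    total = 0.0
    for pred, ref in zip(predicted, reference):
        total += sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)
    return -total / len(predicted)


def toy_world_model(frame: Frame, action: float) -> Frame:
    """Hypothetical dynamics: the action shifts every pixel value."""
    return [x + action for x in frame]


start = [0.0, 0.0]
# Reference trajectory produced by actions that reach the goal.
reference = rollout(toy_world_model, start, [1.0, 1.0, 1.0])

# A policy that matches the reference scores higher than one that stalls.
good = trajectory_reward(rollout(toy_world_model, start, [1.0, 1.0, 1.0]), reference)
bad = trajectory_reward(rollout(toy_world_model, start, [0.0, 0.0, 0.0]), reference)
```

In this sketch every frame along the rollout contributes to the score, so the signal is dense over the trajectory rather than a single success flag at the end. A policy-gradient update would then weight sampled action sequences by such rewards.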

Cite this study

Li et al. (2025) studied this question.

www.synapsesocial.com/papers/68e24e6fd6d66a53c2473f29 — DOI: https://doi.org/10.48550/arxiv.2510.00406

Authors

Hanyang Li

Pengxiang Ding

Runze Suo
