What does this research mean for the field?

The study argues that reinforcement learning in physics simulators can pretrain the Diffusion Transformer (DiT) component of Vision-Language-Action models to establish physical priors, using either a causal transformer trained with PPO or a diffusion policy trained with DPPO or DDiffPG. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This research aims to integrate reinforcement learning with vision-language-action models to enhance their physical knowledge.

May 20, 2026 · Open Access

RL-Pretrained Action Experts for Vision-Language-Action Models: Towards Physical Priors in Diffusion Policy Initialization

Key Points

Proposed using RL in a simulator to pretrain VLA models.
Considered two established architectures: a causal transformer trained with PPO, and a diffusion policy trained with DPPO or DDiffPG.
Built on the standard VLA design, in which a randomly initialized DiT is attached to a pretrained vision-language model and trained by supervised fine-tuning on real or synthetic demonstration data.
Showed that pretraining VLA models with RL in a simulator is feasible.
Aimed to give the DiT a physically informed initialization rather than a random one.

Abstract

Vision-Language-Action (VLA) models are promising because they let users interact with a robot through natural language and act as highly generalist policies. VLAs attach a randomly initialized Diffusion Transformer (DiT) to a pretrained Vision-Language Model (VLM) and train it via supervised fine-tuning (SFT) on real or synthetic demonstration data. In parallel, modern physics simulators (MuJoCo, Isaac Lab) combined with reinforcement learning (RL) have produced highly capable control policies, especially for the locomotion of very complex robots such as humanoids and for dexterous manipulation. One of the main bottlenecks of physical AI is the lack of data, and the main advantage of RL training is that it can be run inside a simulator. I propose to use RL to pretrain a VLA in a simulator so that it builds physical knowledge a priori. Doing so requires initializing the DiT. I show that this is feasible with two established architectures: a causal transformer trained by standard PPO, or a diffusion policy trained by DPPO or DDiffPG.
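
For readers unfamiliar with the "standard PPO" mentioned in the abstract, the sketch below shows the clipped surrogate loss that PPO minimizes. It is a minimal plain-Python illustration, not the author's implementation: the function name `ppo_clip_loss`, its parameter names, and the default `clip_eps=0.2` are assumptions chosen for this example, and value-function and entropy terms are left out.

```python
import math


def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss (to be minimized), averaged over samples.

    new_logp / old_logp: log-probabilities of the taken actions under the
    current policy and under the policy that collected the data.
    advantages: advantage estimates for the same actions.
    All names here are illustrative, not taken from the paper.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between the current and the data-collecting policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic (lower) bound of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped * adv)
    # Negate because optimizers minimize while PPO maximizes the surrogate.
    return -total / len(advantages)


if __name__ == "__main__":
    # Hypothetical numbers, only to show the call shape.
    print(ppo_clip_loss([-0.9, -1.2], [-1.0, -1.0], [0.5, -0.3]))
```

The clipping removes the incentive to move the policy far from the one that collected the data in a single update, which is one reason PPO is a common choice for simulator-based training.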

Authors

Scharffe