Vision-Language-Action (VLA) models are promising because they let users instruct a robot in natural language and act as generalist policies. VLAs typically attach a randomly initialized Diffusion Transformer (DiT) to a pretrained Vision-Language Model (VLM) and train it by supervised fine-tuning (SFT) on real or synthetic demonstration data. In parallel, modern physics simulators (MuJoCo, Isaac Lab) combined with reinforcement learning (RL) have produced highly capable control policies, especially for the locomotion of complex robots such as humanoids and for dexterous manipulation. One of the main bottlenecks of physical AI is the lack of data, and the main advantage of RL training is that it can run entirely inside a simulator. I propose using RL to pretrain a VLA in a simulator so that it acquires prior physical knowledge. Doing so requires initializing the DiT. I show that this is feasible with two established architectures: a causal transformer trained with standard PPO, or a diffusion policy trained with DPPO or DDiffPG.
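As a point of reference for the PPO option mentioned above, the following is a minimal sketch of the standard clipped surrogate loss that PPO optimizes. It is not the author's implementation: the function name `ppo_clip_loss`, its arguments, and the default `clip_eps=0.2` are illustrative assumptions, and a real policy would compute the log-probabilities from a network rather than take them as plain lists.

```python
import math


def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Negative PPO clipped surrogate objective, averaged over samples.

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy (hypothetical inputs).
    advantages: advantage estimates for the same actions.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between the current and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic choice between the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    # Return a loss to minimize, hence the sign flip.
    return -total / len(advantages)
```

When the two policies agree, the ratio is 1 and the loss reduces to the negative mean advantage. The clipping only takes effect once the current policy has moved away from the one that collected the data.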
Scharffe (Mon,) studied this question.
synapsesocial.com/papers/6a0d5051f03e14405aa9bf64 — DOI: https://doi.org/10.5281/zenodo.20271022