What type of study is this?

This is a quantitative study (also classified as an experimental study).

October 9, 2025 | Open Access

VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation

Key points

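The VTLA model outperforms traditional imitation learning methods (e.g., diffusion policies) and multi-modal baselines (TLA/VLA), achieving success rates above 90% on insertion tasks with unseen peg shapes.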
A low-cost, multi-modal dataset was constructed for training, focusing on vision-tactile-action-instruction pairs.
Direct Preference Optimization bridges continuous robotic tasks with classification-based models, enhancing performance.
Real-world experiments confirm the Sim2Real capability of the VTLA model, highlighting its practical application.

Abstract

While vision-language models have advanced significantly, their application to language-conditioned robotic manipulation remains underexplored, especially for contact-rich tasks that extend beyond visually dominant pick-and-place scenarios. To bridge this gap, we introduce the Vision-Tactile-Language-Action (VTLA) model, a novel framework that enables robust policy generation in contact-intensive scenarios by integrating visual and tactile inputs through cross-modal language grounding. A low-cost, multi-modal dataset has been constructed in a simulation environment, containing vision-tactile-action-instruction pairs specifically designed for the fingertip insertion task. Furthermore, we introduce Direct Preference Optimization (DPO) to offer regression-like supervision for the VTLA model, bridging the gap between the classification-based next-token prediction loss and continuous robotic tasks. Experimental results show that the VTLA model outperforms traditional imitation learning methods (e.g., diffusion policies) and existing multi-modal baselines (TLA/VLA), achieving success rates above 90% on unseen peg shapes. Finally, we conduct real-world peg-in-hole experiments to demonstrate the exceptional Sim2Real performance of the proposed VTLA model. For supplementary videos and results, please visit our project website: https://sites.google.com/view/vtla
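To make the role of DPO more concrete, the following is a minimal sketch of the standard DPO objective for a single preference pair, written in plain Python. It is not taken from the paper. The function names (`dpo_loss`, `rank_by_error`), the value of `beta`, and the idea of ranking two candidate actions by their distance to a reference action are all assumptions made for illustration; the abstract does not specify how the preference pairs are built.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trainable policy and
    under a frozen reference model. beta is a hypothetical default.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) is the same as softplus(-z)
    return softplus(-logits)


def rank_by_error(candidate_a, candidate_b, target):
    """Assumed pairing rule: the candidate action closer to the target
    (Euclidean distance) is 'chosen', the other is 'rejected'."""
    def dist(action):
        return math.sqrt(sum((a - t) ** 2 for a, t in zip(action, target)))
    if dist(candidate_a) <= dist(candidate_b):
        return candidate_a, candidate_b
    return candidate_b, candidate_a


# Usage: pick the better of two candidate actions, then score the pair.
chosen, rejected = rank_by_error([0.1, 0.0], [0.5, 0.5], [0.0, 0.0])
loss = dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=1.0)
```

Under this assumed pairing rule, the preference signal depends on how far each candidate action is from the reference, which is one way a token-level classification model could receive supervision that reflects continuous error.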

Authors

Chaofan Zhang

Hao Peng

Xin Cao
