Vision-Language-Action (VLA) models have shown remarkable achievements, driven by the rich implicit knowledge of their vision-language components. However, achieving generalist robotic agents demands precise grounding in physical interaction, especially in contact-rich scenarios where fine-grained force control is essential. We advance VLAs' implicit knowledge beyond identifying what to do, towards guiding how to physically interact with the real world. This paper introduces Tactile-VLA, a novel framework that deeply fuses vision, language, action, and tactile sensing. The framework incorporates a hybrid position-force controller that translates the model's intentions into precise physical actions, and a reasoning module that allows the robot to adapt its strategy based on tactile feedback. Experiments demonstrate Tactile-VLA's effectiveness and generalizability in three key aspects: (1) enabling tactile-aware instruction following, (2) utilizing tactile-relevant commonsense, and (3) facilitating adaptive tactile-involved reasoning. A key finding is that the VLM's prior knowledge already contains a semantic understanding of physical interaction; by connecting it to the robot's tactile sensors with only a few demonstrations, we can activate this prior knowledge to achieve zero-shot generalization in contact-rich tasks.
Jialei Huang
Shuo Wang
Fanqi Lin
Huang et al. studied this question.
www.synapsesocial.com/papers/68de5d9c83cbc991d0a204d0 — DOI: https://doi.org/10.48550/arxiv.2507.09160