What question did this study set out to answer?

The study surveys the development and progress of vision-language-action (VLA) models in embodied intelligence.

April 10, 2026 | Open Access

Research Progress on Vision-Language-Action Models for Embodied Intelligence

Key Points

Reviewed the evolution of visual and language foundational models.
Analyzed key technical modules of VLA models, including visual encoding, language representation, and action tokenization and decoding.
Classified existing models into single system, dual system, and hierarchical categories.
Summarized pre-training and post-training strategies.
Reviewed evaluation benchmarks in simulation and real-world environments.
Identified challenges in real-time inference efficiency and data quality.
Discussed issues related to environment generalization and to safety and ethics.
Provided a comprehensive reference for future VLA research directions.

Abstract

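Embodied intelligence, a frontier field at the intersection of artificial intelligence and robotics, aims to enable agents to perceive, reason, and execute tasks through dynamic interaction with the physical world. However, traditional deep-learning-based cascaded perception-control models generalize poorly in open, dynamic environments and depend heavily on large-scale annotated data. In recent years, vision-language-action (VLA) models, which integrate visual perception, language understanding, and action generation, have given new impetus to research on and applications of embodied intelligence. This paper systematically reviews progress in VLA-based embodied intelligence, covering development history, model architecture, system taxonomy, and training and evaluation. First, it traces the evolution of vision and language foundation models and explains the background against which the VLA concept was proposed. Next, it examines the key technical modules of VLA in depth, including visual encoding, language representation, and action tokenization and decoding mechanisms. Building on this, it introduces a system-architecture taxonomy that groups existing work into three categories (single-system, dual-system, and hierarchical) and analyzes their design trade-offs and applicable scenarios. In addition, it summarizes pre-training and post-training strategies and surveys mainstream evaluation benchmarks in simulated and real-world environments. Finally, it analyzes the challenges VLA faces in real-time inference efficiency, data quality, environment generalization, and safety and ethics, and looks ahead to future directions, including the shift from passive perception to active reasoning, continual learning, scene generalization, and reliable deployment. The paper aims to provide researchers with a systematic technical reference and to advance both the theory and the practical deployment of VLA in open-world embodied tasks. The algorithms, datasets, and evaluation metrics mentioned in the paper are collected at https://github.com/DefaultRui/vision-language-action-models-for-embodied-AI.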
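The abstract lists action tokenization and decoding (动作词元化与解码) among the key VLA modules without describing a specific scheme. As a rough illustration only, the sketch below assumes uniform binning of each continuous action dimension into discrete tokens, which is one commonly used approach and not necessarily the one the survey emphasizes. The function names, the 256-bin vocabulary, and the [-1, 1] action range are all hypothetical choices made for this example.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed number of discrete tokens per action dimension


def tokenize_action(action: Sequence[float], low: float = -1.0,
                    high: float = 1.0, num_bins: int = NUM_BINS) -> List[int]:
    """Map each continuous action value to an integer bin index (a token)."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)        # keep the value inside [low, high]
        scaled = (clipped - low) / (high - low)     # normalize to [0, 1]
        # The upper bound maps to num_bins, so cap it at the last valid index.
        tokens.append(min(int(scaled * num_bins), num_bins - 1))
    return tokens


def detokenize_action(tokens: Sequence[int], low: float = -1.0,
                      high: float = 1.0, num_bins: int = NUM_BINS) -> List[float]:
    """Decode tokens back to continuous values at the center of each bin."""
    width = (high - low) / num_bins
    return [low + (token + 0.5) * width for token in tokens]


if __name__ == "__main__":
    # Hypothetical 4-dimensional action (e.g. end-effector deltas plus gripper).
    action = [-1.0, 0.0, 0.3, 1.0]
    tokens = tokenize_action(action)
    print(tokens)
    print(detokenize_action(tokens))
```

Under these assumptions, decoding recovers each in-range value to within half a bin width, so the bin count trades vocabulary size against action precision.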

Cite this study

Liu Rui et al. studied this question.

synapsesocial.com/papers/69d893a86c1944d70ce04abe — DOI: https://doi.org/10.11834/jig.250544

Authors

Liu Rui

Wang Wenguan

Wang Jun

Journals

Journal of Image and Graphics
