May 14, 2026Open Access

World Models and World Action Models (WAM): From Foundation Simulators to Embodied Action

Puntos clave

Los puntos clave no están disponibles para este artículo en este momento.

Resumen

World models—internal predictive representations that enable agents to simulate future states, anticipate consequences, and plan actions—have emerged as a foundational paradigm in embodied artificial intelligence. Originating from model-based reinforcement learning, this field has undergone a radical transformation with the advent of large-scale generative models, blurring the historical boundary between passive video prediction and interactive physical simulation. Concurrently, Vision-Language-Action (VLA) models have established a powerful framework for grounding high-level linguistic intent in low-level motor control. The natural convergence of these two threads—predictive world simulation and action-grounded multimodal reasoning—has given rise to Embodied World Action Models (WAMs), representing a new frontier in which agents learn to act by imagining their futures. However, the explosive growth of methods across robotics, autonomous driving, and interactive simulation has produced a fragmented landscape that lacks systematic unification. This survey presents a comprehensive and structured review of the modern world model ecosystem, encompassing 200+ key papers organized into a unified taxonomy. We systematically cover six major pillars: (i) Foundation World Models, including general-purpose interactive simulators (Genie, Cosmos, Sora) and game-specific environments (Oasis, Matrix-Game); (ii) Vision-Language-Action Models, spanning foundational architectures (RT-2, π₀, OpenVLA), driving-specific VLAs, and embodied manipulation policies; (iii) Embodied World Action Models, unifying video generation and action prediction through zero-shot policies, controllable simulation platforms, and world model-based reinforcement learning; (iv) Autonomous Driving World Models, addressing video generation, closed-loop simulation, planning policies, and geometric occupancy/BEV representations; (v) Efficiency and Evaluation, covering computational acceleration techniques and benchmarking protocols for physical plausibility; and (vi) Datasets and Ecosystems, including large-scale robot learning corpora and industry technical reports that underpin the entire field. Through this organization, we illuminate the evolutionary trajectory from passive pixel predictors to active, reasoning, and action-grounded simulators. We identify critical open challenges—including physical consistency, cross-embodiment generalization, safety verification, and the sim-to-real evaluation gap—and outline future directions toward cognitive world models, autonomous data collection, and standardized open ecosystems. This survey aims to serve as a definitive reference for researchers and practitioners advancing the next generation of embodied intelligence.

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Discussion

Authors

Xin Jin

Actions

References and Citations

Connected Papers

Building similarity graph...

Analyzing shared references across papers

World Models and World Action Models (WAM): From Foundation Simulators to Embodied Action

Puntos clave

Resumen

Citation Network

Connected Papers

Discussion

Authors

Actions

References and Citations

Citation Network

Connected Papers

Discussion

Cite this study

Also consider