World models—internal predictive representations that enable agents to simulate future states, anticipate consequences, and plan actions—have emerged as a foundational paradigm in embodied artificial intelligence. Originating from model-based reinforcement learning, this field has undergone a radical transformation with the advent of large-scale generative models, blurring the historical boundary between passive video prediction and interactive physical simulation. Concurrently, Vision-Language-Action (VLA) models have established a powerful framework for grounding high-level linguistic intent in low-level motor control. The natural convergence of these two threads—predictive world simulation and action-grounded multimodal reasoning—has given rise to Embodied World Action Models (WAMs), representing a new frontier in which agents learn to act by imagining their futures. However, the explosive growth of methods across robotics, autonomous driving, and interactive simulation has produced a fragmented landscape that lacks systematic unification. This survey presents a comprehensive and structured review of the modern world model ecosystem, encompassing 200+ key papers organized into a unified taxonomy. 
We systematically cover six major pillars: (i) Foundation World Models, including general-purpose interactive simulators (Genie, Cosmos, Sora) and game-specific environments (Oasis, Matrix-Game); (ii) Vision-Language-Action Models, spanning foundational architectures (RT-2, π₀, OpenVLA), driving-specific VLAs, and embodied manipulation policies; (iii) Embodied World Action Models, unifying video generation and action prediction through zero-shot policies, controllable simulation platforms, and world model-based reinforcement learning; (iv) Autonomous Driving World Models, addressing video generation, closed-loop simulation, planning policies, and geometric occupancy/BEV representations; (v) Efficiency and Evaluation, covering computational acceleration techniques and benchmarking protocols for physical plausibility; and (vi) Datasets and Ecosystems, including large-scale robot learning corpora and industry technical reports that underpin the entire field. Through this organization, we illuminate the evolutionary trajectory from passive pixel predictors to active, reasoning, and action-grounded simulators. We identify critical open challenges—including physical consistency, cross-embodiment generalization, safety verification, and the sim-to-real evaluation gap—and outline future directions toward cognitive world models, autonomous data collection, and standardized open ecosystems. This survey aims to serve as a definitive reference for researchers and practitioners advancing the next generation of embodied intelligence.
Xin Jin
Xin Jin (Wed,) studied this question.
www.synapsesocial.com/papers/6a05685ca550a87e60a20ede — DOI: https://doi.org/10.5281/zenodo.20130369