Autonomous agents that perceive, reason, and interact with the world promise a future where intelligent systems assist with daily activities, support work in homes and factories, and take on labor-intensive or repetitive tasks across diverse environments. Tremendous progress has been made across embodied AI, from self-driving cars and autonomous drones to whole-body locomotion and dexterous manipulation, yet building generalist agents that learn continually, generalize broadly, and operate reliably in open-world settings remains an open challenge. Modern systems face three key limitations: the scarcity and cost of collecting large-scale interactive experience; the difficulty of grounding high-level goals into long-horizon behavior; and the fragmentation of perception, language understanding, and motor control across modular, separately trained components. This dissertation explores a unified and scalable recipe for building foundation agents. The core premise is that internet-scale multimodal data, large language models as the embodied reasoning engine, and unified Vision-Language-Action (VLA) architectures together provide a powerful path toward open-ended embodied intelligence. First, internet-scale multimodal data, including gameplay videos, human activity datasets, online tutorials, and wiki documentation, offers unprecedented breadth, exposing agents to diverse strategies, affordances, and environment configurations that cannot be replicated through controlled robotic data collection alone. Second, large language models provide a flexible cognitive layer for planning, task decomposition, iterative self-improvement, tool use, reward generation, and continual skill acquisition without parameter updates. Third, VLA models unify perception, language grounding, and continuous low-level action, enabling fluid real-time behavior across simulation, gaming environments, and physical robotic platforms, while transferring effectively across task families and embodiments. 
By integrating these three components, this dissertation advances the development of embodied agents that can draw upon internet-scale knowledge, reason through language, and translate abstract plans into physical behavior. Together, these ideas aim to move us closer to general, open-ended agents that operate with the reliability, adaptability, and competence required to assist humans across a wide range of real-world tasks.
Guanzhi Wang
California Institute of Technology
Guanzhi Wang studied this question.
www.synapsesocial.com/papers/696b2616d2a12237a934963f — DOI: https://doi.org/10.7907/r2ax-s535