January 17, 2026 | Open Access

Building Foundation Agents with Internet Knowledge and Large Language Models

Key Points

The research aims to develop foundation agents capable of learning continually and generalizing in real-world scenarios.
It draws on internet-scale multimodal data, including gameplay videos and human activity datasets.
It uses large language models for cognitive processes such as planning and iterative self-improvement.
It develops unified Vision-Language-Action (VLA) architectures that integrate perception, language, and action.
The work reports significant advances in agents' ability to operate reliably across diverse environments.
Agents showed improved task performance and adaptability by drawing on internet knowledge and language understanding.
The VLA architecture transferred knowledge effectively across different tasks and embodiments.

Abstract

Autonomous agents that perceive, reason, and interact with the world promise a future where intelligent systems assist with daily activities, support work in homes and factories, and take on labor-intensive or repetitive tasks across diverse environments. Tremendous progress has been made across embodied AI, from self-driving cars and autonomous drones to whole-body locomotion and dexterous manipulation, yet building generalist agents that learn continually, generalize broadly, and operate reliably in open-world settings remains an open challenge. Modern systems face three key limitations: the scarcity and cost of collecting large-scale interactive experience; the difficulty of grounding high-level goals into long-horizon behavior; and the fragmentation of perception, language understanding, and motor control across modular, separately trained components. This dissertation explores a unified and scalable recipe for building foundation agents. The core premise is that internet-scale multimodal data, large language models as the embodied reasoning engine, and unified Vision-Language-Action (VLA) architectures together provide a powerful path toward open-ended embodied intelligence. First, internet-scale multimodal data, including gameplay videos, human activity datasets, online tutorials, and wiki documentation, offers unprecedented breadth, exposing agents to diverse strategies, affordances, and environment configurations that cannot be replicated through controlled robotic data collection alone. Second, large language models provide a flexible cognitive layer for planning, task decomposition, iterative self-improvement, tool use, reward generation, and continual skill acquisition without parameter updates. Third, VLA models unify perception, language grounding, and continuous low-level action, enabling fluid real-time behavior across simulation, gaming environments, and physical robotic platforms, while transferring effectively across task families and embodiments. 
By integrating these three components, this dissertation advances the development of embodied agents that can draw upon internet-scale knowledge, reason through language, and translate abstract plans into physical behavior. Together, these ideas aim to move us closer to general, open-ended agents that operate with the reliability, adaptability, and competence required to assist humans across a wide range of real-world tasks.
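
The abstract's second component, continual skill acquisition without parameter updates, can be illustrated with a minimal sketch. This is not the dissertation's implementation. The names `SkillLibrary`, `acquire_skill`, `toy_planner`, and `toy_verifier` are hypothetical, and the planner is a hard-coded stub standing in for a large language model. The sketch assumes only that a planner proposes a candidate skill, an environment check returns feedback, and verified skills are stored for reuse.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

# Hypothetical stand-in for a large language model: maps a task and the
# feedback from the previous attempt (if any) to a candidate skill string.
Planner = Callable[[str, Optional[str]], str]
# Hypothetical environment check: returns (success, feedback) for a skill.
Verifier = Callable[[str, str], Tuple[bool, str]]


@dataclass
class SkillLibrary:
    """Stores verified skills by task name; it grows with no weight updates."""
    skills: Dict[str, str] = field(default_factory=dict)

    def add(self, task: str, skill: str) -> None:
        self.skills[task] = skill

    def get(self, task: str) -> Optional[str]:
        return self.skills.get(task)


def acquire_skill(task: str, planner: Planner, verifier: Verifier,
                  library: SkillLibrary, max_rounds: int = 3) -> bool:
    """Propose, verify, and refine a skill; store it once it succeeds."""
    if library.get(task) is not None:
        return True  # already learned, reuse the stored skill
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = planner(task, feedback)       # propose or revise
        ok, feedback = verifier(task, candidate)  # try it, collect feedback
        if ok:
            library.add(task, candidate)          # knowledge lives in the library
            return True
    return False


def toy_planner(task: str, feedback: Optional[str]) -> str:
    # Assumption for illustration: the first attempt omits a prerequisite,
    # and any feedback triggers a revised attempt that includes it.
    if feedback is None:
        return f"do({task})"
    return f"prepare(); do({task})"


def toy_verifier(task: str, skill: str) -> Tuple[bool, str]:
    if skill.startswith("prepare()"):
        return True, "ok"
    return False, "missing prerequisite: call prepare() first"


library = SkillLibrary()
learned = acquire_skill("craft_table", toy_planner, toy_verifier, library)
```

In this toy run the first proposal fails, the feedback drives a revision, and the second proposal is stored. The design point is that improvement comes from the feedback loop and the growing library rather than from changing model weights.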

Authors

Guanzhi Wang

Institutions

California Institute of Technology
