October 27, 2025 | Open Access

Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial‐Ground Robotic System

Key Points

Hierarchical task decomposition with a prompted large language model supports robust semantic navigation.
The proposed framework uses a fine-tuned vision-language model, aided by GridMask, for fine-grained manipulation.
Real-world experiments validate reliability in challenging scenarios, including target-absent conditions.
Adaptive coordination between aerial and ground robots suggests promising applications in complex environments.

Abstract

Heterogeneous multirobot systems show great potential in complex tasks that require coordinated cooperation between different robot types. However, existing methods that rely on static or task‐specific models often lack generalizability across diverse tasks and dynamic environments. This highlights the need for generalizable intelligence that can bridge high‐level reasoning with low‐level execution across heterogeneous agents. To address this, a hierarchical multimodal framework that integrates a prompted large language model (LLM) with a fine‐tuned vision‐language model (VLM) is proposed. At the system level, the LLM performs hierarchical task decomposition and constructs a global semantic map, while the VLM provides semantic perception and object localization. The proposed GridMask significantly enhances the VLM's spatial accuracy, enabling reliable fine‐grained manipulation. The aerial robot leverages the global map to generate semantic paths and guide the ground robot's local navigation and manipulation, ensuring robust coordination even in target‐absent or ambiguous scenarios. The framework is validated through extensive simulation and real‐world experiments on long‐horizon object arrangement tasks, demonstrating zero‐shot adaptability, robust semantic navigation, and reliable manipulation in dynamic environments. To the best of our knowledge, this work presents the first heterogeneous aerial–ground robotic system that integrates VLM‐based perception with LLM‐driven reasoning for global high‐level task planning and execution.
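
As an illustration only, the idea of generating a path over a global semantic map, including the target-absent case, can be sketched as a breadth-first search on a labeled grid. This is not the authors' implementation: the grid encoding (`"."` for free space, `"#"` for obstacles, object names as labels), the function name `semantic_path`, and the choice of breadth-first search are all assumptions made for this example.

```python
from collections import deque


def semantic_path(grid, start, target_label):
    """Breadth-first search over a labeled grid (hypothetical map encoding).

    grid: list of rows; each cell is "." (free), "#" (obstacle),
          or an object label such as "cup".
    Returns the list of (row, col) cells from start to the nearest cell
    labeled target_label, or None if the target is absent or unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        r, c = cell
        if grid[r][c] == target_label:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and grid[nxt[0]][nxt[1]] != "#" and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None  # target-absent (or unreachable) case


# Hypothetical 3x3 map with one labeled object.
demo_map = [
    [".", ".", "#"],
    ["#", ".", "cup"],
    [".", ".", "."],
]
print(semantic_path(demo_map, (0, 0), "cup"))
print(semantic_path(demo_map, (0, 0), "box"))
```

Returning `None` when no cell carries the requested label gives the caller an explicit signal for the target-absent case, which a higher-level planner could use to trigger replanning or further exploration.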

Cite this study

Liu et al. studied this question.

www.synapsesocial.com/papers/68ff87e2c8c50a61f2bdd0cc — DOI: https://doi.org/10.1002/aisy.202500640

Authors

Haokun Liu

Zheng Ma

Yunong Li

Journals

Advanced Intelligent Systems

Institutions

The University of Tokyo
