Heterogeneous multirobot systems show great potential in complex tasks requiring coordinated hybrid cooperation. However, existing methods that rely on static or task‐specific models often lack generalizability across diverse tasks and dynamic environments. This highlights the need for generalizable intelligence that can bridge high‐level reasoning with low‐level execution across heterogeneous agents. To address this, a hierarchical multimodal framework that integrates a prompted large language model (LLM) with a fine‐tuned vision‐language model (VLM) is proposed. At the system level, the LLM performs hierarchical task decomposition and constructs a global semantic map, while the VLM provides semantic perception and object localization, where the proposed GridMask significantly enhances the VLM's spatial accuracy for reliable fine‐grained manipulation. The aerial robot leverages this global map to generate semantic paths and guide the ground robot's local navigation and manipulation, ensuring robust coordination even in target‐absent or ambiguous scenarios. The framework is validated through extensive simulation and real‐world experiments on long‐horizon object arrangement tasks, demonstrating zero‐shot adaptability, robust semantic navigation, and reliable manipulation in dynamic environments. To the best of our knowledge, this work presents the first heterogeneous aerial–ground robotic system that integrates VLM‐based perception with LLM‐driven reasoning for global high‐level task planning and execution.
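As a rough illustration of how a global semantic map might guide a ground robot, the sketch below runs a breadth-first search over a small labeled grid and returns a cell-by-cell path to the nearest cell carrying a requested label. It is not the authors' implementation: the grid encoding, the labels ("free", "obstacle", "cup"), and the function name `plan_semantic_path` are all assumptions made for this example. Returning `None` stands in for the target-absent case, where a real system would fall back to some other behavior.

```python
from collections import deque

# Hypothetical semantic map: each cell holds a label (assumed encoding).
SEMANTIC_MAP = [
    ["free", "free",     "obstacle", "cup"],
    ["free", "obstacle", "obstacle", "free"],
    ["free", "free",     "free",     "free"],
]


def plan_semantic_path(grid, start, target_label):
    """Breadth-first search from `start` to the nearest cell labeled `target_label`.

    Returns a list of (row, col) cells including both endpoints, or None when
    no reachable cell carries the label (the target-absent case).
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    parent = {start: None}  # also serves as the visited set

    while queue:
        cell = queue.popleft()
        r, c = cell
        if grid[r][c] == target_label:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        # Expand 4-connected neighbors that are in bounds and not obstacles.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            nr, nc = nxt
            if (0 <= nr < rows and 0 <= nc < cols
                    and nxt not in parent
                    and grid[nr][nc] != "obstacle"):
                parent[nxt] = cell
                queue.append(nxt)
    return None


path_to_cup = plan_semantic_path(SEMANTIC_MAP, (0, 0), "cup")
path_to_box = plan_semantic_path(SEMANTIC_MAP, (0, 0), "box")  # label not on the map
```

Here `path_to_cup` routes around the obstacle cells along the bottom row, while `path_to_box` is `None` because no cell carries that label. Breadth-first search is used only because it is the simplest planner that yields a shortest path on an unweighted grid; the paper's own planner may differ.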
Liu et al. (Sun,) studied this question.
www.synapsesocial.com/papers/68ff87e2c8c50a61f2bdd0cc — DOI: https://doi.org/10.1002/aisy.202500640
Haokun Liu
Zheng Ma
Yunong Li
Advanced Intelligent Systems
The University of Tokyo