Traditional task planning methods often lack generalization in diverse scenarios, while large language models (LLMs), though capable of open-world reasoning, struggle to align with physical environments and robotic systems. To address this limitation, we propose a Hierarchical Multimodal LLMs-Robotics Framework that integrates three modules. The Grounding Module maps multimodal inputs into PDDL representations to provide contextual grounding. The Planning Module generates task sequences using primitive libraries. The Acting Module optimizes and executes primitives on robotic platforms. The framework also explores the role of vague instructions in language-robot interaction, leveraging multimodal grounding to associate natural language with real-world contexts. Extensive experiments across three real-world scenarios, including ablation studies, demonstrate the framework’s effectiveness. The system achieved reliable performance in pick-and-place tasks and showed notable improvements in long-horizon tasks requiring spatial and geometric reasoning. These results indicate that the framework supports adaptive decision-making in complex environments and contributes to bridging the gap between LLMs, robotic systems, and the physical world.
Building similarity graph...
Analyzing shared references across papers
Loading...
Zhang et al. (Fri,) studied this question.
www.synapsesocial.com/papers/69fed090b9154b0b82877942 — DOI: https://doi.org/10.1049/icp.2026.1888
Bo Zhang
Yahui Gan
Zhigang Wang
IET conference proceedings.
Southeast University
Nantong University
State Council of the People's Republic of China
Building similarity graph...
Analyzing shared references across papers
Loading...