This paper explores the capability of Large Language Models (LLMs) to perform zero-shot planning through multimodal reasoning, with a particular emphasis on applications to Autonomous Mobile Robots (AMR) and unmanned systems. We present a modular system architecture that integrates a general-purpose LLM with visual and spatial inputs, enabling adaptive planning that iteratively guides robot behavior. To assess performance, we employ a continuous evaluation metric that jointly considers distance and orientation, offering a more informative and fine-grained alternative to binary success measures. We evaluate three foundational LLMs (i.e., GPT-4.1-nano, GPT-4o-mini, and Gemini 2.0 Flash) on a suite of zero-shot navigation and exploration tasks in simulated environments. Our findings show that LLMs exhibit encouraging signs of goal-directed spatial planning and partial task completion, even in a zero-shot setting. However, inconsistencies in plan generation across models highlight the need for task-specific adaptation or fine-tuning. These results support the use of multimodal inputs as key enablers for advancing LLM-based autonomy in AMR and unmanned systems.
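The abstract does not give the formula for the continuous metric, so the following is only a minimal sketch of how a score that jointly considers distance and orientation could be built. The function name `pose_score`, the normalization distance `d_max`, the weight `w_dist`, and the linear weighting itself are all assumptions made for illustration, not the authors' definition.

```python
import math


def pose_score(pos, yaw, goal_pos, goal_yaw, d_max=5.0, w_dist=0.5):
    """Hypothetical continuous score in [0, 1]; 1.0 means the goal pose is matched.

    pos, goal_pos: (x, y) positions in meters.
    yaw, goal_yaw: headings in radians.
    d_max:  assumed distance at which the distance term drops to 0.
    w_dist: assumed weight of the distance term (orientation gets 1 - w_dist).
    """
    # Distance term: 1 at the goal, falling linearly to 0 at d_max and beyond.
    dist = math.dist(pos, goal_pos)
    dist_term = max(0.0, 1.0 - dist / d_max)

    # Orientation term: wrap the heading error to [-pi, pi], then scale to [0, 1].
    dyaw = math.atan2(math.sin(yaw - goal_yaw), math.cos(yaw - goal_yaw))
    orient_term = 1.0 - abs(dyaw) / math.pi

    # Weighted combination of the two terms.
    return w_dist * dist_term + (1.0 - w_dist) * orient_term


if __name__ == "__main__":
    # A robot halfway to d_max and rotated 90 degrees away from the goal heading.
    print(pose_score((2.5, 0.0), math.pi / 2, (0.0, 0.0), 0.0))
```

Unlike a binary success flag, a score of this kind distinguishes a run that stops near the goal with the wrong heading from one that never approaches it, which is the type of partial task completion the abstract refers to.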
Olaiya et al. (Sat,) studied this question.
www.synapsesocial.com/papers/68ed3352c8c3d6f5ff5dd904 — DOI: https://doi.org/10.20944/preprints202510.0846.v1
Kelvin Olaiya
Giovanni Delnevo
Chan–Tong Lam