To address the limitations of zero-shot generalization in Vision-Language Navigation (VLN), this paper proposes a novel knowledge graph-driven reinforcement learning approach. Our method constructs a hierarchical, dynamically updated knowledge graph online during the agent’s real-time interaction with the environment, seamlessly aligning external semantic priors with continuous visual perception. By leveraging a Chain-of-Thought (CoT) prompting mechanism, the agent performs multi-hop reasoning to precisely locate target objects. Furthermore, we design an end-to-end optimized reinforcement learning framework that fuses multi-modal features and employs a task-oriented composite reward function. Extensive experiments in the AI2-THOR simulation environment demonstrate that the proposed method significantly improves navigation success rates in zero-shot settings. The results validate its robust generalization capabilities, particularly for unseen object categories and complex scene layouts.
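As a rough illustration of multi-hop reasoning over an online-updated, hierarchical knowledge graph, the following minimal Python sketch stores (head, relation, tail) triples and searches for a relation chain from the agent to a target object. It is not the authors' implementation: the class name `KnowledgeGraph`, the relation labels, the example entities, and the breadth-first search are all assumptions made for illustration. The abstract does not specify the graph schema or the reasoning procedure beyond the use of Chain-of-Thought prompting.

```python
from collections import defaultdict, deque


class KnowledgeGraph:
    """Hypothetical triple store; the paper's actual schema is not given in the abstract."""

    def __init__(self):
        # Maps a head entity to the set of (relation, tail) pairs leaving it.
        self.edges = defaultdict(set)

    def add_relation(self, head, relation, tail):
        # Called both for external priors and for facts observed during navigation.
        self.edges[head].add((relation, tail))

    def multi_hop(self, start, target, max_hops=3):
        """Breadth-first search for a chain of at most max_hops triples from start to target."""
        queue = deque([(start, [])])
        visited = {start}
        while queue:
            node, path = queue.popleft()
            if node == target:
                return path
            if len(path) >= max_hops:
                continue  # hop budget exhausted on this branch
            # Sorting keeps the traversal order deterministic.
            for relation, nxt in sorted(self.edges.get(node, ())):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((nxt, path + [(node, relation, nxt)]))
        return None  # no chain found within the hop budget


kg = KnowledgeGraph()
# Assumed semantic priors (room -> receptacle -> object hierarchy).
kg.add_relation("kitchen", "contains", "countertop")
kg.add_relation("countertop", "supports", "toaster")
# Assumed online update from the agent's current observation.
kg.add_relation("agent", "located_in", "kitchen")

path = kg.multi_hop("agent", "toaster")
for head, relation, tail in path:
    print(f"{head} --{relation}--> {tail}")
```

In this sketch the returned chain plays the role of an explicit reasoning trace: each triple is one hop, and a tighter `max_hops` simply returns `None` when the target is out of reach.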
Ye Zhang
Yandong Zhao
He Liu
Mathematics
Taiyuan University of Science and Technology
Zhang et al. studied this question.
www.synapsesocial.com/papers/69f594fc71405d493afffee5 — DOI: https://doi.org/10.3390/math14091485