Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires an embodied robot to navigate to a target destination by following a natural language instruction. Most existing methods rely on panoramic RGB-D cameras for 360° observation of the environment. However, these methods struggle in real-world applications because of the high cost of panoramic RGB-D cameras. This paper studies a low-cost and practical VLN-CE setting, e.g., one using monocular cameras with a limited field of view, which means the robot must "Look Less" in terms of visual observations and environment semantics. We propose ThinkMatter, a framework for monocular VLN-CE that motivates monocular robots to "Think More" by 1) generating novel views and 2) integrating instruction semantics. Specifically, we achieve the former with the proposed 3DGS-based panoramic generation, which renders novel views at each step from the collection of past observations. We achieve the latter with the proposed enhancement of occupancy-instruction semantics, which integrates the spatial semantics of occupancy maps with the textual semantics of language instructions. Together, these operations give monocular robots wider environment perception as well as transparent semantic connections with the instruction. Extensive experiments in both simulators and real-world environments demonstrate the effectiveness of ThinkMatter, providing a promising practice for real-world navigation.
Guangzhao Dai
Shuo Wang
Hao Zhao
IEEE Transactions on Image Processing
Tsinghua University
Nanjing University of Science and Technology
Singapore Management University
Dai et al. (Thu,) studied this question.
www.synapsesocial.com/papers/69a75ca4c6e9836116a25ae2 — DOI: https://doi.org/10.1109/tip.2026.3652003