April 10, 2024 | Open Access

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Abstract

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks such as distinguishing a left location from a right one. In this work, we explore how instruction fine-tuning objectives based on image-space coordinates could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.

Cite this study

Ranasinghe et al. (2024) studied this question.

www.synapsesocial.com/papers/68e6fb90b6db643587675f5f — DOI: https://doi.org/10.48550/arxiv.2404.07449

Authors

Kanchana Ranasinghe

Satya Narayan Shukla

Omid Poursaeed
