March 28, 2024 · Open Access

Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models

Abstract

The surge of Multimodal Large Language Models (MLLMs), with their prominent emergent capabilities in instruction following and reasoning, has greatly advanced the field of visual reasoning. However, constrained by lossy image tokenization, most MLLMs fall short of comprehensively capturing details of text and objects, especially in high-resolution images. To address this, we propose P2G, a novel framework for plug-and-play grounding of reasoning in MLLMs. Specifically, P2G exploits the tool-usage potential of MLLMs to employ expert agents for on-the-fly grounding to critical visual and textual objects in the image, thus achieving deliberate reasoning via multimodal prompting. We further create P2GB, a benchmark for assessing MLLMs' ability to understand inter-object relationships and text in challenging high-resolution images. Comprehensive experiments on visual reasoning tasks demonstrate the superiority of P2G. Notably, P2G achieves performance comparable to GPT-4V on P2GB with a 7B backbone. Our work highlights the potential of plug-and-play grounding of reasoning and opens up a promising alternative beyond model scaling.

Cite this study

Chen et al. (2024) studied this question.

www.synapsesocial.com/papers/68e71fddb6db6435876996ed — DOI: https://doi.org/10.48550/arxiv.2403.19322

Also consider

Synapse has enriched 5 closely related papers on similar questions. Consider them for comparative context:

Latent Visual Reasoning · 2025
Exploration of Stability Judgments: From Multimodal LLMs to Human Insights · 2025
Exploration of Stability Judgments: Assessing Multimodal LLMs in Game-Inspired Physical Reasoning Tasks · 2025
How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding · 2025 · 1 citation
NPHardEval4V: Dynamic Evaluation of Large Vision-Language Models with Effects of Vision · 2024 · 1 citation

Authors

Jiaxing Chen

Yuxuan Liu

Dehu Li
