February 27, 2024 · Open Access

Probing Multimodal Large Language Models for Global and Local Semantic Representation

Abstract

The success of large language models has inspired researchers to transfer their strong representational ability to other modalities. Several recent works leverage image-caption alignment datasets to train multimodal large language models (MLLMs), which achieve state-of-the-art performance on image-to-text tasks. However, very few studies explore whether MLLMs truly understand the complete image, i.e., global information, or whether they capture only local object information. In this study, we find that the intermediate layers of these models encode more global semantic information than the topmost layers: representation vectors taken from intermediate layers perform better on visual-language entailment tasks. We further probe the models for local semantic representation through object detection tasks. We conclude that the topmost layers may focus excessively on local information, which diminishes their ability to encode global information.
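The layer-wise probing idea in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' probe architecture, datasets, or models, so everything below is an assumption made for illustration: the function names are hypothetical, a simple nearest-centroid classifier stands in for whatever probe the paper uses, and the "hidden states" are synthetic toy vectors rather than real MLLM activations. The per-layer signal strengths are arbitrary and do not reproduce any result from the paper.

```python
import random


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_probe(features, labels):
    """Fit a nearest-centroid probe: one mean vector per class label."""
    by_class = {}
    for vec, lab in zip(features, labels):
        by_class.setdefault(lab, []).append(vec)
    return {lab: centroid(vecs) for lab, vecs in by_class.items()}


def probe_accuracy(probe, features, labels):
    """Fraction of examples whose nearest class centroid matches the label."""
    correct = 0
    for vec, lab in zip(features, labels):
        pred = min(probe, key=lambda c: sq_dist(vec, probe[c]))
        correct += int(pred == lab)
    return correct / len(labels)


def layerwise_probe(train_states, train_labels, test_states, test_labels):
    """Fit and score one probe per layer.

    train_states[l][i] is the representation of example i at layer l.
    Returns a list with one held-out accuracy per layer.
    """
    return [
        probe_accuracy(fit_probe(tr, train_labels), te, test_labels)
        for tr, te in zip(train_states, test_states)
    ]


def toy_states(labels, signal_per_layer, dim=8, seed=0):
    """Synthetic stand-in for per-layer hidden states (assumption, not real data)."""
    rng = random.Random(seed)
    states = []
    for signal in signal_per_layer:
        layer = []
        for lab in labels:
            direction = 1.0 if lab == 1 else -1.0
            layer.append([direction * signal + rng.gauss(0, 1) for _ in range(dim)])
        states.append(layer)
    return states


# Arbitrary toy signal strengths for three "layers"; not measurements.
signals = [0.1, 0.5, 0.2]
labels_train = [i % 2 for i in range(200)]
labels_test = [i % 2 for i in range(100)]
train = toy_states(labels_train, signals, seed=0)
test = toy_states(labels_test, signals, seed=1)

accs = layerwise_probe(train, labels_train, test, labels_test)
best_layer = max(range(len(accs)), key=accs.__getitem__)
print("per-layer probe accuracy:", accs)
print("best layer index:", best_layer)
```

With real models, the toy vectors would be replaced by representations extracted from each layer for the same inputs. Comparing held-out probe accuracy across layers is then one way to ask which layer's representation carries the most task-relevant information.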

Cite this study

Tao et al. (2024) studied this question.

www.synapsesocial.com/papers/68e77797b6db6435876ec069 — DOI: https://doi.org/10.48550/arxiv.2402.17304

Authors

Mingxu Tao

Quzhe Huang

Kun Xu
