The success of large language models has inspired researchers to transfer their exceptional representational ability to other modalities. Several recent works leverage image-caption alignment datasets to train multimodal large language models (MLLMs), which achieve state-of-the-art performance on image-to-text tasks. However, very few studies explore whether MLLMs truly understand the complete image information, i.e., global information, or whether they capture only some local object information. In this study, we find that the intermediate layers of the models encode more global semantic information than the topmost layers: their representation vectors perform better on visual-language entailment tasks. We further probe the models for local semantic representations through object detection tasks, and conclude that the topmost layers may focus excessively on local information, which diminishes their ability to encode global information.
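The layer-wise comparison described in the abstract can be illustrated with a simple linear probe. The sketch below is not the authors' setup: it uses synthetic "hidden states" rather than real MLLM activations, and every name in it (`fit_linear_probe`, `probe_accuracy`, `probe_layers`, `make_synthetic_layers`) is hypothetical. It only shows the general idea of fitting one probe per layer and comparing held-out accuracy.

```python
import numpy as np


def fit_linear_probe(x_train, y_train, ridge=1e-3):
    """Fit a ridge-regularized least-squares probe; labels are in {-1, +1}."""
    # Append a constant column so the probe can learn a bias term.
    x = np.hstack([x_train, np.ones((x_train.shape[0], 1))])
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y_train)


def probe_accuracy(features, labels, train_frac=0.5):
    """Train on the first part of the data, report accuracy on the rest."""
    n_train = int(len(labels) * train_frac)
    weights = fit_linear_probe(features[:n_train], labels[:n_train])
    n_test = len(labels) - n_train
    x_test = np.hstack([features[n_train:], np.ones((n_test, 1))])
    preds = np.sign(x_test @ weights)
    return float(np.mean(preds == labels[n_train:]))


def probe_layers(hidden_states, labels):
    """Return one held-out probe accuracy per layer."""
    return [probe_accuracy(h, labels) for h in hidden_states]


def make_synthetic_layers(n_samples=400, dim=16,
                          strengths=(0.0, 3.0, 0.5), seed=0):
    """ASSUMPTION: stand-in data, not real model activations.

    Each 'layer' is Gaussian noise plus a label-dependent shift whose
    magnitude is set by the matching entry of `strengths`.
    """
    rng = np.random.default_rng(seed)
    labels = rng.choice([-1.0, 1.0], size=n_samples)
    layers = []
    for strength in strengths:
        direction = rng.normal(size=dim)
        direction /= np.linalg.norm(direction)
        noise = rng.normal(size=(n_samples, dim))
        layers.append(noise + strength * labels[:, None] * direction[None, :])
    return layers, labels


if __name__ == "__main__":
    layers, labels = make_synthetic_layers()
    for index, acc in enumerate(probe_layers(layers, labels)):
        print(f"layer {index}: held-out probe accuracy = {acc:.2f}")
```

With real models, the synthetic layers would be replaced by per-layer representation vectors extracted for each image-text pair, and the binary labels by the entailment labels of the task. The probe is kept linear here so that differences in accuracy reflect what each layer encodes rather than the capacity of the probe.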
Tao et al. (Tue,) studied this question.
www.synapsesocial.com/papers/68e77797b6db6435876ec069 — DOI: https://doi.org/10.48550/arxiv.2402.17304
Mingxu Tao
Quzhe Huang
Kun Xu