September 10, 2025 · Open Access

Global and Local Semantic Completion Learning for Vision-Language Pre-Training

Key Points

GLSCL task enhances both global-local alignment and local-local alignment in vision-language models.
Masked global semantic completion improves the representativeness of global features, boosting model performance.
The proposed model achieves state-of-the-art results on vision-language benchmarks, such as image-text retrieval.
ALIGN-BENCH serves as a validation benchmark for assessing cross-modal alignment efficacy.

Abstract

Cross-modal alignment plays a crucial role in vision-language pre-training (VLP) models, enabling them to capture meaningful associations across different modalities. For this purpose, inspired by the success of masked language modeling (MLM) tasks in the NLP pre-training area, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to focus on reconstructing the masked tokens based on visible context for learning local-local alignment, i.e., associations between image patches and text tokens. However, most of them pay little attention to the global semantic features generated for the masked data, resulting in a limited cross-modal alignment ability of global representations to local features of the other modality. Therefore, in this paper, we propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously. Specifically, the GLSCL task complements the missing semantics of masked data and recovers global and local features by cross-modal interactions. Our GLSCL consists of masked global semantic completion (MGSC) and masked local token completion (MLTC). MGSC promotes learning more representative global features, which have a great impact on the performance of downstream tasks, while MLTC reconstructs modal-fusion local tokens, further enhancing accurate comprehension of multimodal data. To evaluate the proposed approaches on cross-modal alignment, we develop a validation benchmark called ALIGN-BENCH. Moreover, we present a flexible vision encoder, enabling our model to simultaneously perform image-text and video-text multimodal tasks. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.
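
The abstract describes the two GLSCL objectives only in words, so the following is a minimal sketch rather than the authors' implementation. It assumes, purely for illustration, that both MGSC and MLTC can be written as an InfoNCE-style contrastive loss that pulls features recovered from masked input toward the features of the corresponding complete input. The function names, the temperature value, and the unweighted sum of the two terms are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def completion_contrastive_loss(recovered, complete, temperature=0.07):
    """InfoNCE-style loss (an assumption, not taken from the abstract).

    Row i of `recovered` (features computed from masked data) should be most
    similar to row i of `complete` (features computed from unmasked data).
    Both arrays have shape (n, dim).
    """
    r = l2_normalize(recovered)
    c = l2_normalize(complete)
    logits = r @ c.T / temperature                        # (n, n) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def glscl_sketch_loss(global_masked, global_complete,
                      tokens_masked, tokens_complete, mask):
    """Hypothetical combination of a global term (MGSC) and a local term (MLTC).

    global_*: (batch, dim) global features.
    tokens_*: (batch, seq, dim) local token features.
    mask:     (batch, seq) boolean array, True where a token was masked.
    """
    # Global term: one recovered global feature per sample.
    mgsc = completion_contrastive_loss(global_masked, global_complete)
    # Local term: only the masked positions are compared.
    mltc = completion_contrastive_loss(tokens_masked[mask], tokens_complete[mask])
    return mgsc + mltc
```

Under these assumptions, features recovered close to their complete counterparts give a loss near zero, while mismatched pairs give a larger loss; in a real model the two inputs would come from the fusion encoder run on masked and unmasked data respectively.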

Cite this study

Tu et al. (2025) studied this question.

www.synapsesocial.com/papers/68c1b81854b1d3bfb60ec46a — DOI: https://doi.org/10.1109/tpami.2025.3596394

Also consider

Synapse has enriched 5 closely related papers on similar research questions. Consider them for comparative context:

Multimodal Prototype-Enhanced Network for Few-Shot Action Recognition · 2024 · 14 citations
“Cloze Procedure”: A New Tool for Measuring Readability · 1953 · 2,338 citations
Vision-Language Pre-Training with Triple Contrastive Learning · 2022 · 268 citations
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval · 2022 · 32 citations
Clover: Towards A Unified Video-Language Alignment and Fusion Model

Authors

Rong-Cheng Tu

Yatai Ji

Jie Jiang

Journals

IEEE Transactions on Pattern Analysis and Machine Intelligence

Institutions

Tsinghua University

Nanyang Technological University

Tencent (China)
