Study type: Quantitative study

October 12, 2025 (Open Access)

Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding

Key points

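The proposed framework, PAML, targets three weaknesses of transformer-based visual grounding in open-vocabulary scenes: imperfect visual-linguistic alignment, insufficient cross-modal feature fusion, and ineffective use of semantic prototype information.
ALBEF provides cross-modal alignment during initial feature encoding, and a Visual Discriminative Feature Encoder enhances salient object representations while suppressing irrelevant visual context.
A prototype discovering and inheriting mechanism extracts and aggregates multi-neighbor semantic prototypes to help recognize novel object categories.
Experiments on five benchmark datasets show competitive performance in standard scenes and state-of-the-art results in open-vocabulary scenes.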

Abstract

Visual Grounding (VG) aims to locate specific target objects within images from natural language queries. While current transformer-based approaches demonstrate strong localization performance in the standard scene (i.e., scenarios without any novel objects), they exhibit notable limitations in the open-vocabulary scene (i.e., scenarios in which both familiar and novel object categories appear during testing). These limitations primarily stem from three key factors: (1) imperfect alignment between visual and linguistic modalities, (2) insufficient cross-modal feature fusion, and (3) ineffective utilization of semantic prototype information. To overcome these challenges, we present Prototype-Aware Multimodal Learning (PAML), a framework that addresses these issues through several key components. First, we leverage ALBEF to establish robust cross-modal alignment during initial feature encoding. Subsequently, our Visual Discriminative Feature Encoder selectively enhances salient object representations while suppressing irrelevant visual context. The framework then incorporates a novel prototype discovering and inheriting mechanism that extracts and aggregates multi-neighbor semantic prototypes to facilitate open-vocabulary recognition. These enriched features are integrated across modalities by our Multi-stage Decoder before final bounding box regression. Extensive experiments across five benchmark datasets validate our approach, showing competitive performance in the standard scene while achieving state-of-the-art results in the open-vocabulary scene. Our code is available at https://github.com/plankXie/PAML.
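
The abstract does not spell out how multi-neighbor semantic prototypes are aggregated, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a small bank of prototype vectors, picks the k prototypes most similar to a feature by cosine similarity, and blends them with softmax weights. The function names, the choice of cosine similarity, the value of k, and the temperature are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon avoids division by zero


def aggregate_neighbor_prototypes(feature, prototypes, k=2, temperature=0.1):
    """Blend the k prototypes most similar to `feature`.

    Hypothetical helper: the neighbor count, the softmax weighting, and the
    temperature are illustrative assumptions, not details from the paper.
    """
    if not prototypes:
        raise ValueError("prototype bank is empty")
    sims = [cosine(feature, p) for p in prototypes]
    # Indices of the k most similar prototypes (the "neighbors").
    top = sorted(range(len(prototypes)), key=lambda i: sims[i], reverse=True)[:k]
    # Softmax over the selected similarities; subtract the max for stability.
    max_sim = max(sims[i] for i in top)
    exps = [math.exp((sims[i] - max_sim) / temperature) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the neighbor prototypes, dimension by dimension.
    dim = len(feature)
    blended = [
        sum(w * prototypes[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]
    return top, weights, blended


if __name__ == "__main__":
    bank = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy prototype bank
    neighbors, weights, blended = aggregate_neighbor_prototypes([0.9, 0.1], bank, k=2)
    print(neighbors, weights, blended)
```

In this toy setup a feature for an unseen category borrows most of its representation from its closest known prototype and a smaller share from the next closest, which is the intuition behind using several neighbors rather than a single nearest match. The released code at the URL above is the reference for what PAML actually does.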

Authors

Jiapeng Xie

Xiaolong Zheng

Liang Zheng
