ABSTRACT Semantic alignment is a key component of image‐text matching in vision‐language research: it aims to measure the semantic similarity between images and texts accurately. Most existing approaches rely on rigid attention mechanisms and static similarity fusion strategies, which limit a model's capacity to establish fine‐grained cross‐modal alignment and thereby degrade image‐text matching performance. This work proposes cross‐modal dynamic semantic alignment for image‐text matching via progressive optimization (CDSA‐PO). First, we introduce a context‐aware feature enhancement module that adaptively refines channel‐wise feature weights to reduce noise and enhance fine‐grained semantics. Second, we propose a dynamic cross‐modal aligner that adaptively learns modality‐specific attention coefficients for fine‐grained region‐word alignment via iterative optimization, improving the granularity and fidelity of cross‐modal correspondences. Finally, we introduce a progressive similarity integrator that iteratively refines similarity aggregation, guided by historical alignment cues. Experiments on Flickr30K and MS‐COCO demonstrate that CDSA‐PO significantly outperforms state‐of‐the‐art baselines in image‐text matching.
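To make the idea of region‐word alignment with iteratively refined similarity aggregation more concrete, the following is a minimal sketch and not the CDSA‐PO implementation. It assumes a generic cross‐attention similarity (each word attends over image regions by cosine similarity) and stands in for the progressive integrator with a simple running blend of the previous and current scores. The function names, the `momentum` parameter, and the temperature‐halving schedule are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def softmax(xs, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(word, regions, temperature):
    """Attention-weighted average of region vectors for one word."""
    weights = softmax([cosine(word, r) for r in regions], temperature)
    dim = len(regions[0])
    return [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]


def image_text_similarity(regions, words, steps=3, momentum=0.5):
    """Average word-to-context similarity, blended across steps.

    Assumption: a running blend of previous and current scores is used here
    as a stand-in for "historical alignment cues"; it is not the paper's method.
    """
    score = 0.0
    temperature = 1.0
    for step in range(steps):
        per_word = [cosine(w, attend(w, regions, temperature)) for w in words]
        current = sum(per_word) / len(per_word)
        score = current if step == 0 else momentum * score + (1 - momentum) * current
        temperature *= 0.5  # hypothetical schedule: sharpen attention each step
    return score


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]       # toy region features
    matched = [[1.0, 0.0], [0.0, 1.0]]       # words aligned with the regions
    mismatched = [[-1.0, 0.0], [0.0, -1.0]]  # words opposed to the regions
    print(image_text_similarity(regions, matched))
    print(image_text_similarity(regions, mismatched))
```

On the toy inputs, the matched caption should score higher than the mismatched one, because each matched word finds a region pointing in the same direction. A learned model would replace the fixed cosine attention with trained, modality‐specific coefficients.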
Liang Zhang
Likai Chong
Rui Shi
IET Image Processing
Hohai University
Nanjing Hydraulic Research Institute
Zhang et al. studied this question.
www.synapsesocial.com/papers/69df2c01e4eeef8a2a6b0fe7 — DOI: https://doi.org/10.1049/ipr2.70357