In recent years, vision–language models (VLMs) have been introduced into remote sensing semantic segmentation to provide richer semantic representations through visual–textual alignment. However, most existing VLM-based segmentation methods focus on global semantic alignment and neglect pixel-level local neighborhood features, which are crucial for reliably interpreting remote sensing imagery with high spatial resolution, complex structures, and strong spatial continuity. To address this issue, we propose LoVLANet (Localized Vision–Language Attention Network), a vision–language segmentation framework that integrates language-driven global semantics with local spatial context. LoVLANet consists of a text encoder, a visual encoder, and a segmentation decoder. The text encoder is inherited from RemoteCLIP to preserve domain-adapted vision–language alignment, and the visual encoder is built upon a Vision Transformer (ViT). To enhance local dependency modeling, we propose a Neighborhood Key–Key Encoder. It uses a Gaussian-weighted neighborhood matrix to capture spatial correlation and key–key similarity to emphasize intrinsic semantic similarity over query-driven features, thereby preserving spatial consistency. Finally, the segmentation decoder fuses multi-scale visual features and aligns the image and text representations to generate pixel-level segmentation results. Experiments on RGB remote sensing benchmarks, including LoveDA and GID, show that LoVLANet achieves competitive segmentation performance under the adopted experimental settings, with improved mIoU and clearer boundary delineation in qualitative visualizations. These results suggest that explicitly modeling local neighborhood relationships is effective in VLM-based segmentation for supervised remote sensing scene understanding.
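The idea behind the Neighborhood Key–Key Encoder can be illustrated with a minimal NumPy sketch. It is not the LoVLANet implementation: the abstract does not say how the Gaussian-weighted neighborhood matrix and the key–key similarity are combined, so the sketch assumes the Gaussian weights multiply the softmax attention weights (equivalently, a distance penalty is added to the logits). The function names, the projection matrices `w_k` and `w_v`, and the bandwidth `sigma` are hypothetical and chosen only for illustration.

```python
import numpy as np


def squared_grid_distances(height, width):
    """Squared Euclidean distance between every pair of patch positions."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)  # (N, 2)
    diff = coords[:, None, :] - coords[None, :, :]                          # (N, N, 2)
    return (diff ** 2).sum(axis=-1)                                         # (N, N)


def gaussian_neighborhood(height, width, sigma=1.0):
    """Gaussian-weighted neighborhood matrix: nearby patches get weights near 1."""
    return np.exp(-squared_grid_distances(height, width) / (2.0 * sigma ** 2))


def key_key_attention(tokens, w_k, w_v, height, width, sigma=1.0):
    """Attention whose scores compare keys with keys instead of queries with keys.

    Assumption: the Gaussian neighborhood is applied as a log-domain bias, which
    equals multiplying the softmax weights by the Gaussian matrix and renormalizing.
    """
    keys = tokens @ w_k                                    # (N, d)
    values = tokens @ w_v                                  # (N, d)
    logits = keys @ keys.T / np.sqrt(keys.shape[-1])       # key-key similarity
    logits = logits - squared_grid_distances(height, width) / (2.0 * sigma ** 2)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


# Usage on a toy 3x4 grid of patch tokens with 8 channels (random placeholder data).
rng = np.random.default_rng(0)
grid_h, grid_w, dim = 3, 4, 8
patch_tokens = rng.normal(size=(grid_h * grid_w, dim))
proj_k = rng.normal(size=(dim, dim))
proj_v = rng.normal(size=(dim, dim))
output, attention = key_key_attention(patch_tokens, proj_k, proj_v, grid_h, grid_w)
```

Because the key–key score matrix is symmetric before normalization, two patches rate each other's similarity identically, and the Gaussian term then suppresses contributions from spatially distant patches. A smaller `sigma` restricts each patch to a tighter neighborhood.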
Zeng et al. (Tue,) studied this question.