As a fundamental task in computer vision, homography estimation plays a critical role in many applications, such as image stitching, augmented reality, and multi-view geometry. However, existing methods rely primarily on low-level visual features to estimate the homography, which can lead to alignment distortions. Motivated by the remarkable success of vision-language models in various computer vision tasks, we propose a text-prompt multi-scale integration and detail-aware enhancement network (TMIDE-Net) for homography estimation, which introduces multi-modal information, i.e., visual and text information, to compensate for the insufficient geometric cues in low-texture regions. Specifically, with the help of auxiliary semantic language knowledge extracted by a frozen, pretrained CLIP model, a text-prompt multi-scale feature integration module is designed to extract and fuse image features and text features. Then, we design a coarse-to-fine homography estimation module to improve alignment accuracy in low-texture and illumination-variant regions, in which a detail-aware enhancement block strengthens the representation of fine-grained texture. Finally, a multi-constraint hybrid loss is applied to obtain robust homography estimation in complex scenes. Extensive experiments indicate that the proposed TMIDE-Net outperforms state-of-the-art methods both quantitatively and qualitatively, reducing the average error by approximately 13.5%.
• We propose a text-prompt multi-scale feature integration module (TMFIM) to extract and fuse image features and text features.
• We design a coarse-to-fine homography estimation module (CFHEM) to improve homography alignment accuracy.
• Experimental results show that TMIDE-Net outperforms existing methods.
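For readers less familiar with the task, the quantity being estimated is a 3×3 matrix that maps points in one image to points in another through homogeneous coordinates. The sketch below is not part of TMIDE-Net; it only illustrates how a homography is applied to a point and how an estimate might be scored against a reference. The function names, the choice of four patch corners, and the mean point error metric are assumptions made for illustration, and the abstract does not state which error measure the authors report.

```python
import math


def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography H (nested lists, row-major)."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under H")
    # Divide by the homogeneous coordinate to return to pixel coordinates.
    return (xh / w, yh / w)


def mean_point_error(H_est, H_true, points):
    """Average Euclidean distance between points warped by H_est and by H_true.

    Hypothetical metric for illustration; not taken from the paper.
    """
    total = 0.0
    for p in points:
        ex, ey = apply_homography(H_est, p)
        tx, ty = apply_homography(H_true, p)
        total += math.hypot(ex - tx, ey - ty)
    return total / len(points)


# Toy data (assumed): the corners of a 128x128 patch, a reference homography
# that is a pure translation by (3, 4), and an identity "estimate".
corners = [(0, 0), (127, 0), (127, 127), (0, 127)]
H_true = [[1, 0, 3], [0, 1, 4], [0, 0, 1]]
H_est = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

error = mean_point_error(H_est, H_true, corners)
print(f"mean point error: {error:.2f} px")
```

In this toy case every corner is displaced by the same (3, 4) offset, so the error is just the length of that offset. A learned estimator would be scored the same way, with `H_est` replaced by the network's prediction.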
Xiaoting Fan
Ronglu Wang
Shuang Liu
Journal of Visual Communication and Image Representation
Tianjin Normal University
Tianjin University of Commerce
Fan et al. (Tue,) studied this question.
www.synapsesocial.com/papers/69fd7ddcbfa21ec5bbf061ca — DOI: https://doi.org/10.1016/j.jvcir.2026.104837