As a fundamental task in computer vision, homography estimation plays a critical role in many applications, such as image stitching, augmented reality, and multi-view geometry. However, existing methods rely primarily on low-level visual features of the images to estimate the homography, which can lead to alignment distortions. Motivated by the remarkable success of vision-language models in various computer vision tasks, we propose a text-prompt multi-scale integration and detail-aware enhancement network (TMIDE-Net) for homography estimation, in which multi-modal information, i.e., visual and text information, is introduced to compensate for the insufficient geometric cues in low-texture regions. Specifically, with the help of auxiliary semantic language knowledge extracted by a frozen pretrained CLIP, a text-prompt multi-scale feature integration module (TMFIM) is designed to extract and fuse image features and text features. Then, we design a coarse-to-fine homography estimation module (CFHEM) to improve homography alignment accuracy in low-texture and illumination-variant regions, in which a detail-aware enhancement block is presented to strengthen the representation of fine-grained texture. Finally, a multi-constraint hybrid loss is applied to obtain robust homography estimation in complex scenes. Extensive experiments indicate that the proposed TMIDE-Net outperforms state-of-the-art methods both quantitatively and qualitatively, reducing the average error by approximately 13.5%.
• We propose a TMFIM to extract and fuse image features and text features.
• We design a CFHEM to improve homography alignment accuracy.
• Experimental results show that the TMIDE-Net outperforms other existing methods.
Fan et al. studied this question.