May 8, 2026 · Open Access

Text-prompt multi-scale integration and detail-aware enhancement network for homography estimation

Key Points

The aim is to enhance homography estimation by integrating visual and text information to reduce alignment errors.
A text-prompt multi-scale feature integration module (TMFIM) fuses image features with text features.
A coarse-to-fine homography estimation module (CFHEM) improves alignment accuracy in low-texture and illumination-variant regions.
A multi-constraint hybrid loss makes homography estimation more robust in complex scenes.
TMIDE-Net reduces the average homography estimation error by approximately 13.5%.
TMIDE-Net outperforms state-of-the-art methods both quantitatively and qualitatively.
A detail-aware enhancement block strengthens fine-grained texture representation in low-texture areas.

Abstract

As a fundamental task in computer vision, homography estimation plays a critical role in many applications, such as image stitching, augmented reality, and multi-view geometry. However, existing methods rely primarily on low-level visual image features to estimate the homography, which can produce alignment distortions. Motivated by the success of vision-language models in various computer vision tasks, we propose a text-prompt multi-scale integration and detail-aware enhancement network (TMIDE-Net) for homography estimation, in which multi-modal information, i.e., visual and text information, is introduced to compensate for insufficient geometric cues in low-texture regions. Specifically, with the help of auxiliary semantic language knowledge extracted by a frozen pretrained CLIP, a text-prompt multi-scale feature integration module (TMFIM) is designed to extract and fuse image features and text features. Then, we design a coarse-to-fine homography estimation module (CFHEM) to improve homography alignment accuracy in low-texture and illumination-variant regions, where a detail-aware enhancement block is presented to enhance the fine-grained texture representation capability. Finally, a multi-constraint hybrid loss is applied to obtain robust homography estimation in complex scenes. Extensive experiments indicate that the proposed TMIDE-Net outperforms state-of-the-art methods both quantitatively and qualitatively, reducing the average error by approximately 13.5%.

• We propose a TMFIM to extract and fuse image features and text features.
• We design a CFHEM to improve homography alignment accuracy.
• Experimental results show that the TMIDE-Net outperforms other existing methods.
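To make the coarse-to-fine idea concrete, the following is a minimal sketch, not the TMIDE-Net implementation. It assumes that a fine stage estimates a residual homography which is composed with the coarse estimate, and it scores the result with mean corner error, a commonly used homography metric; the abstract does not state which error metric the reported 13.5% refers to. All matrices, function names, and the 128-pixel patch size are hypothetical values chosen for illustration.

```python
# Illustrative sketch only: names and numbers are hypothetical, not from TMIDE-Net.

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def warp_point(h, x, y):
    """Map (x, y) through homography h using homogeneous coordinates."""
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return xh / w, yh / w

def mean_corner_error(h_est, h_true, corners):
    """Average Euclidean distance between corners warped by each homography."""
    total = 0.0
    for x, y in corners:
        ex, ey = warp_point(h_est, x, y)
        tx, ty = warp_point(h_true, x, y)
        total += ((ex - tx) ** 2 + (ey - ty) ** 2) ** 0.5
    return total / len(corners)

# Hypothetical ground truth: a pure translation by (5, -3) pixels.
h_true = [[1.0, 0.0, 5.0], [0.0, 1.0, -3.0], [0.0, 0.0, 1.0]]
# Assumed coarse estimate that recovers most of the motion.
h_coarse = [[1.0, 0.0, 4.0], [0.0, 1.0, -2.0], [0.0, 0.0, 1.0]]
# Assumed residual estimated by a fine stage on the coarsely aligned images.
h_residual = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0], [0.0, 0.0, 1.0]]
# The residual is applied after the coarse warp, so it multiplies on the left.
h_refined = matmul3(h_residual, h_coarse)

# Corners of a hypothetical 128x128 patch.
corners = [(0, 0), (127, 0), (127, 127), (0, 127)]
coarse_err = mean_corner_error(h_coarse, h_true, corners)
refined_err = mean_corner_error(h_refined, h_true, corners)
```

In this toy case the coarse estimate is off by one pixel along each axis at every corner, and composing the residual recovers the ground-truth translation exactly. Real refinement stages would only shrink the error, not remove it.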
