Scene text detection aims to accurately localize text instances in images captured in complex environments. Its performance depends heavily on precise delineation of text boundaries and on reliably discriminating text from cluttered backgrounds. Existing methods still struggle in such scenes for two reasons. First, repeated downsampling gradually biases features toward low-frequency components, weakening the edge details and local structures that are critical to text morphology. Second, semantic information and local details are often modeled independently, and this lack of coordination leaves high-frequency responses vulnerable to background noise. To address these issues, we propose HFI-Former, a Transformer-based model designed for high-frequency enhancement and feature interaction. The framework consists of multi-scale feature extraction, frequency-enhanced representation, semantic-guided feature interaction, and deformable Transformer encoding. Frequency-domain enhancement preserves the high-frequency structural features degraded by repeated downsampling, and semantic-aware feature interaction injects global context to regulate multi-scale feature fusion. Experiments on CTW1500, Total-Text, and ICDAR2015 demonstrate competitive boundary localization accuracy and strong overall detection performance in complex scenes.
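To make the idea of frequency-domain enhancement concrete, the following is a minimal sketch and not the HFI-Former module itself, whose design is not specified in the abstract. It assumes a single-channel 2D feature map and a simple radial high-pass mask; the function name `high_frequency_boost` and the parameters `cutoff` and `gain` are hypothetical and chosen only for illustration.

```python
import numpy as np


def high_frequency_boost(feature, cutoff=0.25, gain=1.0):
    """Amplify the high-frequency content of a 2D feature map.

    feature: array of shape (H, W).
    cutoff:  radial frequency (cycles per sample) above which a component
             is treated as high-frequency (illustrative assumption).
    gain:    extra weight added to high-frequency components; 0 is a no-op.
    """
    h, w = feature.shape
    # Per-axis frequencies in cycles per sample, each in [-0.5, 0.5).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    # Radially symmetric mask: 1 for high frequencies, 0 for low ones.
    high_mask = (radius > cutoff).astype(float)

    # Scale high-frequency coefficients by (1 + gain), leave the rest alone.
    spectrum = np.fft.fft2(feature)
    boosted = spectrum * (1.0 + gain * high_mask)

    # The mask is symmetric, so a real input yields a real output up to
    # floating-point error; drop the negligible imaginary part.
    return np.real(np.fft.ifft2(boosted))


# Usage: a flat region is left unchanged, while edge-like detail is amplified.
flat = np.ones((8, 8))
rows, cols = np.indices((8, 8))
checkerboard = (-1.0) ** (rows + cols)  # highest-frequency pattern on the grid

flat_out = high_frequency_boost(flat)
edges_out = high_frequency_boost(checkerboard, gain=1.0)
```

In a learned model the fixed mask and scalar gain would presumably be replaced by trainable, per-channel weights; the sketch only shows why reweighting the spectrum can restore the edge detail that downsampling suppresses.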
Yubing Gao
Quanli Gao
Lianhe Shao
Information
Xi'an Polytechnic University
Gao et al. studied this question.
www.synapsesocial.com/papers/69df2c2fe4eeef8a2a6b134d — DOI: https://doi.org/10.3390/info17040365