April 15, 2026 · Open Access

HFI-Former: High-Frequency Interaction Transformer for Robust Scene Text Detection

Key Points

The aim is to improve scene text detection, especially in complex environments with cluttered backgrounds.
Developed a Transformer-based model called HFI-Former.
Implemented multi-scale feature extraction to capture various detail levels.
Introduced frequency-domain enhancement to preserve high-frequency structural features that repeated downsampling would otherwise degrade.
Used semantic-aware feature interaction to inject global context that regulates multi-scale feature fusion.
Achieved competitive boundary localization accuracy on three datasets: CTW1500, Total-Text, and ICDAR1500.
Showed strong overall text detection performance in complex scenes.
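The frequency-domain enhancement idea can be illustrated with a minimal NumPy sketch. The paper does not describe its enhancement block at this level of detail here, so this is not the HFI-Former module. The circular high-pass mask, the `cutoff` and `gain` parameters, and the helper names `high_pass_mask`, `frequency_enhance`, and `band_energy` are hypothetical choices. The sketch transforms each channel of a feature map with a 2D FFT, amplifies coefficients above a radial frequency cutoff, and transforms back.

```python
import numpy as np


def high_pass_mask(h: int, w: int, cutoff: float) -> np.ndarray:
    """True where the radial frequency (cycles per sample) exceeds `cutoff`.

    Uses the unshifted layout returned by np.fft.fft2, so no fftshift is needed.
    """
    fy = np.fft.fftfreq(h)
    fx = np.fft.fftfreq(w)
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    return radius > cutoff


def frequency_enhance(feat: np.ndarray, cutoff: float = 0.15, gain: float = 1.5) -> np.ndarray:
    """Boost the high-frequency band of every channel of a (C, H, W) feature map.

    Hypothetical stand-in for a frequency-domain enhancement block:
    FFT -> multiply high-frequency coefficients by `gain` -> inverse FFT.
    The low-frequency band, including the DC term (channel mean), is unchanged.
    """
    h, w = feat.shape[-2:]
    spectrum = np.fft.fft2(feat, axes=(-2, -1))
    scale = np.where(high_pass_mask(h, w, cutoff), gain, 1.0)
    # The mask depends only on |frequency|, so it is conjugate-symmetric and the
    # inverse transform is real up to rounding error; drop the tiny imaginary part.
    return np.fft.ifft2(spectrum * scale, axes=(-2, -1)).real


def band_energy(feat: np.ndarray, cutoff: float) -> float:
    """Total spectral energy above `cutoff`, summed over all channels."""
    h, w = feat.shape[-2:]
    energy = np.abs(np.fft.fft2(feat, axes=(-2, -1))) ** 2
    return float((energy * high_pass_mask(h, w, cutoff)).sum())


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32, 32))  # toy feature map: 4 channels, 32x32
y = frequency_enhance(x, cutoff=0.15, gain=1.5)
print(band_energy(y, 0.15) / band_energy(x, 0.15))  # ~2.25, i.e. gain ** 2
```

A fixed gain keeps the example easy to verify: high-band energy scales by exactly `gain ** 2`, and channel means are preserved. A learned module would more plausibly predict the scaling per frequency or per channel.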

Abstract

Scene text detection aims to accurately localize text instances in images captured in complex environments. Its performance depends heavily on precise text boundary delineation and reliable semantic discrimination of text from cluttered backgrounds. However, existing methods still struggle in such complex scenes. Repeated downsampling gradually biases features toward low-frequency components, thereby weakening the edge details and local structures that are critical to text morphology. Additionally, semantic information and local details are often modeled independently, and this lack of coordination leaves high-frequency responses vulnerable to background noise. To address these issues, we propose HFI-Former, a Transformer-based model designed for high-frequency enhancement and feature interaction. The framework consists of multi-scale feature extraction, frequency-enhanced representation, semantic-guided feature interaction, and deformable Transformer encoding. Frequency-domain enhancement is introduced to preserve high-frequency structural features degraded by repeated downsampling. Semantic-aware feature interaction further injects global context to regulate multi-scale feature fusion. Experiments on CTW1500, Total-Text, and ICDAR1500 demonstrate competitive boundary localization accuracy and strong overall detection performance in complex scenes.
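The abstract describes semantic-aware feature interaction only as injecting global context to regulate multi-scale feature fusion. One common way to realize that general idea is channel gating driven by a pooled global descriptor, similar in spirit to squeeze-and-excitation. The following minimal sketch uses that substitute technique. Pooling the coarsest level as the context, sigmoid gates, nearest-neighbour upsampling, random weight matrices, and the names `semantic_gated_fusion` and `upsample_nearest` are all hypothetical assumptions, not the HFI-Former design.

```python
import numpy as np


def upsample_nearest(x: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def semantic_gated_fusion(levels: list[np.ndarray], weights: list[np.ndarray]) -> np.ndarray:
    """Fuse (C, H_i, W_i) maps, finest first, with gates computed from global context.

    levels:  feature maps; each coarser level is assumed to be an integer
             downsampling of levels[0] (hypothetical assumption).
    weights: one (C, C) matrix per level, standing in for learned parameters.
    """
    target = levels[0].shape
    # Global semantic descriptor: average-pool the coarsest (most semantic) level.
    context = levels[-1].mean(axis=(1, 2))  # shape (C,)
    fused = np.zeros(target, dtype=float)
    for feat, w in zip(levels, weights):
        resized = upsample_nearest(feat, target[1] // feat.shape[1])
        if resized.shape != target:
            raise ValueError(f"level of shape {feat.shape} does not upsample to {target}")
        gate = sigmoid(w @ context)  # per-channel gate in (0, 1) for this level
        fused += gate[:, None, None] * resized
    return fused


rng = np.random.default_rng(1)
levels = [rng.standard_normal((8, s, s)) for s in (32, 16, 8)]
weights = [rng.standard_normal((8, 8)) for _ in levels]
print(semantic_gated_fusion(levels, weights).shape)  # (8, 32, 32)
```

Gates near zero let the global context suppress a level's channels where background clutter dominates, while gates near one pass local detail through. In a trained model the weight matrices would be learned rather than sampled at random.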

Authors

Yubing Gao

Quanli Gao

Lianhe Shao

Journals

Information

Institutions

Xi'an Polytechnic University
