Anti-unmanned aerial vehicle (Anti-UAV) detection is critical for airspace security, yet existing single-modality approaches suffer from severe performance degradation under adverse illumination, thermal crossover, and extreme scale variation. In this paper, we propose CSFADet, a dual-modal detection framework that jointly exploits visible and infrared imagery through four tightly integrated modules. First, a Cross-Spectral Feature Alignment (CSFA) module performs early-stage spectral calibration by computing cross-modal query–value attention maps, generating modality-aware channel descriptors that re-weight and concatenate the two spectral streams. Second, a Dual-path Texture Enhancement Module (DTEM) enriches fine-grained spatial details via cascaded convolutions with residual connections. Third, a Dual-path Cross-Attention Module (DCAM) introduces a feature-shrinking token generation strategy followed by symmetric cross-attention branches with learnable scaling factors, Squeeze-and-Excitation recalibration, and a 1×1 convolution fusion head, enabling deep bidirectional interaction between modalities. Fourth, a Dual-path Information Refinement Module (DIRM) embeds Adaptive Residual Groups (ARGs) that cascade Multi-modal Spatial Attention Blocks (MSABs) with channel and dynamic spatial attention, culminating in a Multi-scale Scale-aware Fusion Refinement (MSFR) unit that employs three parallel multi-head attention branches with a Scale Reasoning Gate and Channel Fusion Layer to produce scale-discriminative enhanced features. Experiments on the public Anti-UAV300 benchmark show that CSFADet achieves 91.4% mAP@0.5 and 58.7% mAP@0.5:0.95, surpassing fifteen representative detectors spanning single-stage, two-stage, YOLO-family, and Transformer-based categories. Ablation studies confirm the complementary contributions of each module, and heatmap visualizations verify the model’s capacity to focus on small, distant UAV targets under challenging conditions.
Yuan et al. studied this question.
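To make the idea of symmetric cross-attention branches with learnable scaling factors more concrete, the following is a minimal pure-Python sketch, not the authors' DCAM implementation. Several things here are assumptions: the function names (`cross_attend`, `symmetric_cross_attention`), the use of identity query/key/value projections in place of learned ones, the residual form `x + gamma * attention`, and the fixed `gamma_rgb` / `gamma_ir` values that stand in for learnable scalars. The feature-shrinking token generation, Squeeze-and-Excitation recalibration, and 1×1 convolution fusion head mentioned in the abstract are omitted; plain channel concatenation stands in for the fusion step.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: each query token attends over context tokens.

    Assumption: identity projections are used, so the context tokens serve as
    both keys and values. A trained module would learn these projections.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[c] for w, v in zip(weights, context)) for c in range(dim)]
        )
    return attended


def symmetric_cross_attention(rgb_tokens, ir_tokens, gamma_rgb=0.5, gamma_ir=0.5):
    """Two mirrored branches: visible attends to infrared, and vice versa.

    gamma_rgb and gamma_ir are hypothetical stand-ins for learnable scaling
    factors; each scales the cross-modal signal before a residual addition.
    """
    rgb_from_ir = cross_attend(rgb_tokens, ir_tokens)
    ir_from_rgb = cross_attend(ir_tokens, rgb_tokens)
    rgb_out = [
        [x + gamma_rgb * y for x, y in zip(tok, att)]
        for tok, att in zip(rgb_tokens, rgb_from_ir)
    ]
    ir_out = [
        [x + gamma_ir * y for x, y in zip(tok, att)]
        for tok, att in zip(ir_tokens, ir_from_rgb)
    ]
    # Placeholder fusion: concatenate the channels of paired tokens.
    # Requires both modalities to have the same number of tokens.
    fused = [r + i for r, i in zip(rgb_out, ir_out)]
    return rgb_out, ir_out, fused


if __name__ == "__main__":
    rgb = [[1.0, 0.0], [0.0, 1.0]]   # two visible tokens, two channels each
    ir = [[0.5, 0.5], [1.0, -1.0]]   # two infrared tokens, two channels each
    rgb_out, ir_out, fused = symmetric_cross_attention(rgb, ir)
    print(len(fused), len(fused[0]))
```

With both scaling factors set to zero, each branch returns its input unchanged, which is one reason such factors are often initialized near zero: the network starts from the single-modality features and learns how much cross-modal signal to admit.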