UAV-based object detection faces critical challenges, including extreme scale variation (targets occupy only 0.1–2% of the image area), the complexity of bird’s-eye viewpoints, and all-weather operational demands. RGB sensors alone degrade under poor illumination, while infrared sensors lack spatial detail. We propose ATM-Net, a lightweight multimodal RGB–infrared fusion network for robust UAV vehicle detection. ATM-Net integrates three innovations: (1) an Asymmetric Recurrent Fusion Module (ARFM), which performs “extraction→fusion→separation” cycles across pyramid levels to balance cross-modal collaboration with modality independence; (2) Tri-Dimensional Attention (TDA), which recalibrates features through orthogonal Channel-Width, Height-Channel, and Height-Width branches for multi-dimensional feature enhancement; and (3) a Multi-scale Adaptive Feature Pyramid Network (MAFPN), which constructs enhanced representations via bidirectional flow and multi-path aggregation. Experiments on the VEDAI and DroneVehicle datasets demonstrate strong performance: 92.4% mAP50 and 64.7% mAP50-95 on VEDAI and 83.7% mAP on DroneVehicle, with only 4.83M parameters. ATM-Net thus offers a favorable accuracy–efficiency balance for resource-constrained UAV edge platforms.
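To make the scale-variation figure concrete, the short sketch below converts the stated 0.1–2% area range into an approximate target side length in pixels. The 1024×1024 input resolution and the helper name `target_side_px` are illustrative assumptions, not values taken from the abstract, and targets are treated as square for simplicity.

```python
import math


def target_side_px(image_w, image_h, area_fraction):
    """Side length in pixels of a square target covering `area_fraction` of the image."""
    return math.sqrt(image_w * image_h * area_fraction)


# Assumed input resolution (hypothetical; the abstract does not state one).
W, H = 1024, 1024

# Area fractions quoted in the abstract: 0.1% and 2% of the image.
for frac in (0.001, 0.02):
    side = target_side_px(W, H, frac)
    print(f"{frac:.1%} of a {W}x{H} image: about {side:.0f} px per side")
```

Under these assumptions the smallest targets are roughly 32 px on a side and the largest roughly 145 px, so the detector must handle about a 4.5× spread in linear size (the square root of the 20× spread in area).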
Chen et al. (Tue,) studied this question.