High-resolution remote sensing image (HRRSI) scene classification is challenged by large variations in target scale, complex background interference, and the difficulty of spatially parsing dense objects (such as tightly packed buildings in dense residential areas or scattered aircraft on aprons). Existing models also struggle to balance computational efficiency against classification accuracy. To address these issues, this paper proposes a lightweight Multi-Scale Selective Dilated Attention Residual Network (MS-DARNet). The model uses a Multi-branch Dilated Feature Extraction (MDFE) module whose parallel convolutional branches employ different dilation rates to dynamically expand the receptive field and jointly extract multi-scale features without increasing the parameter count. A Context-Position Aware Attention (CPAA) module is further introduced; it combines a large-kernel decomposition strategy, which suppresses irrelevant background noise, with direction-aware feature aggregation, which retains precise spatial coordinates for dense objects. Extensive experiments on the AID, NWPU-RESISC45, and RSD-WHU46 datasets show that MS-DARNet achieves classification accuracies of 97.78%, 94.53%, and 94.55%, respectively, while requiring only 2.50 M parameters and 0.5940 GMACs. These results indicate that MS-DARNet achieves a favorable balance between a lightweight architecture and high classification accuracy for complex remote sensing scenes.
Huang et al. (Sun,) studied this question.
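The claim that dilated branches enlarge the receptive field without adding parameters can be checked with simple arithmetic. The following is a minimal sketch, not the MDFE implementation: the 3x3 kernel, the dilation rates (1, 2, 3), and the 64-channel width are illustrative assumptions that do not come from the paper. It relies only on the standard relation for a dilated convolution, k_eff = k + (k - 1)(d - 1), and on the fact that the weight count of a convolution depends on the kernel size but not on the dilation rate.

```python
# Illustrative arithmetic only; kernel size, dilation rates, and channel
# widths are assumed values, not the configuration used by MS-DARNet.

ASSUMED_DILATION_RATES = (1, 2, 3)


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by one dilated kernel: k + (k - 1)(d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_weight_count(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights of a standard 2D convolution (bias excluded).

    The dilation rate does not appear here: it changes where the kernel
    samples the input, not how many weights the kernel has.
    """
    return in_channels * out_channels * kernel_size * kernel_size


def summarize_branches(in_channels=64, out_channels=64, kernel_size=3,
                       rates=ASSUMED_DILATION_RATES):
    """Return (dilation, effective kernel size, weight count) per branch."""
    return [
        (d,
         effective_kernel_size(kernel_size, d),
         conv_weight_count(in_channels, out_channels, kernel_size))
        for d in rates
    ]


if __name__ == "__main__":
    for dilation, k_eff, weights in summarize_branches():
        print(f"dilation={dilation}: covers {k_eff}x{k_eff}, weights={weights}")
```

Under these assumptions, each branch holds the same number of weights while covering a 3x3, 5x5, or 7x7 window. Note that every additional parallel branch still contributes its own weights; the saving is relative to using undilated kernels of the larger sizes.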