Fine-grained land classification provides low-cost, efficient data support for land resource monitoring, agricultural assessment, and ecological protection. However, methods that classify field-captured images are often limited by the complexity of land categories, variations in illumination and viewing angle, seasonal change, and the absence of spatiotemporal metadata. Because crops vary strongly with the season and the categories of adjacent land parcels are strongly correlated, this paper proposes a robust Spatiotemporal Fusion Network (STF-Net) that fuses images with spatiotemporal metadata for fine-grained land classification. STF-Net consists of a visual backbone network, a spatiotemporal metadata encoder, a cross-modal multi-head attention fusion module, and a fallback branch for cases where metadata is missing. The model is robust to missing metadata and can be used with different visual backbones, such as the Swin Transformer and EfficientNet. Experiments on a dataset of land use photos covering 91 categories show that STF-Net achieves an overall accuracy of 93.54% and an F1-score of 0.92, significantly outperforming baseline models. Ablation studies further validate the necessity of fusing spatiotemporal metadata.
Ye et al. (Fri,) studied this question.
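The fusion and fallback behaviour described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`encode_metadata`, `cross_attention`, `fuse`), the sine/cosine encoding of latitude, longitude, and day of year, and the absence of learned projection weights are all assumptions made for illustration. The image feature acts as the query and attends over two tokens (itself and the encoded metadata); when no metadata is supplied, the image feature is returned unchanged.

```python
import math


def encode_metadata(lat, lon, day_of_year):
    """Hypothetical metadata encoder (an assumption, not from the paper):
    maps location and date to six periodic features."""
    lat_r, lon_r = math.radians(lat), math.radians(lon)
    phase = 2.0 * math.pi * day_of_year / 365.0
    return [math.sin(lat_r), math.cos(lat_r),
            math.sin(lon_r), math.cos(lon_r),
            math.sin(phase), math.cos(phase)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, tokens, num_heads):
    """Multi-head scaled dot-product attention without learned projections.
    A trained model would apply learned linear maps to queries, keys, and values;
    here the raw vectors are used directly to keep the sketch short."""
    dim = len(query)
    assert dim % num_heads == 0, "feature size must divide evenly into heads"
    head_dim = dim // num_heads
    out = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = query[lo:hi]
        keys = [t[lo:hi] for t in tokens]  # keys double as values in this sketch
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim)
                  for k in keys]
        weights = softmax(scores)
        out.extend(sum(w * k[i] for w, k in zip(weights, keys))
                   for i in range(head_dim))
    return out


def fuse(image_feat, metadata=None, num_heads=2):
    """Fuse an image feature with (lat, lon, day_of_year) metadata.
    Fallback: with no metadata, return the image feature unchanged."""
    if metadata is None:
        return list(image_feat)
    meta = encode_metadata(*metadata)
    # Assumption: the image feature already has the same size as the metadata vector.
    assert len(image_feat) == len(meta)
    return cross_attention(image_feat, [list(image_feat), meta], num_heads)


feat = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
print(fuse(feat))                               # image-only fallback path
print(fuse(feat, metadata=(30.0, 120.0, 200)))  # fused representation
```

In a trained network the backbone output and the metadata vector would each pass through learned layers before attention, so the fixed feature size of six used here is only a convenience of the sketch.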