What question did this study set out to answer?

The aim is to develop an effective land classification method that integrates images with spatiotemporal metadata.

April 12, 2026 · Open Access

STF-Net: A Robust Fine-Grained Land Classification Method Fusing Images and Spatiotemporal Metadata

Key Points

Developed the Spatiotemporal Fusion Network (STF-Net) for fine-grained land classification.
Incorporated a visual backbone network and a spatiotemporal metadata encoder.
Utilized a cross-modal multi-head attention fusion module.
Included a fallback branch for cases with missing metadata.
Evaluated the model's performance on a dataset with 91 categories of land use.
Achieved an overall accuracy of 93.54% in land classification.
Obtained an F1-score of 0.92, outperforming baseline models.
Ablation studies confirmed the importance of fusing spatiotemporal metadata.
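
The two headline metrics above can be illustrated with a small sketch. The summary does not say whether the reported F1-score is macro-averaged or weighted, so the macro average used here is an assumption, and the class names and labels are hypothetical toy values rather than data from the study.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true.

    Assumption: macro averaging; the paper may use a different average.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in set(y_true):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not taken from the 91-category dataset.
y_true = ["paddy", "paddy", "orchard", "forest"]
y_pred = ["paddy", "orchard", "orchard", "forest"]

acc = overall_accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

With many classes of unequal size, overall accuracy and macro F1 can diverge, which is one reason a study might report both.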

Abstract

Fine-grained land classification provides low-cost, efficient data support for land resource monitoring, agricultural assessment, and ecological protection. However, classification methods based on field-captured images are often limited by the complexity of land categories, variations in illumination and viewing angle, seasonal changes, and the absence of spatiotemporal metadata. To exploit the pronounced seasonal variation of crops and the strong correlation between the categories of adjacent land parcels, this paper proposes the Spatiotemporal Fusion Network (STF-Net), a robust method for fine-grained land classification that fuses images with spatiotemporal metadata. STF-Net consists of a visual backbone network, a spatiotemporal metadata encoder, a cross-modal multi-head attention fusion module, and a fallback branch for cases where metadata is missing. The model is robust to missing metadata and can be paired with different visual backbone networks, such as the Swin Transformer and EfficientNet. Experiments on a dataset of land use photos covering 91 categories show that STF-Net achieves an overall accuracy of 93.54% and an F1-score of 0.92, significantly outperforming baseline models. Ablation studies further validate the necessity of fusing spatiotemporal metadata.
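
To make the fusion idea concrete, here is a minimal sketch of attention-based fusion with a fallback for missing metadata. It is not the authors' implementation: it uses a single attention head with no learned projections, a cyclic sine/cosine encoding chosen for illustration, and an identity fallback, whereas the paper describes a multi-head module and a dedicated fallback branch. The function names `encode_metadata` and `fuse`, the feature size of 4, and the sample coordinates are all hypothetical.

```python
import math


def encode_metadata(lat, lon, day_of_year):
    """Encode location and date as two 4-dimensional cyclic tokens.

    Returns None when any field is missing so the caller can fall back.
    Assumption: sine/cosine encoding; the paper's encoder may differ.
    """
    if lat is None or lon is None or day_of_year is None:
        return None
    lat_r, lon_r = math.radians(lat), math.radians(lon)
    s = 2 * math.pi * day_of_year / 365.0
    location = [math.sin(lat_r), math.cos(lat_r),
                math.sin(lon_r), math.cos(lon_r)]
    season = [math.sin(s), math.cos(s), math.sin(2 * s), math.cos(2 * s)]
    return [location, season]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(image_feat, tokens):
    """Single-head attention: the image feature queries the metadata tokens.

    With no tokens, the image feature is returned unchanged (identity
    fallback, a simplification of a learned fallback branch).
    """
    if tokens is None:
        return list(image_feat)
    d = len(image_feat)
    scores = [sum(q * k for q, k in zip(image_feat, tok)) / math.sqrt(d)
              for tok in tokens]
    weights = softmax(scores)
    context = [sum(w * tok[i] for w, tok in zip(weights, tokens))
               for i in range(d)]
    # Residual connection keeps the visual signal dominant.
    return [f + c for f, c in zip(image_feat, context)]


# Hypothetical 4-dimensional image feature and sample metadata.
image_feat = [0.2, -0.1, 0.4, 0.3]
fused = fuse(image_feat, encode_metadata(30.5, 114.3, 200))
fallback = fuse(image_feat, encode_metadata(None, None, 200))
```

The cyclic encoding places day 365 next to day 1, which suits seasonal signals, and the identity fallback means a photo without metadata is still classified from its visual feature alone.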

Ye et al. studied this question.

synapsesocial.com/papers/69db38274fe01fead37c650f https://doi.org/10.3390/electronics15081592
