March 3, 2026 · Open Access

Multi-view soccer foul recognition using spatial-temporal query network

Key Points

The Spatial-Temporal Query Network (STQNet) recognizes soccer fouls more accurately than existing methods.
STQNet improves classification accuracy by up to 15% in evaluations on the SoccerNet-MVFouls dataset.
The approach uses a dual-branch spatial-temporal query decoder to interpret multi-view video data.
These advances may support better officiating and decision-making in soccer.

Abstract

Multi-view soccer foul recognition aims to classify foul actions and their severity from multi-view video data. However, existing methods struggle with multi-view feature fusion and spatial-temporal modeling, which limits their ability to interpret critical actions accurately. To address these limitations, we propose the Spatial-Temporal Query Network (STQNet), a novel method for multi-view soccer foul recognition that improves both foul action classification and severity estimation. First, we adapt an existing Vision Transformer encoder to extract multi-view spatial-temporal embeddings. Then, we introduce a dual-branch spatial-temporal query decoder that uses learnable action and severity queries to search the corresponding visual embeddings for foul cues. Finally, dual classification heads predict the foul action and its severity. Experimental results on the SoccerNet-MVFouls dataset demonstrate that STQNet outperforms existing methods.
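The query-based decoding step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each branch is a single learnable query that cross-attends (single-head scaled dot-product attention) over tokens pooled from all camera views, followed by a linear classification head. The function names (`cross_attend`, `make_params`, `stq_forward`), the random initialization, and the class counts passed in are hypothetical placeholders; the encoder is replaced by precomputed token vectors.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, tokens):
    """One query attends over a list of token vectors.

    Returns the attention-weighted sum of the tokens and the weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    pooled = [
        sum(w * tok[d] for w, tok in zip(weights, tokens))
        for d in range(len(query))
    ]
    return pooled, weights


def linear(x, weight):
    """Matrix-vector product; `weight` has one row per output class."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def make_params(dim, n_actions, n_severities, rng):
    """Randomly initialized stand-ins for learned parameters (hypothetical)."""
    def vec(n):
        return [rng.gauss(0.0, 0.1) for _ in range(n)]

    return {
        "action_query": vec(dim),
        "severity_query": vec(dim),
        "action_head": [vec(dim) for _ in range(n_actions)],
        "severity_head": [vec(dim) for _ in range(n_severities)],
    }


def stq_forward(view_tokens, params):
    """Two query branches over multi-view tokens, then two classification heads.

    `view_tokens` is a list of views; each view is a list of token vectors
    that an encoder would have produced (assumed precomputed here).
    """
    # Assumption: fuse views by concatenating their tokens into one sequence.
    tokens = [tok for view in view_tokens for tok in view]
    action_feat, _ = cross_attend(params["action_query"], tokens)
    severity_feat, _ = cross_attend(params["severity_query"], tokens)
    action_probs = softmax(linear(action_feat, params["action_head"]))
    severity_probs = softmax(linear(severity_feat, params["severity_head"]))
    return action_probs, severity_probs
```

Calling `stq_forward` with, for example, two views of five 8-dimensional tokens each returns one probability distribution over action classes and one over severity classes. Giving each task its own query lets the two branches weight the same visual tokens differently, which is the intuition behind searching for task-specific foul cues.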
