Multi-view soccer foul recognition aims to classify foul actions and their severity from multi-view video. However, existing methods struggle to fuse multi-view features effectively and to model spatial-temporal structure, which limits their ability to interpret critical actions accurately. To address these limitations, we propose the Spatial-Temporal Query Network (STQNet), a novel method for multi-view soccer foul recognition that improves both foul action classification and severity estimation. First, we adapt an existing Vision Transformer encoder to extract multi-view spatial-temporal embeddings. Then, we introduce a dual-branch spatial-temporal query decoder that uses learnable action and severity queries to search the corresponding visual embeddings for foul cues. Finally, dual classification heads predict the foul action and its severity. Experimental results on the SoccerNet-MVFouls dataset demonstrate that STQNet outperforms existing methods.
Pu et al. studied this question.
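To make the query-based decoding idea concrete, the following is a minimal, dependency-free sketch and not the STQNet implementation. It assumes that the encoder output is a flat list of token vectors (one per view and time step), that each task uses a single learnable query, and that each query pools the tokens with single-head scaled dot-product attention before a linear classification head. The names `DualQueryDecoder` and `query_attend`, the embedding size, and the class counts are hypothetical placeholders; the abstract does not specify the internal structure of the dual branches.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def query_attend(query, tokens):
    """Single-head cross-attention: one query vector pools a list of tokens."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    pooled = [
        sum(w * t[i] for w, t in zip(weights, tokens))
        for i in range(len(query))
    ]
    return pooled, weights


def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output class."""
    return [dot(row, x) + b for row, b in zip(weight, bias)]


class DualQueryDecoder:
    """Hypothetical stand-in: one query and one head per task."""

    def __init__(self, dim, num_actions, num_severities, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # In a trained model these would be learned; here they are random.
        self.action_query = init(1, dim)[0]
        self.severity_query = init(1, dim)[0]
        self.action_head = init(num_actions, dim)
        self.severity_head = init(num_severities, dim)
        self.action_bias = [0.0] * num_actions
        self.severity_bias = [0.0] * num_severities

    def forward(self, tokens):
        # Each query searches the same visual tokens for its own cues.
        action_feat, _ = query_attend(self.action_query, tokens)
        severity_feat, _ = query_attend(self.severity_query, tokens)
        action_probs = softmax(linear(action_feat, self.action_head, self.action_bias))
        severity_probs = softmax(
            linear(severity_feat, self.severity_head, self.severity_bias)
        )
        return action_probs, severity_probs


# Usage: random vectors stand in for encoder embeddings (2 views x 4 time steps).
# All sizes below are illustrative placeholders, not values from the text.
DIM, NUM_VIEWS, NUM_STEPS = 16, 2, 4
rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_VIEWS * NUM_STEPS)]
decoder = DualQueryDecoder(dim=DIM, num_actions=8, num_severities=4)
action_probs, severity_probs = decoder.forward(tokens)
```

Keeping separate queries lets each task weight views and time steps differently over the same embeddings, which is one plausible reading of why the decoder uses task-specific queries rather than a single pooled feature.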