What question did this study set out to answer?

The research aims to enhance weakly supervised temporal action localization by improving action completeness and accuracy.

April 17, 2026

Semantic Prototype Guided Sparse Temporal Interaction for Weakly Supervised Temporal Action Localization

Key Points

The research aims to enhance weakly supervised temporal action localization by improving action completeness and accuracy.
Semantic prototypes enrich video representations with category-level action cues aggregated across videos.
A prototype contrastive loss improves feature discriminability.
A sparse temporal interaction unit jointly models short-term context and long-range dependencies.
A boundary-guided loss constrains semantic responses around action boundaries for sharper transitions.
S2Net localizes actions more accurately than previous methods.
Experiments on THUMOS14 and ActivityNet1.3 show more complete detection of action segments.

Abstract

Weakly supervised temporal action localization (WTAL) aims to detect action segments in untrimmed videos using only video-level labels. Existing methods typically follow the multi-instance learning (MIL) paradigm with a top-k strategy, which often results in incomplete action localization. Moreover, the local and discontinuous nature of actions leaves action segments isolated, without sufficient temporal interaction. To address these issues, this paper introduces semantic prototypes to enrich video representations, enabling the model to aggregate category-level action cues across videos and recover semantically relevant but weakly activated segments, thereby improving action completeness. A prototype contrastive loss is further employed to improve feature discriminability. In addition, a sparse temporal interaction unit is designed to jointly model short-term context and long-range dependencies. A boundary-guided loss uses the temporal interaction outputs to explicitly constrain semantic responses around action boundaries, promoting sharp and temporally consistent transitions. Building on these components, this paper proposes a semantic prototype guided sparse temporal interaction network (S2Net), which unifies video modeling from full semantic understanding to fine-grained boundary perception. Extensive experiments on THUMOS14 and ActivityNet1.3 demonstrate that S2Net achieves more accurate and complete action localization.
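The top-k MIL strategy mentioned in the abstract can be illustrated with a minimal sketch. This is the generic baseline that the abstract criticizes, not S2Net itself, and it is written under assumptions: the function name `topk_mil_video_scores`, the list-of-lists layout of the class activation sequence, and the toy scores are all hypothetical and chosen only for illustration.

```python
def topk_mil_video_scores(cas, k):
    """Aggregate snippet-level class scores into video-level scores.

    cas: list of T snippets, each a list of C class scores
         (an assumed layout for a class activation sequence).
    k:   number of highest-scoring snippets averaged per class.
    """
    num_classes = len(cas[0])
    video_scores = []
    for c in range(num_classes):
        # Rank all snippets by their score for class c, highest first.
        ranked = sorted((snippet[c] for snippet in cas), reverse=True)
        top = ranked[:k]
        # Only the k strongest snippets contribute to the video-level score.
        video_scores.append(sum(top) / len(top))
    return video_scores


# Toy example: 4 snippets, 2 classes (values are made up).
cas = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.3], [0.2, 0.1]]
scores = topk_mil_video_scores(cas, k=2)

# Raising a weakly activated snippet that stays outside the top-k
# leaves the video-level score unchanged.
cas_weak = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.3], [0.2, 0.1]]
scores_weak = topk_mil_video_scores(cas_weak, k=2)
```

Because the video-level score depends only on the k strongest snippets, a loss computed on it gives no training signal to weakly activated parts of the same action, which is one way to read the incompleteness problem the abstract describes.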

Authors

Yì Wáng

Dehui Kong

Jinghua Li

Journals

ACM Transactions on Multimedia Computing, Communications, and Applications

Institutions

Beijing Academy of Artificial Intelligence
