Weakly supervised temporal action localization (WTAL) aims to detect action segments in untrimmed videos using only video-level labels. Existing methods typically follow the multi-instance learning (MIL) paradigm with a top-k strategy, which often results in incomplete action localization. Moreover, because actions are local and discontinuous, action segments tend to be isolated and lack sufficient temporal interaction. To address these issues, this paper introduces semantic prototypes to enrich video representations, enabling the model to aggregate category-level action cues across videos and recover semantically relevant but weakly activated segments, thereby improving action completeness. A prototype contrastive loss is further employed to improve feature discriminability. In addition, a sparse temporal interaction unit is designed to jointly model short-term context and long-range dependencies. A boundary-guided loss uses the temporal interaction outputs to explicitly constrain semantic responses around action boundaries, promoting sharp and temporally consistent transitions. Building on these components, this paper proposes a semantic prototype guided sparse temporal interaction network (S2Net), which unifies video modeling from full semantic understanding to fine-grained boundary perception. Extensive experiments on THUMOS14 and ActivityNet1.3 demonstrate that S2Net achieves more accurate and complete action localization.
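The top-k MIL strategy mentioned in the abstract can be illustrated with a minimal sketch. This is not the S2Net implementation; it shows only the generic baseline idea, assuming a class activation sequence `cas[t][c]` (snippet `t`, class `c`) and a hypothetical choice of `k`. All function names are illustrative. Because the video-level score depends only on the k strongest snippets per class, weakly activated but relevant snippets receive little supervision, which is one common explanation for incomplete localization.

```python
import math
from typing import List


def topk_mil_video_scores(cas: List[List[float]], k: int) -> List[float]:
    """Aggregate snippet-level activations into video-level class scores.

    cas[t][c] is the activation of snippet t for class c (assumed layout).
    For each class, the score is the mean of its k largest activations.
    """
    if not cas:
        raise ValueError("cas must contain at least one snippet")
    num_classes = len(cas[0])
    k = max(1, min(k, len(cas)))  # clamp k to the number of snippets
    scores = []
    for c in range(num_classes):
        column = sorted((row[c] for row in cas), reverse=True)
        scores.append(sum(column[:k]) / k)
    return scores


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def video_level_loss(cas: List[List[float]], labels: List[int], k: int) -> float:
    """Cross-entropy between softmaxed top-k scores and normalized video labels."""
    probs = softmax(topk_mil_video_scores(cas, k))
    norm = sum(labels)
    if norm == 0:
        raise ValueError("labels must contain at least one positive class")
    return -sum((y / norm) * math.log(p) for y, p in zip(labels, probs) if y > 0)


# Toy example: 4 snippets, 2 classes; only two snippets fire for class 0.
toy_cas = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [0.0, 1.0]]
toy_scores = topk_mil_video_scores(toy_cas, k=2)  # [3.0, 1.0]
toy_loss = video_level_loss(toy_cas, labels=[1, 0], k=2)
```

In the toy example, the class-0 score is determined entirely by the two strongest snippets, so the remaining snippets could change without affecting the loss.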
Yì Wáng
Dehui Kong
Jinghua Li
ACM Transactions on Multimedia Computing Communications and Applications
Beijing Academy of Artificial Intelligence
www.synapsesocial.com/papers/69e1cffa5cdc762e9d8590fc — DOI: https://doi.org/10.1145/3807956