April 3, 2026

Open Access

Hierarchical Prototype Alignment for Video Temporal Grounding

Key Points

Aims to improve video temporal grounding by aligning video content with textual descriptions more effectively.
Proposes a hierarchical prototype alignment approach for video-text modeling.
Decomposes the alignment into object-phrase and event-sentence alignment stages.
Aggregates local visual regions and textual words to construct object and phrase prototypes.
Integrates object prototypes along the temporal dimension to form event prototypes.
Injects cross-modal alignment information into candidate moment representations.
Outperforms existing methods on Charades-STA, ActivityNet Captions, and TACoS.
Improves both cross-modal alignment quality and temporal grounding accuracy.

Abstract

Recent advances in vision-language cross-modal learning have substantially improved the performance of video temporal grounding. However, most existing methods directly associate global video features with sentence-level features, overlooking the fact that textual semantics usually correspond to only limited spatio-temporal regions within a video. This limitation often leads to unstable alignment in complex scenarios involving intertwined events and diverse actions. In essence, accurate video temporal grounding requires the joint modeling of fine-grained spatial semantics and heterogeneous temporal event structures. Motivated by this observation, we propose a hierarchical prototype alignment approach that models cross-modal correspondence between video and text through structured intermediate prototype representations. Specifically, the alignment process is decomposed into two complementary stages: object-phrase alignment and event-sentence alignment. In the object-phrase alignment stage, discriminative local visual regions and informative textual words are aggregated to construct object and phrase prototypes, thereby enhancing fine-grained spatial correspondence at the level of entities and localized actions. In the event-sentence alignment stage, object prototypes are further integrated along the temporal dimension to form event prototypes that represent continuous action units, enabling effective alignment with sentence-level semantics and facilitating the modeling of diverse temporal event structures. Building on this, we inject cross-modal alignment information directly into candidate moment aggregation, so that candidate moment representations emphasize query-relevant temporal regions. Extensive experiments on Charades-STA, ActivityNet Captions, and TACoS demonstrate that the proposed method outperforms existing approaches, validating the effectiveness of hierarchical prototype alignment for improving both cross-modal alignment quality and temporal grounding accuracy.
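
The two alignment stages and the alignment-aware moment aggregation described in the abstract can be illustrated with a toy sketch. The abstract does not specify the actual operators, so everything below is an assumption made for illustration: softmax-weighted pooling stands in for prototype aggregation, a sliding-window mean stands in for temporal integration, and cosine similarity to a sentence-level vector stands in for the injected alignment signal. The function names (`aggregate_prototype`, `event_prototypes`, `moment_representation`) are hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_mean(vectors, weights):
    """Weighted sum of equal-length vectors (weights are expected to sum to 1)."""
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def aggregate_prototype(features, saliency):
    """Stage 1 (assumed form): pool region or word features into one
    prototype, weighting each feature by the softmax of its saliency score."""
    return weighted_mean(features, softmax(saliency))

def event_prototypes(object_protos, window):
    """Stage 2 (assumed form): average consecutive per-frame object
    prototypes over a sliding temporal window to form event prototypes."""
    uniform = [1.0 / window] * window
    return [
        weighted_mean(object_protos[i:i + window], uniform)
        for i in range(len(object_protos) - window + 1)
    ]

def moment_representation(clip_feats, sentence_proto, start, end):
    """Alignment-aware aggregation (assumed form): pool clips in
    [start, end), weighting each clip by its similarity to the sentence."""
    span = clip_feats[start:end]
    weights = softmax([cosine(c, sentence_proto) for c in span])
    return weighted_mean(span, weights), weights

# Toy data: 3 frames, each with 2 region features of dimension 2.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
saliency = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # uniform saliency for simplicity

object_protos = [aggregate_prototype(f, s) for f, s in zip(frames, saliency)]
events = event_prototypes(object_protos, window=2)
sentence_proto = [1.0, 0.0]  # stand-in for a sentence-level text prototype
moment, weights = moment_representation(object_protos, sentence_proto, 0, 3)
```

In this toy setup the second frame points in the same direction as the sentence vector, so it receives the largest pooling weight. That is the intended effect of emphasizing query-relevant temporal regions, though a learned model would use trained projections rather than raw cosine similarity.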

Authors

Yun Tian

Xiaobo Guo

Jinsong Wang

Journals

Entropy

Institutions

Chinese Academy of Sciences

Shenzhen Institutes of Advanced Technology

Changchun University of Science and Technology
