Video temporal grounding (VTG) aims to localize a semantically relevant temporal segment within an untrimmed video based on a natural language query. The task continues to face challenges arising from cross-modal semantic misalignment, which is largely attributed to redundant visual content in sensor-acquired video streams, linguistic ambiguity, and discrepancies in modality-specific representations. Most existing approaches rely on intra-modal feature modeling, processing video and text independently throughout the representation learning stage. However, this isolation undermines semantic alignment by neglecting the potential of cross-modal interactions. In practice, a natural language query typically corresponds to spatiotemporal content in video signals collected through camera-based sensing systems, encompassing a particular sequence of frames and its associated salient subregions. We propose a text-guided visual representation optimization framework tailored to enhance semantic interpretation over video signals captured by visual sensors. This framework leverages textual information to focus on spatiotemporal video content, thereby narrowing the cross-modal gap. Built upon the unified cross-modal embedding space provided by CLIP, our model leverages video data from sensing devices to structure representations and introduces two dedicated modules to semantically refine visual representations across spatial and temporal dimensions. First, we design a Spatial Visual Representation Optimization (SVRO) module to learn spatial information within intra-frames. It selects salient patches related to the text, capturing more fine-grained visual details. Second, we introduce a Temporal Visual Representation Optimization (TVRO) module to learn temporal relations from inter-frames. Temporal triplet loss is employed in TVRO to enhance attention on text-relevant frames and capture clip semantics. Additionally, a self-supervised contrastive loss is introduced at the clip–text level to improve inter-clip discrimination by maximizing semantic variance during training. Experiments on Charades-STA, ActivityNet Captions, and TACoS, widely used benchmark datasets, demonstrate that our method outperforms state-of-the-art methods across multiple metrics.
Building similarity graph...
Analyzing shared references across papers
Loading...
Yun Tian
Guo Xiao-bo
Jinsong Wang
Sensors
Changchun University of Science and Technology
Building similarity graph...
Analyzing shared references across papers
Loading...
Tian et al. (Wed,) studied this question.
www.synapsesocial.com/papers/68c19fa854b1d3bfb60db855 — DOI: https://doi.org/10.3390/s25154704
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: