September 10, 2025 · Open Access

Text-Guided Visual Representation Optimization for Sensor-Acquired Video Temporal Grounding

Key Points

The method enhances semantic interpretation by optimizing video representations according to text inputs.
Spatial Visual Representation Optimization selects salient visual patches, while Temporal Visual Representation Optimization enhances attention on relevant frames.
Experiments on Charades-STA, ActivityNet Captions, and TACoS show that the method outperforms previous state-of-the-art methods across multiple metrics.
The approach addresses cross-modal challenges by leveraging both spatial and temporal aspects of video data.

Abstract

Video temporal grounding (VTG) aims to localize a semantically relevant temporal segment within an untrimmed video based on a natural language query. The task continues to face challenges arising from cross-modal semantic misalignment, which is largely attributed to redundant visual content in sensor-acquired video streams, linguistic ambiguity, and discrepancies in modality-specific representations. Most existing approaches rely on intra-modal feature modeling, processing video and text independently throughout the representation learning stage. However, this isolation undermines semantic alignment by neglecting the potential of cross-modal interactions. In practice, a natural language query typically corresponds to spatiotemporal content in video signals collected through camera-based sensing systems, encompassing a particular sequence of frames and its associated salient subregions. We propose a text-guided visual representation optimization framework tailored to enhance semantic interpretation over video signals captured by visual sensors. This framework leverages textual information to focus on spatiotemporal video content, thereby narrowing the cross-modal gap. Built upon the unified cross-modal embedding space provided by CLIP, our model leverages video data from sensing devices to structure representations and introduces two dedicated modules to semantically refine visual representations across spatial and temporal dimensions. First, we design a Spatial Visual Representation Optimization (SVRO) module to learn spatial information within individual frames. It selects salient patches related to the text, capturing more fine-grained visual details. Second, we introduce a Temporal Visual Representation Optimization (TVRO) module to learn temporal relations across frames. A temporal triplet loss is employed in TVRO to enhance attention on text-relevant frames and capture clip semantics. Additionally, a self-supervised contrastive loss is introduced at the clip–text level to improve inter-clip discrimination by maximizing semantic variance during training. Experiments on three widely used benchmark datasets, Charades-STA, ActivityNet Captions, and TACoS, demonstrate that our method outperforms state-of-the-art methods across multiple metrics.
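
The two ideas named in the abstract, text-guided patch selection and a temporal triplet loss, can be illustrated with a minimal sketch. This is not the authors' implementation: the top-k selection rule, the cosine-similarity scoring, the hinge form of the loss, the margin value, and every function name and toy embedding below are assumptions made purely for illustration, since the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_salient_patches(patches, text, k):
    """Return indices of the k patches most similar to the text embedding.

    Assumed top-k rule; the paper's actual SVRO selection may differ.
    """
    scores = [cosine(p, text) for p in patches]
    ranked = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore the original patch order


def temporal_triplet_loss(text, pos_frame, neg_frame, margin=0.2):
    """Hinge-style triplet loss with the text embedding as the anchor.

    The loss is zero once the text-relevant frame is more similar to the
    text than the irrelevant frame by at least `margin` (assumed value).
    """
    return max(0.0, margin - cosine(text, pos_frame) + cosine(text, neg_frame))


# Toy 2-D embeddings (hypothetical values, for illustration only).
text = [1.0, 0.0]
patches = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]

kept = select_salient_patches(patches, text, k=2)
loss_easy = temporal_triplet_loss(text, [1.0, 0.0], [0.0, 1.0])
loss_hard = temporal_triplet_loss(text, [0.0, 1.0], [1.0, 0.0])
```

In this toy setting the two patches pointing most nearly in the direction of the text embedding are kept, and the triplet loss vanishes only when the text-relevant frame already scores higher than the irrelevant one by the margin. In a CLIP-based model the embeddings would come from the image and text encoders rather than being hand-written.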

Authors

Yun Tian

Guo Xiao-bo

Jinsong Wang

Journals

Sensors

Institutions

Changchun University of Science and Technology
