What type of study is this?

This is an experimental study.

September 29, 2025 · Open Access

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Key Points

Time-R1 significantly elevates performance on temporal video grounding queries through reinforcement learning, achieving state-of-the-art results.
With only 2.5K training data, the model generalizes better, as shown by extensive experiments across multiple downstream datasets.
Data-efficient post-training strategies on a curated RL-friendly dataset train the model to progressively comprehend difficult samples.
TVGBench is a small yet comprehensive benchmark for evaluating large vision-language models, covering 11 query types with balanced distributions across both videos and queries.

Abstract

Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their abilities to generalize remain limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore data-efficient post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small yet comprehensive benchmark for LVLM evaluation, assessing 11 types of queries and featuring balanced distributions across both videos and queries. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training data, while improving its general video understanding capabilities.
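
To make the idea of a verifiable reward for TVG concrete, here is a minimal sketch based on temporal intersection-over-union (IoU), the standard overlap measure between a predicted and a ground-truth segment. The abstract does not specify the reward used by Time-R1, so the function names `temporal_iou` and `recall_at_iou` and the choice of IoU as the score are assumptions made for illustration only.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time segments.

    Hypothetical helper: returns 0.0 for malformed segments (end <= start),
    so a rule-based checker can score any model output without failing.
    """
    p_start, p_end = pred
    g_start, g_end = gt
    if p_end <= p_start or g_end <= g_start:
        return 0.0
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union


def recall_at_iou(preds: Sequence[Segment], gts: Sequence[Segment],
                  threshold: float) -> float:
    """Fraction of queries whose predicted segment reaches the IoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    # Predicted 0-10 s against ground truth 5-15 s: overlap 5 s, union 15 s.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(recall_at_iou([(0.0, 10.0), (2.0, 4.0)],
                        [(0.0, 10.0), (8.0, 9.0)], threshold=0.5))
```

A score of this kind can be checked by a fixed rule against the annotated segment, which is what makes it usable as a reward without a learned reward model.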

Cite this study

Wang et al. (2025) studied this question.

www.synapsesocial.com/papers/68da58d1c1728099cfd10e97 — DOI: https://doi.org/10.48550/arxiv.2503.13377

Authors

Ye Wang

Ziheng Wang

Boshen Xu
