February 17, 2024 | Open Access

Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

Abstract

Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts are being made to transfer these capabilities to the video modality, producing models termed Video-LLMs. However, existing Video-LLMs capture only coarse-grained semantics and cannot effectively handle tasks that require comprehending or localizing specific video segments. To address these limitations, we propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in fine-grained, temporally grounded comprehension and localization.

Cite this study

Long et al. (2024) studied this question.

www.synapsesocial.com/papers/68e78cdeb6db6435876feada — DOI: https://doi.org/10.48550/arxiv.2402.11435

Authors

Qian Long

Juncheng Li

Yu Wu
