What type of study is this?

This is a Literature Review study.

October 1, 2025

A Survey on Video Temporal Grounding with Multimodal Large Language Model

Key Points

VTG-MLLMs are gradually surpassing traditional fine-tuned methods: they achieve competitive performance and generalize well across zero-shot, multi-task, and multi-domain settings.
The survey identifies three key aspects: functional roles of MLLMs, training paradigms, and video feature processing techniques.
Benchmark datasets and evaluation protocols are discussed, along with empirical findings in the field.
Limitations of current VTG-MLLM approaches are identified, and promising directions for future research are proposed.

Abstract

Recent advances in video temporal grounding (VTG) have significantly enhanced fine-grained video understanding, primarily driven by multimodal large language models (MLLMs). With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also excel in generalization across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task adaptation; and 3) video feature processing techniques, which determine spatiotemporal representation effectiveness. We further discuss benchmark datasets and evaluation protocols, and summarize empirical findings. Finally, we identify existing limitations and propose promising research directions. For additional resources and details, readers are encouraged to visit our repository at https://github.com/ki-lw/Awesome-MLLMs-for-Video-Temporal-Grounding.

Authors

Jianlong Wu

Wei Liu

Ye Liu

Journals

IEEE Transactions on Pattern Analysis and Machine Intelligence

Institutions

Peking University

Hong Kong Polytechnic University

Harbin Institute of Technology
