What question did this study set out to answer?

The research aims to improve video understanding by modeling object-level temporal relations and enhancing interpretability.

April 15, 2026

Multi-modal Temporal Relation Network for Video Understanding

Key Points

Developed the VideoU-MTR framework for object-level temporal relation modeling.
Introduced query-oriented frame identification combining visual and textual attention.
Implemented a temporal relation module for capturing inter-object dynamics.
Employed cross-modal alignment between visual features and linguistic semantics.
Achieved superior performance on eight benchmarks compared to state-of-the-art methods.
Enhanced interpretability through effective temporal information integration.
Demonstrated improved fine-grained representations of object interactions.
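
The query-oriented frame identification listed above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each frame and the text query have already been embedded as vectors, scores frames by cosine similarity to the query, and keeps the top k in temporal order. The names `cosine`, `select_frames`, and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_frames(frame_embs, query_emb, k):
    """Return indices of the k frames most similar to the query, in temporal order."""
    scores = [cosine(f, query_emb) for f in frame_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy data (assumed): four 2-d frame embeddings and one query embedding.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
selected = select_frames(frames, query, k=2)
```

Re-sorting the selected indices keeps the frames in their original order, so a downstream temporal module would still see a chronologically consistent sequence.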

Abstract

Video understanding endeavors to generate descriptive texts by analyzing structured semantics from dynamic visual sequences, thus facilitating context-aware reasoning and interpretation. Recent advances rely primarily on patch-level visual-textual alignment, which bridges the gap between the visual and textual modalities and enables more comprehensive reasoning. While promising, these methods often struggle to capture object-level semantics and temporal dependencies, resulting in limited interpretability and suboptimal compositional understanding. To address these issues, we propose a novel temporal relation framework for multi-modal video understanding, dubbed VideoU-MTR, which explicitly models object-level temporal relations to facilitate fine-grained and coherent representations of cross-frame object interactions. Specifically, we introduce a query-oriented frame identification mechanism that combines textual and visual attention, allowing the model to dynamically attend to semantically relevant video content across hierarchical levels while filtering out irrelevant information. Furthermore, we employ an explicit temporal relation module that captures fine-grained temporal dependencies and inter-object dynamics by modeling object-centric sequences with time-aware attention and frame-level embeddings. Additionally, we propose a cross-modal alignment adapter that aligns temporally contextualized visual features with linguistic semantics at both the object and frame levels. Extensive experiments on eight benchmarks spanning video question answering (VideoQA), long-term video understanding (LTVU), and video captioning (VideoCap) demonstrate that VideoU-MTR achieves superior performance compared to state-of-the-art methods. Visualization analysis further validates the effectiveness of incorporating temporal information for enhancing video comprehension.
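
The abstract does not spell out how the time-aware attention is computed, so the following is only an illustrative sketch under stated assumptions: one tracked object contributes a feature vector per frame, and self-attention logits are penalized by the temporal distance between frames. The linear distance penalty, the `decay` parameter, and all names here are assumptions, not details from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def time_aware_attention(feats, times, decay=0.5):
    """Self-attention over one object's per-frame features.

    Logits are scaled dot products minus decay * |t_i - t_j| (an assumed
    form of time awareness), so distant frames contribute less.
    """
    d = len(feats[0])
    out = []
    for qi, ti in zip(feats, times):
        logits = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) - decay * abs(ti - tj)
            for kj, tj in zip(feats, times)
        ]
        weights = softmax(logits)
        # Weighted sum of the per-frame features (values = inputs in this sketch).
        out.append([sum(w * v[c] for w, v in zip(weights, feats)) for c in range(d)])
    return out

# Toy object track (assumed): three frames at timestamps 0, 1, and 5.
track = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
timestamps = [0, 1, 5]
contextual = time_aware_attention(track, timestamps)
```

Each output row is a convex combination of the input features, weighted toward frames that are both similar and close in time. A learned model would add query, key, and value projections, which this sketch omits.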

Authors

Zhixuan Wu

Quanxing Zha

Shanshan Zhao

Journals

ACM Transactions on Multimedia Computing Communications and Applications

Institutions

Southern University of Science and Technology

Beijing University of Posts and Telecommunications

Institute of Computing Technology
