Video understanding endeavors to generate descriptive texts by analyzing structured semantics from dynamic visual sequences, thus facilitating context-aware reasoning and interpretation. Recent advancements primarily rely on patch-level visual-textual alignment, bridging the gap between visual and textual modalities and therefore enabling more comprehensive reasoning. While promising, these methods often struggle to capture object-level semantics and temporal dependencies, resulting in limited interpretability and suboptimal compositional understanding. To address these issues, we propose a novel temporal relation framework for multi-modal video understanding, dubbed VideoU-MTR, which explicitly models object-level temporal relations to facilitate fine-grained and coherent representations of cross-frame object interactions. Specifically, we introduce a query-oriented frame identification mechanism that combines textual and visual attention, allowing the model to dynamically attend to semantically relevant video content across hierarchical levels while filtering out irrelevant information. Furthermore, we employ an explicit temporal relation module that captures fine-grained temporal dependencies and inter-object dynamics by modeling object-centric sequences with time-aware attention and frame-level embeddings. Additionally, we propose a cross-modal alignment adapter that aligns temporally contextualized visual features with linguistic semantics at both the object and frame levels. Extensive experiments on eight benchmarks spanning video question answering (VideoQA), long-term video understanding (LTVU), and video captioning (VideoCap) demonstrate that VideoU-MTR achieves superior performance compared to state-of-the-art methods. Visualization analysis further validates the effectiveness of incorporating temporal information for enhancing video comprehension.
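As a rough illustration of the idea behind query-oriented frame identification, the sketch below scores each frame against a text query and keeps the most relevant ones. It is not the VideoU-MTR implementation: the abstract does not specify how the attention is computed, so cosine similarity followed by a softmax stands in for it here. The function names (`cosine`, `softmax`, `select_frames`), the toy two-dimensional embeddings, and the top-k selection rule are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_frames(query_vec, frame_vecs, k):
    """Hypothetical query-guided frame selection.

    Scores every frame embedding against the query embedding, normalizes
    the scores into attention-like weights, and returns the indices of the
    k highest-weighted frames in temporal order, plus the weights.
    """
    if not frame_vecs:
        return [], []
    scores = [cosine(query_vec, f) for f in frame_vecs]
    weights = softmax(scores)
    ranked = sorted(range(len(frame_vecs)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k]), weights


# Toy embeddings (assumed): frames 0 and 2 point in the query's direction.
query = [1.0, 0.0]
frames = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
kept, weights = select_frames(query, frames, k=2)
print(kept)
```

Returning the kept indices in temporal order, rather than in score order, preserves the frame sequence for any downstream temporal modeling; that ordering is a design choice of this sketch, not something stated in the abstract.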
Zhixuan Wu
Quanxing Zha
Shanshan Zhao
ACM Transactions on Multimedia Computing Communications and Applications
Southern University of Science and Technology
Beijing University of Posts and Telecommunications
Institute of Computing Technology
Wu et al. studied this question.
www.synapsesocial.com/papers/69df2b2ce4eeef8a2a6b0182 — DOI: https://doi.org/10.1145/3805042