What question did this study set out to answer?

This research explores the challenge of understanding human intent in videos through a new task named IntentQA.

May 8, 2026

IntentQA: Intent Question Answering in Videos by Cognitive Context Reasoning

Key points

Introduced IntentQA, a new task for understanding human intent in videos.
Developed a large-scale VideoQA dataset for intent reasoning.
Introduced the X-CaVIR framework, which integrates situational, contrastive, and commonsense contexts.
Evaluated model robustness with a novel Contrast Performance Decline metric and five contrast sets generated by LLMs.
Showed that X-CaVIR outperforms state-of-the-art baselines.
Demonstrated stability against perturbations on the contrast sets.
Improved the interpretability of the reasoning process through a transparent pipeline that combines video captions with VQA model outputs.

Abstract

Video understanding requires intelligent agents to transcend mere recognition of visual facts and comprehend the underlying intents behind human actions, often termed the "dark matter" of social intelligence. To bridge the gap between visual observation and intent reasoning, we introduce a novel task, IntentQA, and contribute a large-scale VideoQA dataset specifically tailored for this purpose. Recognizing that standard metrics may overestimate capabilities due to dataset biases, we go beyond simple accuracy to rigorously evaluate model robustness: we augment the benchmark by generating five distinct contrast sets via Large Language Models (LLMs) and introducing a "Contrast Performance Decline" metric. We propose the X-CaVIR (eXplainable Context-aware Video Intent Reasoning) framework, which leverages three types of "Cognitive Context" to enhance video analysis: i) Situational Context via a cross-modal Video Query Language (VQL) module, ii) Contrastive Context via a Contrastive Learning module, and iii) Commonsense Context via a Commonsense Reasoning module. Crucially, to overcome the opacity of traditional black-box models, we refine the integration of LLMs within X-CaVIR by employing a transparent pipeline that synergizes video captions with VQA model outputs. This approach not only improves performance by effectively utilizing rich commonsense knowledge but also renders the reasoning process explicitly interpretable. Extensive experiments demonstrate the effectiveness of our components, the superiority of X-CaVIR over state-of-the-art baselines, and its stability against perturbations on the contrast sets. The dataset and code are open-sourced at: https://github.com/JoseponLee/IntentQA.
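
The abstract names a "Contrast Performance Decline" metric but does not define it. As a rough illustration only, the sketch below assumes the metric is the relative drop in accuracy when a model moves from the original test set to a contrast set. The function names, the formula, and the toy data are all hypothetical and are not taken from the paper or its repository.

```python
def accuracy(predictions, answers):
    """Fraction of questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def performance_decline(original_acc, contrast_acc):
    """ASSUMED form of the metric: relative accuracy drop on a contrast set.

    0.0 means no degradation; larger values mean the model is less robust.
    The paper's actual definition may differ.
    """
    if original_acc <= 0:
        raise ValueError("original accuracy must be positive")
    return (original_acc - contrast_acc) / original_acc


if __name__ == "__main__":
    # Toy multiple-choice answers (option indices); purely illustrative data.
    gold = [0, 2, 1, 3, 4]
    preds_original = [0, 2, 1, 3, 0]   # 4 of 5 correct on the original set
    preds_contrast = [0, 2, 0, 1, 0]   # 2 of 5 correct on a contrast set

    acc_orig = accuracy(preds_original, gold)
    acc_contrast = accuracy(preds_contrast, gold)
    print(f"original accuracy: {acc_orig:.2f}")
    print(f"contrast accuracy: {acc_contrast:.2f}")
    print(f"assumed decline:   {performance_decline(acc_orig, acc_contrast):.2f}")
```

Under this reading, a model that leans on dataset biases would show a large decline on the contrast sets, while a robust model would stay close to zero.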

Authors

J H Li

Ping Wei

Wenjuan Han

Journals

IEEE Transactions on Pattern Analysis and Machine Intelligence

Institutions

Xi'an Jiaotong University

Beijing Jiaotong University

Beijing Academy of Artificial Intelligence
