September 10, 2025

VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

Key Points

Most multimodal large language models (MLLMs) hallucinate on video, particularly regarding actions, temporal sequences, and scene transitions.
VIDHALLUC, a benchmark of 5,002 paired videos, assesses hallucinations across three dimensions: action, temporal sequence, and scene transition.
DINO-HEAL improves performance on VIDHALLUC by an average of 3.02% across all tasks.
DINO-HEAL is a training-free method that uses spatial saliency from DINOv2 to reweight visual features during inference.

Abstract

Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that MLLM visual encoders often fail to distinguish visually different yet semantically similar video pairs, we introduce VIDHALLUC, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. It consists of 5,002 videos, paired to highlight cases prone to hallucinations. VIDHALLUC assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. Comprehensive testing shows that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference. Our results show that DINO-HEAL consistently improves performance on VIDHALLUC, achieving an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the VIDHALLUC benchmark and DINO-HEAL code are available at https://people-robots.github.io/vidhalluc.
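
The abstract describes reweighting visual features with a spatial saliency map at inference time. The sketch below illustrates that general idea only; it is not the authors' implementation. The function name `reweight_features`, the min-max normalization, and the `1 + weight` scaling rule are all assumptions chosen for illustration, and the saliency scores are taken as given rather than computed from DINOv2.

```python
from typing import List, Sequence


def reweight_features(
    features: Sequence[Sequence[float]],
    saliency: Sequence[float],
    eps: float = 1e-8,
) -> List[List[float]]:
    """Scale each patch feature vector by a normalized saliency weight.

    features: one feature vector per spatial patch of a frame.
    saliency: one non-negative saliency score per patch (hypothetical input;
              how it is obtained is outside this sketch).
    """
    if len(features) != len(saliency):
        raise ValueError("features and saliency must cover the same patches")

    # Min-max normalize saliency to roughly [0, 1] (an illustrative choice).
    s_min, s_max = min(saliency), max(saliency)
    span = s_max - s_min
    weights = [(s - s_min) / (span + eps) for s in saliency]

    # Assumed rule: keep the original feature and amplify salient patches.
    return [[x * (1.0 + w) for x in row] for row, w in zip(features, weights)]


if __name__ == "__main__":
    patch_features = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
    patch_saliency = [0.0, 1.0, 2.0, 4.0]
    for row in reweight_features(patch_features, patch_saliency):
        print(row)
```

Because the rule only rescales existing features, it needs no gradient updates, which is consistent with the training-free framing in the abstract. With a uniform saliency map, every weight is zero and the features pass through unchanged.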

Authors

Chaoyu Li

Eun Woo Im

Pooyan Fazli

Institutions

Arizona State University
