June 19, 2024 · Open Access

Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

Abstract

Can large multimodal models have a human-like ability for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs). LLMs can reason about people's mental states by solving various text-based ToM tasks that ask questions about the actors' mental states (e.g., beliefs, desires, intentions). However, human reasoning in the wild is often grounded in dynamic scenes that unfold over time. Thus, we consider video as a new medium for examining spatio-temporal ToM reasoning ability. Specifically, we ask explicit probing questions about videos with abundant social and emotional reasoning content. We develop a pipeline that lets a multimodal LLM perform ToM reasoning over video and text. We also enable explicit ToM reasoning by retrieving the key frames used to answer a ToM question, which reveals how multimodal LLMs reason about ToM.
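
The abstract does not say how key frames are retrieved. As a rough illustration only, the sketch below assumes each video frame and the ToM question have already been embedded as vectors, and picks the frames whose embeddings are most similar to the question. The function names, the cosine-similarity scoring, and the toy vectors are all assumptions made for this example; they are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_key_frames(question_vec, frame_vecs, k=2):
    """Hypothetical helper: return the indices of the k frames most similar
    to the question embedding, reported in temporal (index) order."""
    ranked = sorted(
        range(len(frame_vecs)),
        key=lambda i: cosine(question_vec, frame_vecs[i]),
        reverse=True,
    )
    return sorted(ranked[:k])

# Toy 2-D embeddings (made up): frames 2 and 3 point in the question's direction.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
question = [0.0, 1.0]
key_frames = retrieve_key_frames(question, frames, k=2)
print(key_frames)  # [2, 3]
```

In a setup like this, the selected frames could be passed to the multimodal LLM together with the question, so that the frames behind an answer can be inspected.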

Cite this study

Chen et al. studied this question.

www.synapsesocial.com/papers/68e64297b6db6435875d4488 — DOI: https://doi.org/10.48550/arxiv.2406.13763

Authors

Zhawnen Chen

Tianchun Wang

Yizhou Wang
