October 19, 2025 · Open Access

Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

Key Points

Even fine-tuned models reach only about 60% to 70% accuracy, indicating the limitations of current MLLMs in understanding complex human language.
MMLA features over 61K multimodal utterances covering six dimensions: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior.
Eight mainstream branches of LLMs and MLLMs were evaluated with zero-shot inference, supervised fine-tuning, and instruction tuning.
MMLA serves as a foundational resource for advancing multimodal language analysis and exploring model capabilities.

Abstract

Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60% to 70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.

Cite this study

Zhang et al. studied this question.

www.synapsesocial.com/papers/68f43f09854d1061a58ac9f6 — DOI: https://doi.org/10.48550/arxiv.2504.16427

Also consider

Synapse has enriched 5 closely related papers on similar questions. Consider them for comparative context:

A Survey of Multimodal Large Language Model from A Data-centric Perspective · 2024 · 4 citations
Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models · 2024 · 1 citation
Do Multimodal Large Language Models and Humans Ground Language Similarly? · 2024 · 3 citations
MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms · 2024 · 1 citation
Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey

Authors

Hanlei Zhang

Zhuohang Li

Yeshuang Zhu
