Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60% to 70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.
Zhang et al. studied this question.
www.synapsesocial.com/papers/68f43f09854d1061a58ac9f6 — DOI: https://doi.org/10.48550/arxiv.2504.16427
Hanlei Zhang
Zhuohang Li
Yeshuang Zhu