What type of study is this?

This is a quantitative study.

October 2, 2025

Open Access

Evaluating Multimodal Large Language Models on Spoken Sarcasm Understanding

Key Points

Multimodal large language models (MLLMs) such as Qwen-Omni show competitive zero-shot and fine-tuned sarcasm detection performance on both English and Chinese data.
Audio-based models achieved the strongest unimodal performance, while text-audio and audio-vision combinations outperformed both unimodal and trimodal models.
Models were evaluated systematically on two existing datasets, MUStARD++ (English) and MCSD 1.0 (Chinese), in zero-shot, few-shot, and LoRA fine-tuning settings.
The findings highlight the potential of MLLMs for cross-lingual sarcasm understanding that draws on audio, visual, and textual cues.

Abstract

Sarcasm detection remains a challenge in natural language understanding, as sarcastic intent often relies on subtle cross-modal cues spanning text, speech, and vision. While prior work has primarily focused on textual or visual-textual sarcasm, comprehensive audio-visual-textual sarcasm understanding remains underexplored. In this paper, we systematically evaluate large language models (LLMs) and multimodal LLMs for sarcasm detection on English (MUStARD++) and Chinese (MCSD 1.0) in zero-shot, few-shot, and LoRA fine-tuning settings. In addition to direct classification, we explore models as feature encoders, integrating their representations through a collaborative gating fusion module. Experimental results show that audio-based models achieve the strongest unimodal performance, while text-audio and audio-vision combinations outperform unimodal and trimodal models. Furthermore, MLLMs such as Qwen-Omni show competitive zero-shot and fine-tuned performance. Our findings highlight the potential of MLLMs for cross-lingual, audio-visual-textual sarcasm understanding.
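
The abstract mentions fusing encoder representations through a collaborative gating fusion module but does not describe its internals. As a rough illustration only, the sketch below shows a generic element-wise gated fusion of two modality embeddings. It is not the authors' module: the function name `gated_fusion`, the parameters `w_gate` and `b_gate`, and the toy dimensions are all assumptions made for this example.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text_feat, audio_feat, w_gate, b_gate):
    """Fuse two equal-length embeddings with a learned element-wise gate.

    Hypothetical sketch, not the paper's module.
    text_feat, audio_feat: lists of d floats (one embedding per modality).
    w_gate: 2d x d weight matrix (list of rows); b_gate: list of d biases.
    """
    joint = text_feat + audio_feat  # list concatenation, length 2d
    fused = []
    for j in range(len(text_feat)):
        # Gate pre-activation for output dimension j sees both modalities.
        pre = b_gate[j] + sum(joint[i] * w_gate[i][j] for i in range(len(joint)))
        g = sigmoid(pre)
        # Convex combination: g weights text, (1 - g) weights audio.
        fused.append(g * text_feat[j] + (1.0 - g) * audio_feat[j])
    return fused


# Toy usage with random features and small random gate weights.
random.seed(0)
dim = 4
text_feat = [random.gauss(0.0, 1.0) for _ in range(dim)]
audio_feat = [random.gauss(0.0, 1.0) for _ in range(dim)]
w_gate = [[random.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(2 * dim)]
b_gate = [0.0] * dim
fused = gated_fusion(text_feat, audio_feat, w_gate, b_gate)
```

Because each gate value lies strictly between 0 and 1, every fused coordinate falls between the corresponding text and audio coordinates. In a trained model the gate weights would be learned jointly with the classifier, and the same pattern could be applied to other modality pairs such as audio and vision.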

Cite this study

Zhu et al. (2025) studied this question.

www.synapsesocial.com/papers/68de5da283cbc991d0a20a05 — DOI: https://doi.org/10.48550/arxiv.2509.15476

Authors

Li Zhu

Xiyuan Gao

Yuqing Zhang
