Multimodal sarcasm detection involves identifying sarcasm across multiple modalities, with the key challenge being modeling incongruity within and between modalities. Current methods often focus on inter-modal incongruity while underexploring intra-modal semantic information. To address this, we propose the Granularity-Based Inter and Intra-Modal Fusion Network (GIIFN). We leverage pre-trained visual and language models to extract semantic features from images and text, and introduce a learnable granularity grouping module to adaptively partition features into multiple semantic granularities. Furthermore, we design a bidirectional cross-attention mechanism to fuse intra-modal and inter-modal features at each granularity level. Experiments demonstrate that our approach achieves state-of-the-art performance.
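The abstract names two mechanisms, adaptive granularity grouping and bidirectional cross-attention, without giving their equations. The sketch below is a minimal illustration of the general idea, not the GIIFN implementation. It assumes soft assignment of features to a fixed set of prototype vectors for the grouping step, and plain scaled dot-product attention for the fusion step. The function names (`soft_group`, `cross_attend`, `bidirectional_fusion`), the prototype vectors, and the toy dimensions are all hypothetical. In a trained model the prototypes would be learned parameters rather than random values.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_group(features, prototypes):
    """Pool N feature vectors into G group vectors (assumed grouping scheme).

    Each feature is softly assigned to every prototype; each group vector is
    the assignment-weighted mean of the features.
    """
    d = len(features[0])
    assign = [softmax([dot(f, p) for p in prototypes]) for f in features]
    groups = []
    for g in range(len(prototypes)):
        mass = sum(a[g] for a in assign)  # always > 0 because softmax is positive
        groups.append(
            [sum(a[g] * f[j] for a, f in zip(assign, features)) / mass for j in range(d)]
        )
    return groups


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query attends over keys_values."""
    d = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys_values])
        out.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)]
        )
    return out


def bidirectional_fusion(text_groups, image_groups):
    """Attend in both directions: text queries image, and image queries text."""
    text_to_image = cross_attend(text_groups, image_groups)
    image_to_text = cross_attend(image_groups, text_groups)
    return text_to_image, image_to_text


def random_features(n, d, rng):
    """Toy stand-in for features from pre-trained encoders (hypothetical)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    text = random_features(5, 4, rng)        # 5 token features, dimension 4
    image = random_features(3, 4, rng)       # 3 patch features, dimension 4
    prototypes = random_features(2, 4, rng)  # 2 granularity groups

    t2i, i2t = bidirectional_fusion(
        soft_group(text, prototypes), soft_group(image, prototypes)
    )
    print(len(t2i), len(t2i[0]), len(i2t), len(i2t[0]))
```

Calling `cross_attend(groups, groups)` on a single modality gives the intra-modal counterpart under the same assumptions.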
Mingxuan Chen
China Meteorological Administration
Huarong Tang
Chen Sun
Scientific Reports
Chen et al. studied this question.
synapsesocial.com/papers/69b4ba1818185d8a39802ac8 — DOI: https://doi.org/10.1038/s41598-026-43363-5