Existing methods in multimodal sentiment analysis (MSA) primarily focus on designing sophisticated fusion networks to obtain multimodal representations. However, they may overlook critical challenges such as interference from task-irrelevant background information and performance bottlenecks arising from inter-modal semantic conflict. To address these issues, this paper proposes the Refined Separated Representation Learning and Conflict Alleviation Network (RSCAN), a framework that extracts task-relevant features and fuses them in a conflict-aware way. Specifically, we design a two-layer Refined Separated Representation Learning Network (RSLN) that, for each modality, interactively separates the features relevant to the prediction task from background redundancies. This process is reinforced by contrastive learning, which alleviates inter-modal heterogeneity and promotes feature alignment both before and after separation. In addition, we design a text-dominated Conflict Alleviation Network (CAN) that explicitly quantifies cross-modal conflict and adaptively adjusts the contributions of the other modalities to generate high-quality fused representations for sentiment prediction, thereby enhancing the model’s robustness in conflicting scenarios. Extensive experiments on three benchmark datasets (CMU-MOSI, CMU-MOSEI, and CH-SIMS) demonstrate the effectiveness of our task-oriented feature refinement and conflict-aware fusion strategy in improving overall sentiment analysis performance.
Zheng et al. studied this question.
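
To make the idea of text-dominated, conflict-aware fusion more concrete, the following is a minimal sketch and not the authors' CAN module. The abstract does not say how conflict is quantified or how contributions are adjusted, so everything here is an assumption for illustration: conflict is taken as a rescaled cosine dissimilarity between the text feature and each other modality, and the hypothetical helpers `conflict_score` and `fuse` down-weight a modality in proportion to that conflict before adding it to the text feature.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors, clamped to [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as uncorrelated
    return max(-1.0, min(1.0, dot / (norm_u * norm_v)))


def conflict_score(text, other):
    """Assumed conflict measure: 0 when aligned with text, 1 when opposed."""
    return (1.0 - cosine(text, other)) / 2.0


def fuse(text, others):
    """Text-dominated fusion: start from the text feature and add each
    other modality scaled by (1 - conflict with text).

    `others` maps a modality name to its feature vector.
    Returns the fused vector and the per-modality weights.
    """
    fused = list(text)
    weights = {}
    for name, feat in others.items():
        weight = 1.0 - conflict_score(text, feat)
        weights[name] = weight
        fused = [f + weight * x for f, x in zip(fused, feat)]
    return fused, weights
```

Under these assumptions, a modality whose feature points in the same direction as the text feature contributes fully, while one pointing in the opposite direction is suppressed entirely. A learned gate would replace the fixed cosine rule in a trained model; the fixed rule is used here only to keep the sketch self-contained.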