To address modality inconsistency, insufficient intra-modal affective representation, and the limited adaptability of conventional fusion strategies in multimodal sentiment analysis, this study proposes ECA-CMF Net, an Efficient Channel Attention-enhanced Conditional Modulation Fusion network. The framework integrates unified indexing-based preprocessing, heterogeneous feature extraction with ECA, and Conditional Modulation Fusion to improve multimodal representation learning and sentiment classification. Specifically, sample-level alignment and modality-specific standardisation are first applied to textual, visual, and acoustic inputs to reduce distribution shifts and noise interference. Then, heterogeneous encoders extract modality-specific features, while ECA adaptively recalibrates sentiment-relevant channels and suppresses redundant information. Finally, the CMF mechanism generates modulation parameters from joint multimodal context to scale and shift modality features, enabling dynamic cross-modal interaction and contribution adjustment. Experiments on CMU-MOSI and CMU-MOSEI show that ECA-CMF Net achieves ACC/F1 scores of 0.8874/0.8870 and 0.7089/0.7008, respectively. Compared with the strongest reproduced baselines, it improves ACC/F1 by 3.40/3.38 percentage points on CMU-MOSI and 1.53/1.85 percentage points on CMU-MOSEI, demonstrating improved multimodal collaboration, adaptive fusion, and robustness.
Yao et al. studied this question.
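The two mechanisms named in the abstract, channel recalibration with ECA and scale-and-shift modulation with CMF, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names (`eca_gate`, `conditional_modulation`, `linear`), the toy inputs, and all weight values are hypothetical. The sketch further assumes that the joint multimodal context is a plain concatenation of the modality vectors and that the modulation parameters come from single linear maps; the abstract does not specify either choice.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def eca_gate(channels, kernel):
    """Reweight channels with an ECA-style gate.

    channels: list of C channels, each a list of feature values.
    kernel:   odd-length 1D convolution weights slid across the channel axis
              (hypothetical values here; learned in a real model).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    pooled = [sum(ch) / len(ch) for ch in channels]
    # Local cross-channel interaction: zero-padded 1D convolution over descriptors.
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    gates = [
        sigmoid(sum(w * padded[c + j] for j, w in enumerate(kernel)))
        for c in range(len(pooled))
    ]
    # Recalibrate: scale every value in a channel by that channel's gate.
    return [[g * v for v in ch] for g, ch in zip(gates, channels)]


def linear(weights, bias, x):
    """Plain affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def conditional_modulation(feature, context, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift one modality's feature using parameters from the context.

    gamma (scale) and beta (shift) are assumed to be linear functions of the
    joint context; the paper's actual parameter generator may differ.
    """
    gamma = linear(w_gamma, b_gamma, context)
    beta = linear(w_beta, b_beta, context)
    return [g * f + b for g, f, b in zip(gamma, feature, beta)]


# Toy inputs (made up for illustration, not taken from CMU-MOSI or CMU-MOSEI).
text = [1.0, 2.0]
visual = [0.5, -0.5]
acoustic = [0.0, 1.0]
context = text + visual + acoustic  # assumed joint context: simple concatenation

# Zero weights with gamma bias 1 and beta bias 0 leave the feature unchanged.
zeros = [[0.0] * len(context) for _ in text]
modulated_text = conditional_modulation(
    text, context, zeros, [1.0, 1.0], zeros, [0.0, 0.0]
)
```

With non-zero weights, `gamma` and `beta` change with the content of all three modalities, which is what lets the contribution of each modality be adjusted per sample rather than fixed by a static fusion rule.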