Multi-modal sentiment analysis (MSA) aims to identify users’ emotional states by integrating textual, acoustic, and visual modalities. However, existing methods often suffer from insufficient cross-modal interaction, rigid fusion strategies, and limited sensitivity to subtle differences between sentiment levels, all of which restrict model generalization and robustness. To address these issues, this paper proposes CrossSent, an MSA framework that combines cross-modal attention with pairwise ranking regularization. Specifically, a Gated Multi-modal Residual Adapter (GMRA) dynamically integrates heterogeneous features through gated residual connections, mitigating modality asynchrony and noise interference. A Monotonic Pairwise Ranking (MPR) regularization sharpens discrimination among fine-grained sentiment levels. In addition, an Error-Interval Ordinal Inconsistency (EIOI) loss is designed to tolerate small prediction deviations, improving stability and robustness. Experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS show that CrossSent consistently surpasses state-of-the-art baselines across key metrics. It achieves 89.78% binary accuracy and 52.1% seven-class accuracy on CMU-MOSI, 87.72% and 54.7% on CMU-MOSEI, and 80.41%, 62.36%, and 43.54% binary, three-level, and five-level accuracy on CH-SIMS, with mean absolute errors reduced to 0.563, 0.513, and 0.408 on the three datasets, respectively. We further report ordinal-consistency measures (quadratic weighted kappa, QWK, and level-jump statistics) to complement conventional metrics and quantify level-wise agreement. These results support the effectiveness and generalization capability of the proposed framework.
Liu et al. studied this question.
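To make two of the ingredients named above concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a pairwise ranking regularizer can be approximated by a margin hinge over all sample pairs whose labels are strictly ordered, and it computes QWK from its standard definition. The function names `pairwise_ranking_penalty` and `quadratic_weighted_kappa`, the `margin` parameter, and its default value are hypothetical choices made for illustration; the exact form of MPR in CrossSent may differ.

```python
import numpy as np


def pairwise_ranking_penalty(scores, labels, margin=0.1):
    """Hinge penalty over ordered pairs (an assumed stand-in for a ranking regularizer).

    For every pair (i, j) with labels[i] > labels[j], the predicted score of i
    should exceed that of j by at least `margin`; violations are penalized linearly.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # score_diff[i, j] = scores[i] - scores[j], built by broadcasting.
    score_diff = scores[:, None] - scores[None, :]
    # ordered[i, j] is True when sample i has a strictly higher label than j.
    ordered = labels[:, None] > labels[None, :]
    if not ordered.any():
        return 0.0  # no ordered pairs, so there is nothing to rank
    hinge = np.maximum(0.0, margin - score_diff[ordered])
    return float(hinge.mean())


def quadratic_weighted_kappa(y_true, y_pred, num_levels):
    """Quadratic weighted kappa for integer levels in 0 .. num_levels - 1."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    # Observed confusion matrix: rows are true levels, columns are predictions.
    observed = np.zeros((num_levels, num_levels), dtype=float)
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1.0
    # Expected matrix under independence of the two marginal histograms.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Quadratic disagreement weights, normalized to [0, 1].
    idx = np.arange(num_levels)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_levels - 1) ** 2
    expected_disagreement = (weights * expected).sum()
    if expected_disagreement == 0:
        # Convention chosen here: all mass sits on one shared level.
        return 1.0
    return float(1.0 - (weights * observed).sum() / expected_disagreement)


if __name__ == "__main__":
    levels = [0, 1, 2, 3]
    print(quadratic_weighted_kappa(levels, levels, num_levels=4))
    print(pairwise_ranking_penalty([2.0, 1.0, 0.0], [2, 1, 0], margin=0.5))
```

Under these assumptions, the penalty is zero whenever predicted scores preserve the label order with the required margin, and QWK penalizes a prediction that is two levels off four times as heavily as one that is a single level off, which is why it complements plain accuracy for ordinal sentiment labels.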