What does this research mean for the field?

CrossSent improves multi-modal sentiment analysis by strengthening cross-modal interaction, and it reports state-of-the-art performance on sentiment classification tasks.

What question did this study set out to answer?

The aim is to enhance multi-modal sentiment analysis by improving cross-modal interactions and sensitivity to subtle sentiment variations.

March 13, 2026 · Open Access

CrossSent: Cross-Modal Attention with Pairwise Ranking Regularization for Multi-Modal Sentiment

Key Points

Targets stronger cross-modal interaction and greater sensitivity to subtle sentiment variations in multi-modal sentiment analysis
Developed the CrossSent framework, which integrates cross-modal attention with pairwise ranking regularization
Introduced a Gated Multi-modal Residual Adapter (GMRA) that dynamically integrates heterogeneous features through gated residual connections
Implemented Monotonic Pairwise Ranking (MPR) regularization to improve discrimination among fine-grained sentiment levels
Designed an Error-Interval Ordinal Inconsistency (EIOI) loss that tolerates small prediction deviations, improving stability and robustness
Achieved 89.78% binary accuracy and 52.1% seven-class accuracy on CMU-MOSI
Achieved 87.72% binary accuracy and 54.7% seven-class accuracy on CMU-MOSEI
Reported reduced mean absolute errors of 0.563, 0.513, and 0.408
Reported ordinal-consistency measures (QWK and level-jump statistics) to quantify level-wise agreement
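The metrics named in the key points can be illustrated with a small, self-contained sketch. This is not the authors' evaluation code. It assumes continuous sentiment scores in [-3, 3] (the usual CMU-MOSI convention), a negative vs. non-negative split for binary accuracy, and round-half-up binning for the seven-class setting; the function names and the sample scores are hypothetical.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between gold and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_accuracy(y_true, y_pred):
    """Negative vs. non-negative agreement (an assumed convention)."""
    hits = sum((t >= 0) == (p >= 0) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def to_level(score, low=-3, high=3):
    """Clip to [low, high], then round half up to an integer level."""
    clipped = min(max(score, low), high)
    return math.floor(clipped + 0.5)


def seven_class_accuracy(y_true, y_pred):
    """Share of samples whose binned level matches the gold level."""
    hits = sum(to_level(t) == to_level(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def quadratic_weighted_kappa(a, b, n_levels):
    """QWK between two lists of integer levels in 0..n_levels-1."""
    n = len(a)
    observed = [[0] * n_levels for _ in range(n_levels)]
    for i, j in zip(a, b):
        observed[i][j] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_levels))
              for j in range(n_levels)]
    num = den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            # Disagreement weight grows with the squared level distance.
            w = (i - j) ** 2 / (n_levels - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / n
    # den == 0 only when both raters use one identical level throughout.
    return 1.0 if den == 0 else 1.0 - num / den


# Hypothetical scores, for illustration only.
gold = [-2.0, -0.5, 0.0, 1.2, 2.6]
pred = [-1.6, 0.3, 0.4, 1.0, 1.9]
print(mean_absolute_error(gold, pred))
print(binary_accuracy(gold, pred))
print(seven_class_accuracy(gold, pred))
# Shift levels from -3..3 to 0..6 before computing QWK.
print(quadratic_weighted_kappa([to_level(t) + 3 for t in gold],
                               [to_level(p) + 3 for p in pred], 7))
```

Unlike plain accuracy, QWK penalizes a prediction more the further it lands from the gold level, which is why it is a natural complement when sentiment labels are ordered.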

Abstract

Multi-modal sentiment analysis (MSA) aims to accurately identify users’ emotional states by integrating textual, acoustic, and visual modalities. However, existing methods often suffer from insufficient cross-modal interaction, rigid fusion strategies, and limited sensitivity to subtle sentiment-level differences, which severely restrict model generalization and robustness. To address these issues, this paper proposes CrossSent, a multi-modal sentiment analysis framework that combines cross-modal attention with pairwise ranking regularization. Specifically, a Gated Multi-modal Residual Adapter (GMRA) is introduced to dynamically integrate heterogeneous features through gated residual connections, effectively mitigating modality asynchrony and noise interference. Meanwhile, a Monotonic Pairwise Ranking (MPR) regularization enhances discrimination among fine-grained sentiment levels. Furthermore, an Error-Interval Ordinal Inconsistency (EIOI) loss is designed to tolerate small prediction deviations, improving both stability and robustness. Experimental results on CMU-MOSI, CMU-MOSEI, and CH-SIMS demonstrate that CrossSent consistently surpasses state-of-the-art baselines across key metrics. For instance, it achieves 89.78% binary accuracy and 52.1% seven-class accuracy on CMU-MOSI, 87.72% and 54.7% on CMU-MOSEI, and 80.41%, 62.36%, and 43.54% for three- and five-level CH-SIMS tasks, with reduced mean absolute errors of 0.563, 0.513, and 0.408, respectively. We further report ordinal-consistency measures (QWK and level-jump statistics) to complement conventional metrics and quantify level-wise agreement. These results validate the effectiveness and generalization capability of the proposed framework.
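The abstract does not give the exact formulations of GMRA, MPR, or EIOI, so the following is only a rough sketch of the general ideas behind them: a gated residual fusion step, a margin-based pairwise ranking penalty, and an L1 loss that ignores errors inside a tolerance band. All function names, the `margin` and `eps` parameters, and their default values are assumptions made for illustration, not the paper's definitions.

```python
import math


def gated_residual_fusion(text_vec, other_vec, gate_logits):
    """Add another modality to the text features through a sigmoid gate.

    out[k] = text[k] + sigmoid(gate[k]) * other[k]
    A gate near 0 leaves the text features almost unchanged.
    """
    fused = []
    for t, o, g in zip(text_vec, other_vec, gate_logits):
        gate = 1.0 / (1.0 + math.exp(-g))
        fused.append(t + gate * o)
    return fused


def pairwise_ranking_penalty(y_true, y_pred, margin=0.1):
    """Hinge penalty over all pairs with different gold scores.

    For each such pair, the predictions should be ordered the same
    way as the gold scores, separated by at least `margin`.
    """
    total, pairs = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            diff = y_true[i] - y_true[j]
            if diff == 0:
                continue  # ties carry no ordering information
            sign = 1.0 if diff > 0 else -1.0
            total += max(0.0, margin - sign * (y_pred[i] - y_pred[j]))
            pairs += 1
    return total / pairs if pairs else 0.0


def interval_tolerant_l1(y_true, y_pred, eps=0.25):
    """L1 loss that is zero whenever the error is within +/- eps."""
    losses = [max(0.0, abs(t - p) - eps) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


# Hypothetical values, for illustration only.
gold = [1.0, 2.0, 3.0]
print(pairwise_ranking_penalty(gold, [0.0, 1.0, 2.0]))  # order preserved
print(pairwise_ranking_penalty(gold, [2.0, 1.0, 0.0]))  # order reversed
print(interval_tolerant_l1([0.0, 1.0], [0.1, 2.0]))
print(gated_residual_fusion([1.0, 2.0], [10.0, 10.0], [0.0, 0.0]))
```

In a training setup of this kind, the ranking penalty and the tolerant loss would typically be added to a main regression loss with weighting coefficients; how CrossSent combines its terms is not stated in the abstract.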
