March 3, 2026 · Open Access

Sentiment Intensity Contrastive Text‐Enhanced Fusion Network

Key Points

The model outperforms state-of-the-art methods in accuracy and mean absolute error.
The study reports a 10% increase in F1 score and seven-class accuracy in dataset evaluations.
Feature enhancement and self-attention mechanisms significantly improve the model's effectiveness on noisy data.
Combining sentiment intensity learning with text and non-text data strengthens the approach and suggests potential scalability to broader applications.

Abstract

Multimodal sentiment analysis (MSA) has recently encountered two major challenges: non‐textual modalities are often affected by noise, and sentiment intensity differences are difficult to capture. To address these issues, we propose a Sentiment Intensity Contrastive Text‐Enhanced Fusion Network (SICTEF Net), which achieves deep collaboration among text, audio, and visual modalities through three key mechanisms. First, a grouped‐channel‐attention based Feature Enhancement Module (EMA) is designed to mitigate modality‐specific noise and emphasize emotion‐sensitive cues by combining spatial–channel interaction mapping with dual‐branch attention fusion. Second, a text‐centered cross‐modal fusion mechanism is introduced, where bidirectional multi‐head self‐attention and a residual‐enhanced encoder jointly enable complementary mappings between text and non‐text modalities, thereby producing intermediate representations that preserve semantic primacy while incorporating fine‐grained complementary information. Third, a sentiment‐intensity weighted contrastive learning strategy dynamically assigns weights to positive and negative sample pairs according to their sentiment intensity differences, allowing the model to more precisely distinguish samples with varying degrees of similarity in the embedding space. Experimental evaluation on the CMU‐MOSI and CMU‐MOSEI datasets demonstrates that SICTEF Net consistently outperforms state‐of‐the‐art baselines in binary accuracy, F1 score, seven‐class accuracy, mean absolute error (MAE), and Pearson correlation. Comprehensive ablation studies further confirm the complementary benefits of EMA, the text‐enhanced Transformer, and sentiment‐intensity contrastive learning. These results indicate that combining text‐driven deep interaction, non‐text modality enhancement via channel attention, and contrastive learning can improve the accuracy and robustness of multimodal sentiment analysis.
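
The abstract does not state the loss formula, so the following is only a minimal sketch of what a sentiment‐intensity weighted contrastive objective could look like, not the authors' formulation. It assumes continuous sentiment labels in [-3, 3] (the range used by CMU‐MOSI and CMU‐MOSEI), and it assumes a linear weight that decays with the label gap inside a softmax‐based contrastive loss. The function name, the `temperature` and `max_gap` parameters, and the weighting rule are all hypothetical choices made for illustration.

```python
import numpy as np


def intensity_weighted_contrastive_loss(embeddings, labels, temperature=0.1, max_gap=6.0):
    """Illustrative loss (an assumption, not the paper's definition).

    Each pair (i, j) is weighted by 1 - |y_i - y_j| / max_gap, so pairs with
    similar sentiment intensity are pulled together more strongly, while pairs
    with a large intensity gap mostly act as negatives in the softmax.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    y = np.asarray(labels, dtype=np.float64)
    n = z.shape[0]

    # L2-normalize so the dot product is cosine similarity.
    norms = np.maximum(np.linalg.norm(z, axis=1, keepdims=True), 1e-12)
    z = z / norms

    # Temperature-scaled similarities, with self-pairs masked out.
    off_diag = ~np.eye(n, dtype=bool)
    sim = np.where(off_diag, z @ z.T / temperature, -np.inf)

    # Numerically stable row-wise log-softmax over the other samples.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = np.where(off_diag, sim - row_max - log_norm, 0.0)

    # Pair weights from the sentiment-intensity gap (hypothetical linear rule).
    gap = np.abs(y[:, None] - y[None, :])
    weights = np.clip(1.0 - gap / max_gap, 0.0, 1.0) * off_diag

    # Weighted average of negative log-probabilities per anchor, then batch mean.
    per_anchor = -(weights * log_prob).sum(axis=1) / np.maximum(weights.sum(axis=1), 1e-12)
    return float(per_anchor.mean())


if __name__ == "__main__":
    labels = [3.0, 2.5, -3.0, -2.5]
    aligned = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.9, -0.1]]
    print(intensity_weighted_contrastive_loss(aligned, labels))
```

In this sketch, an embedding space where samples of similar intensity sit close together yields a lower loss than one where they are scattered, which is the behaviour the abstract describes qualitatively.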

Authors

Heng Jiang

Lianke Shi

Deyu Kong

Journals

Concurrency and Computation: Practice and Experience

Institutions

Henan University
