April 10, 2026

Multimodal Sentiment Analysis Based on Dynamic Language Enhancement and Synergistic Cross-Modal Transformer

Key Points

The study aims to improve multimodal sentiment analysis by making better use of linguistic semantics and reducing redundancy in the visual and acoustic modalities.
Proposed a dynamic language enhancement network (LEN) for feature extraction.
Used a guided attention mechanism so that visual and acoustic features capture contextual cues from multigranularity language representations.
Developed a synergistic cross-modal Transformer (SCT) for feature interaction among modalities.
Implemented a bimodal generator that produces composite features for each pair of modalities and transfers them to the complementary unimodal features.
Conducted experiments on three MSA benchmark datasets for performance evaluation.
Achieved 86.43% accuracy on CMU-MOSI, 86.38% on CMU-MOSEI, and 81.35% on CH-SIMS.
Demonstrated superior performance compared to state-of-the-art methods in multimodal sentiment analysis.

Abstract

Multimodal sentiment analysis (MSA) is a challenging task that utilizes verbal, visual, and acoustic cues to infer human sentiment and has garnered substantial research attention in recent years. However, due to the diversity of multimodal data, current MSA methods often fail to adequately leverage the rich semantic knowledge present in the linguistic modality while also overlooking the issue of informational redundancy within the visual and auditory modalities. In addition, intermodal heterogeneity and spurious cross-modal interactions pose huge challenges for effective multimodal fusion. To address these issues, we propose an MSA approach based on dynamic linguistic enhancement and a synergistic cross-modal Transformer (LESCT). Our LESCT constructs a dynamic language enhancement network (LEN) for feature extraction. The proposed LEN enables visual and auditory features to dynamically capture contextual cues from multigranularity language representations via a guided attention mechanism, thereby mitigating intramodal redundancy and noise interference. On this basis, the LESCT builds a new synergistic cross-modal Transformer (SCT) and a local-to-composite multimodal fusion strategy. The SCT network employs a bimodal generator to produce composite features for each pair of modalities, transferring the composite information from the bimodal features to complementary unimodal features to facilitate rich intermodal and intramodal interaction. Extensive experiments were performed on three popular MSA benchmark datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS. The overall accuracy of our LESCT is 86.43% on CMU-MOSI, 86.38% on CMU-MOSEI, and 81.35% on CH-SIMS. Experimental results demonstrate that our proposed LESCT is superior to the state-of-the-art (SOTA) methods. The code is available at https://github.com/jhwvh/LESCT.
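
The abstract does not spell out how the guided attention in the LEN is computed. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention in which visual (or acoustic) frames act as queries over language token vectors. It is not the authors' implementation: the function names `softmax` and `guided_attention`, the absence of learned projections, and the toy inputs are all assumptions made for this example. The authors' code is at the repository linked above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guided_attention(queries, language):
    """Language-guided cross-attention without learned projections (assumption).

    queries:  list of d-dimensional vectors (e.g. visual or acoustic frames)
    language: list of d-dimensional vectors (e.g. token representations)
    Returns (outputs, weights): one language-context vector and one
    attention distribution per query.
    """
    d = len(language[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this frame to every language token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in language]
        w = softmax(scores)
        # Weighted sum of language vectors: the context the frame "reads".
        context = [
            sum(w[j] * language[j][i] for j in range(len(language)))
            for i in range(d)
        ]
        outputs.append(context)
        weights.append(w)
    return outputs, weights


# Toy inputs (hypothetical): 2 visual frames attending over 3 language tokens, d = 2.
visual = [[1.0, 0.0], [0.0, 1.0]]
language = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
context, attn = guided_attention(visual, language)
```

In this toy setup each frame puts the most weight on the language token it aligns with best, which is the general idea of letting the language modality steer what the other modalities keep. A real model would add learned query, key, and value projections and multiple heads.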

Authors

Linqin Cai

Daohong Liu

Lanrui Liu

Journals

IEEE Transactions on Neural Networks and Learning Systems

Institutions

Chongqing University of Posts and Telecommunications
