March 3, 2026 | Open Access

Multimodal Emotion Recognition: A Comprehensive Approach Using Text and Audio Modalities

Key Points

The Adaptive Hierarchical Attention Fusion (AHAF) model improved the weighted F1-score by up to 7.8% over two state-of-the-art baselines.
Weighted average F1-score was used to evaluate performance across six emotions from the IEMOCAP dataset.
The study introduced a Reliability Estimation Component to dynamically assess modality contributions.
Hierarchical Attention Components enhanced the model's ability to capture intra-modal and cross-modal dependencies.
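
To make the evaluation metric concrete, the following is a minimal sketch of a weighted average F1-score: the per-class F1 is computed for each emotion and then averaged with weights proportional to how often each class occurs in the ground truth. The function name `weighted_f1` and the emotion labels in the usage note are illustrative assumptions, not taken from the study.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (illustrative sketch)."""
    support = Counter(y_true)          # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1      # weight each class by its support
    return score
```

For example, `weighted_f1(["happy", "happy", "sad", "angry"], ["happy", "sad", "sad", "angry"])` weights the "happy" class twice as heavily as the other two because it appears twice in the ground truth. Weighting by support keeps frequent emotions from being under-represented in the score when class sizes are unbalanced.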

Abstract

Emotions are fundamental to human communication and decision-making, shaping interactions, behaviors, and social bonds. Accurately recognizing emotions is critical in domains such as healthcare, education, and human-computer interaction. Multimodal Emotion Recognition (MER) enhances emotion classification by leveraging diverse data sources, or modalities, such as text and audio. Within the broader field of multimodal machine learning (MML), five foundational concepts guide development: representation, co-learning, fusion, translation, and alignment. Among these, fusion is the central focus of this thesis, as it plays a critical role in integrating heterogeneous information streams into a unified representation to improve understanding and prediction.

Despite significant progress, current MER approaches face two critical limitations. First, the reliability and contribution of different modalities vary considerably, yet most models treat modalities equally. Without adaptive mechanisms, noisy or unreliable modalities can dominate, leading to unstable or biased predictions in real-world applications. Second, existing fusion strategies are often shallow, limiting the model's ability to capture intra-modal dependencies (within a single modality) and cross-modal interactions (across modalities).

This thesis addressed these challenges through two key contributions. A Reliability Estimation Component is introduced to dynamically assess the reliability of each modality on a per-instance, per-emotion basis, allowing adaptive weighting that mitigates the effect of noisy inputs and of unequal modality contributions. In parallel, a Hierarchical Attention Component is proposed to capture both intra-modal and cross-modal dependencies, enabling richer and context-aware fusion.

The thesis focused on six emotions within the IEMOCAP dataset. Text features are derived from RoBERTa, while audio features are extracted using Wav2Vec. Self-attention is applied to capture intra-modality dependencies, while cross-attention models cross-modality interactions. The experimental design followed a leave-one-session-out cross-validation protocol to ensure speaker independence, with weighted average F1-score serving as the primary evaluation metric.

Experimental findings showed that the proposed Adaptive Hierarchical Attention Fusion (AHAF) model surpasses two state-of-the-art baselines, achieving weighted F1-score improvements ranging from 0.05% to 7.8%. Ablation studies confirmed that the reliability estimation and hierarchical attention components are key drivers of performance gains, validating their role in building more accurate and generalizable emotion recognition models.
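
The idea of per-instance, per-emotion reliability weighting can be illustrated with a small sketch. It is not the thesis's Reliability Estimation Component: here the reliability scores are simply passed in as numbers, whereas in the described model they would presumably be produced by a learned component. The names `fuse`, `text_rel`, and `audio_rel` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a short list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def fuse(text_logits, audio_logits, text_rel, audio_rel):
    """Fuse per-emotion logits from two modalities for ONE instance.

    All four arguments are lists with one entry per emotion class.
    For each emotion, the two reliability scores (assumed inputs) are
    normalized across modalities and used as mixing weights.
    """
    fused = []
    for t, a, rt, ra in zip(text_logits, audio_logits, text_rel, audio_rel):
        w_text, w_audio = softmax([rt, ra])   # weights sum to 1 per emotion
        fused.append(w_text * t + w_audio * a)
    return fused
```

With equal reliability scores the result reduces to a plain average of the two modalities, which corresponds to the "treat modalities equally" behavior the abstract criticizes. When one modality's score is much larger for a given emotion, the fused logit for that emotion follows that modality almost entirely, so a noisy input has little influence.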

Authors

Wael Aburizq
