Emotions are fundamental to human communication and decision-making, shaping interactions, behaviors, and social bonds. Accurately recognizing emotions is critical in domains such as healthcare, education, and human-computer interaction. Multimodal Emotion Recognition (MER) enhances emotion classification by leveraging diverse data sources, or modalities, such as text and audio. Within the broader field of multimodal machine learning (MML), five foundational concepts guide development: representation, co-learning, fusion, translation, and alignment. Among these, fusion is the central focus of this thesis, as it plays a critical role in integrating heterogeneous information streams into a unified representation to improve understanding and prediction.

Despite significant progress, current MER approaches face several critical limitations. First, the reliability and contribution of different modalities vary considerably, yet most models treat all modalities equally. Without adaptive mechanisms, noisy or unreliable modalities can dominate, leading to unstable or biased predictions in real-world applications. Second, existing fusion strategies are often shallow, limiting the model's ability to capture intra-modal dependencies (within a single modality) and cross-modal interactions (across modalities).

This thesis addresses these challenges through two key contributions. A Reliability Estimation Component is introduced to dynamically assess the reliability of each modality on a per-instance, per-emotion basis, allowing adaptive weighting of each modality's contribution that mitigates the effect of noisy inputs. In parallel, a Hierarchical Attention Component is proposed to capture both intra-modal and cross-modal dependencies, enabling richer, context-aware fusion.

The thesis focuses on six emotions within the IEMOCAP dataset. Text features are derived from RoBERTa, while audio features are extracted using Wav2Vec. Self-attention is applied to capture intra-modality dependencies, while cross-attention models cross-modality interactions. The experimental design follows a leave-one-session-out cross-validation protocol to ensure speaker independence, with the weighted-average F1-score serving as the primary evaluation metric.

Experimental findings show that the proposed Adaptive Hierarchical Attention Fusion (AHAF) model surpasses two state-of-the-art baselines, achieving weighted F1-score improvements ranging from 0.05% to 7.8%. Ablation studies confirm that the reliability estimation and hierarchical attention components are key drivers of the performance gains, validating their role in building more accurate and generalizable emotion recognition models.
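
The evaluation protocol named in the abstract can be illustrated with a minimal sketch. It is not the thesis's code: the function names `leave_one_session_out` and `weighted_f1`, the toy session ids, and the toy labels are all assumptions made for illustration. The sketch only shows the general idea of holding out one recording session per fold and averaging per-class F1 weighted by class support.

```python
from collections import defaultdict


def leave_one_session_out(sessions):
    """Yield (held_out, train_idx, test_idx), holding out each session once.

    `sessions` is a list giving the session id of every sample (hypothetical
    input format; the thesis does not specify one).
    """
    for held_out in sorted(set(sessions)):
        train_idx = [i for i, s in enumerate(sessions) if s != held_out]
        test_idx = [i for i, s in enumerate(sessions) if s == held_out]
        yield held_out, train_idx, test_idx


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    if not y_true:
        return 0.0
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    total = 0.0
    for label in set(y_true):
        support = tp[label] + fn[label]
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        total += support * f1
    return total / len(y_true)


if __name__ == "__main__":
    # Toy data: five samples drawn from three sessions.
    for held_out, train_idx, test_idx in leave_one_session_out([1, 1, 2, 3, 3]):
        print(held_out, train_idx, test_idx)
    print(weighted_f1(["hap", "hap", "sad", "ang"], ["hap", "sad", "sad", "ang"]))
```

Because every sample from the held-out session is excluded from training, no speaker in the test fold is seen during training, provided speakers do not appear in more than one session. Weighting each class's F1 by its support keeps frequent emotions from being under-counted when the class distribution is imbalanced.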
Wael Aburizq
Wael Aburizq (Thu,) studied this question.