This study introduces CrossMF, a unified transformer-based model for emotion recognition from speech and text. CrossMF combines dynamic cross-modal attention with a memory enhancement mechanism to fuse textual and acoustic information. Training proceeds in two stages: the acoustic encoder is optimized on clean audio from the Toronto Emotional Speech Set (TESS), while the text encoder and fusion module are trained on audio-text pairs from the Multimodal EmotionLines Dataset (MELD). Because modality access is adjustable, the system can predict emotions from three input configurations: audio only, text only, and audio with text. Fusion is applied only when both inputs are available, so each modality's representation remains intact until integration. The model is evaluated under all three input configurations, and the results indicate that it generalizes well across both laboratory-recorded and natural conversational conditions. It reaches a maximum validation accuracy of 97.68% and sustains high test performance, which indicates that it is effective under varied conditions without relying on handcrafted features. The architecture also supports future extensions, such as ablation studies and adaptive training strategies for real-world emotion-aware systems.
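The adjustable modality access described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `route_inputs` and `placeholder_fuse` are hypothetical, embeddings are plain lists of floats, and an element-wise average stands in for the cross-modal attention and memory module, whose details the abstract does not give. The only behaviour taken from the text is that fusion runs solely when both modalities are present.

```python
from typing import List, Optional, Tuple

Embedding = List[float]


def placeholder_fuse(audio: Embedding, text: Embedding) -> Embedding:
    """Hypothetical stand-in for the fusion module (assumption, not CrossMF's
    attention): element-wise average of two equal-length embeddings."""
    if len(audio) != len(text):
        raise ValueError("embeddings must have the same length")
    return [(a + t) / 2.0 for a, t in zip(audio, text)]


def route_inputs(
    audio: Optional[Embedding], text: Optional[Embedding]
) -> Tuple[str, Embedding]:
    """Pick the input configuration and return (mode, representation).

    Fusion is applied only when both modalities are available; otherwise the
    single available embedding is passed through unchanged.
    """
    if audio is not None and text is not None:
        return "audio+text", placeholder_fuse(audio, text)
    if audio is not None:
        return "audio-only", audio
    if text is not None:
        return "text-only", text
    raise ValueError("at least one modality is required")


if __name__ == "__main__":
    print(route_inputs([1.0, 3.0], None))
    print(route_inputs(None, [0.5, 0.5]))
    print(route_inputs([1.0, 3.0], [3.0, 5.0]))
```

In a real model the returned representation would feed a shared emotion classifier, so the same prediction head serves all three input configurations.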
International Journal of Pattern Recognition and Artificial Intelligence
Twitter (United States)
Tirumanadham et al. (Thu,) studied this question.