What question did this study set out to answer?

The study aims to develop CrossMF, a transformer-based model for recognizing emotions in speech and text data.

February 14, 2026

CrossMF: A Memory-Augmented Transformer for Fast and Generalizable Emotion Recognition

Key points

Developed CrossMF with a memory-augmented architecture for emotion detection.
Used a two-step training process: the acoustic encoder is trained on TESS, and the text encoder and fusion module are trained on MELD.
Implemented a dynamic attention mechanism to integrate audio and text inputs.
Evaluated the model across different input scenarios: audio-only, text-only, and combined.
Achieved a maximum validation accuracy of 97.68%.
Demonstrated robust performance across laboratory and natural conversational settings.
Showed generalization capability without relying on predefined features.

Abstract

The study introduces CrossMF, a unified transformer-based model that performs emotion recognition from speech and text data. CrossMF combines dynamic attention between modalities with a memory-augmentation mechanism to fuse textual and acoustic information. Training follows a two-step strategy: the acoustic encoder is optimized on clean audio from the Toronto emotional speech set (TESS), while the text encoder and fusion module are trained on audio-text pairs from the Multimodal EmotionLines Dataset (MELD). Because modality access is adjustable, the system can predict emotions from three input configurations: audio only, text only, and audio combined with text. Fusion is applied only when both inputs are available, so each modality's representation remains intact until integration. The model is evaluated across all three input scenarios, and the results indicate that it generalizes well to both laboratory-recorded and natural conversational conditions. It reaches a maximum validation accuracy of 97.68% and maintains consistently high test performance, demonstrating its effectiveness under varied conditions without relying on hand-crafted features. The architecture also supports future extensions, allowing developers to incorporate ablation studies and adaptive training strategies for real-world emotion-aware systems.
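The modality routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fuse_modalities`, the `query` vector, and the softmax weighting over two modality embeddings are all assumptions chosen for illustration, standing in for whatever dynamic attention CrossMF actually uses. The only behavior taken from the abstract is that a single available modality passes through unchanged and fusion happens only when both are present.

```python
import math
from typing import List, Optional, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_modalities(
    audio: Optional[Sequence[float]],
    text: Optional[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], str]:
    """Return (embedding, mode) for whichever modalities are available.

    Hypothetical stand-in for a dynamic attention fusion step:
    - one modality present: its embedding is returned untouched;
    - both present: a softmax over query-embedding dot products
      yields per-modality weights for a weighted sum.
    """
    if audio is None and text is None:
        raise ValueError("at least one modality is required")
    if text is None:
        return list(audio), "audio-only"
    if audio is None:
        return list(text), "text-only"
    if len(audio) != len(text):
        raise ValueError("audio and text embeddings must share a dimension")

    # Assumed scoring rule: relevance of each modality to the query vector.
    w_audio, w_text = softmax([dot(query, audio), dot(query, text)])
    fused = [w_audio * a + w_text * t for a, t in zip(audio, text)]
    return fused, "audio+text"
```

For example, `fuse_modalities([1.0, 0.0], None, [1.0, 1.0])` returns the audio embedding as-is with the mode `"audio-only"`, while passing both embeddings produces a weighted blend. In a real model the embeddings would come from the acoustic and text encoders and the weights would be learned rather than derived from a fixed query.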

Journals

International Journal of Pattern Recognition and Artificial Intelligence

Institutions

Twitter (United States)
