What question did this study set out to answer?

This research aims to develop a more effective model for recognizing emotions from facial images by integrating multiple features.

May 9, 2026 | Open Access

A Multi-scale feature-based Hybrid Deep Focal Network for psychological emotion recognition

Key Points

Developed a Multi-scale Hybrid Deep Focal Network (MHFNet) combining CNN and transformer architectures.
Implemented an improved Focal Loss function to address class imbalance issues.
Evaluated the model on a public benchmark dataset for facial emotion recognition.
Achieved an accuracy of 0.690, a recall of 0.676, and an F1-score of 0.672.
Outperformed existing methods in overall performance and robustness across various classes.
Demonstrated effectiveness and feasibility in complex facial expression recognition scenarios.
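
The reported accuracy, recall, and F1-score can diverge noticeably when classes are imbalanced. The sketch below is one way such metrics might be computed. It assumes macro averaging over classes, which the summary does not state, and the function name `macro_metrics` is hypothetical rather than taken from the paper.

```python
def macro_metrics(y_true, y_pred, num_classes):
    """Return (accuracy, macro recall, macro F1) for integer class labels.

    Macro averaging is an assumption here; the source does not say
    how recall and F1 were averaged across emotion classes.
    """
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    recalls, f1_scores = [], []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        f1_scores.append(f1)

    return accuracy, sum(recalls) / num_classes, sum(f1_scores) / num_classes


# Toy imbalanced example: a classifier that always predicts the majority class.
acc, rec, f1 = macro_metrics([0, 0, 0, 0, 1, 2], [0, 0, 0, 0, 0, 0], num_classes=3)
```

In this toy case accuracy stays fairly high while macro recall and macro F1 drop sharply, which is why class-wise metrics are usually reported alongside accuracy on imbalanced emotion datasets.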

Abstract

With the rapid development of artificial intelligence technologies, facial image-based emotion recognition methods are increasingly becoming powerful auxiliary tools in psychological research. These methods offer feasible approaches for applications such as emotion quantification, psychological assessment, and clinical intervention. However, despite the significant progress deep learning has made in facial emotion recognition in recent years, existing methods still face several limitations. Many studies rely on a single model architecture, such as Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs), which may introduce biases in feature modeling. Moreover, publicly available facial emotion recognition datasets commonly suffer from imbalanced class distributions. To address insufficient expression feature modeling and performance degradation under class imbalance, this paper proposes a Multi-scale Hybrid Deep Focal Network (MHFNet) for psychological emotion recognition. The proposed model integrates multi-scale features extracted by both CNN-based and transformer-based architectures, and incorporates an improved Focal Loss function to enhance robustness against class imbalance. Experimental results on a public benchmark dataset show that MHFNet achieves an accuracy of 0.690, a recall of 0.676, and an F1-score of 0.672. It outperforms existing mainstream methods in both overall performance and class-wise robustness, which supports its effectiveness and feasibility in complex facial expression recognition scenarios.
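
For readers unfamiliar with Focal Loss, the following is a minimal sketch of the standard multi-class formulation, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). It is not the improved variant used in MHFNet, whose details the abstract does not give. The function names and the default gamma = 2.0 are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=None):
    """Standard focal loss for one sample (not the paper's improved version).

    logits: raw class scores; target: index of the true class.
    gamma:  focusing parameter; gamma = 0 reduces to cross-entropy.
    alpha:  optional per-class weights (assumed here, e.g. to up-weight rare classes).
    """
    p_t = softmax(logits)[target]
    weight = 1.0 if alpha is None else alpha[target]
    # (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(logits, target):
    """Plain cross-entropy, expressed as focal loss with gamma = 0."""
    return focal_loss(logits, target, gamma=0.0)


easy = [4.0, 0.0, 0.0]  # true class 0 predicted confidently
hard = [0.0, 4.0, 0.0]  # true class 0 predicted poorly
easy_ratio = focal_loss(easy, 0) / cross_entropy(easy, 0)
hard_ratio = focal_loss(hard, 0) / cross_entropy(hard, 0)
```

Here `easy_ratio` is far smaller than `hard_ratio`: the modulating factor down-weights samples the model already classifies well, so training focuses on hard samples, which under class imbalance tend to come from under-represented classes.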
