Facial image-based emotion recognition is becoming a useful auxiliary tool in psychological research, supporting applications such as emotion quantification, psychological assessment, and clinical intervention. Although deep learning has substantially advanced facial emotion recognition in recent years, existing methods still have limitations. Many studies rely on a single model architecture, such as a Convolutional Neural Network (CNN) or a Vision Transformer (ViT), which may bias feature modeling. Moreover, publicly available facial emotion recognition datasets commonly have imbalanced class distributions. To address insufficient expression feature modeling and performance degradation under class imbalance, this paper proposes a Multi-scale Hybrid Deep Focal Network (MHFNet) for psychological emotion recognition. The model integrates multi-scale features extracted from both CNN-based and transformer-based architectures and incorporates an improved Focal Loss function to improve robustness to class imbalance. On a public benchmark dataset, MHFNet achieves an accuracy of 0.690, a recall of 0.676, and an F1-score of 0.672, outperforming existing mainstream methods in both overall performance and class-wise robustness, which supports its effectiveness and feasibility in complex facial expression recognition scenarios.
Peilin Chen (Fri,) studied this question.
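
The abstract refers to an improved Focal Loss but does not specify the modification. As a point of reference only, the following minimal sketch shows the standard multi-class focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), for a single sample. It is not the MHFNet loss: the function names `softmax` and `focal_loss`, the default `gamma=2.0`, and the optional per-class `alpha` weights are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=None):
    """Standard focal loss for one sample (illustrative, not the paper's variant).

    logits: list of raw class scores
    target: index of the true class
    gamma:  focusing parameter; gamma = 0 reduces to cross-entropy
    alpha:  optional list of per-class weights (assumed, e.g. larger for rare classes)
    """
    p_t = softmax(logits)[target]
    weight = 1.0 if alpha is None else alpha[target]
    # The factor (1 - p_t) ** gamma shrinks the loss of well-classified samples,
    # so hard and minority-class samples contribute relatively more.
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct sample is down-weighted much more than a hard one.
easy = focal_loss([4.0, 0.0, 0.0], target=0)
hard = focal_loss([0.0, 1.0, 0.0], target=0)
print(easy < hard)
```

With `gamma=0` and no `alpha`, the function returns the ordinary cross-entropy, which makes it easy to compare the two losses on the same predictions.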