Abstract
Facial Beauty Prediction (FBP) is a long-standing and inherently challenging task in computer vision, primarily because human aesthetic judgment is subjective and multifaceted. Perceived facial beauty is influenced by a combination of global facial harmony, local feature attractiveness, and cultural or individual biases, factors that are difficult to quantify or model explicitly. While deep learning has significantly advanced the field, existing approaches frequently rely on convolutional backbones that emphasize local texture details but fail to capture the subtle, holistic relationships among facial components that shape aesthetic perception. To address these limitations, we introduce TransFBP, a novel Vision Transformer (ViT)-based framework designed for interpretable and human-aligned facial beauty assessment. The proposed model incorporates two major innovations. First, we develop a Cross-Attention Head that acts as a dynamic filter, enabling the network to focus adaptively on semantically meaningful facial areas (such as the eyes, lips, and overall facial symmetry) while suppressing irrelevant background information, thereby offering interpretable insights into what drives the aesthetic predictions. Second, to mitigate overfitting and enhance generalization, we propose Attention-Guided TransMix, a two-stage semantic data augmentation strategy. In the first stage, the method generates challenging hybrid samples by conditionally mixing images from opposite ends of the beauty score distribution, encouraging the model to learn discriminative features across a wide aesthetic spectrum.
In the second stage, the model’s own attention maps are leveraged to generate a semantically grounded supervisory score for each mixed image, ensuring that the augmented samples remain perceptually meaningful and score-consistent. We comprehensively evaluate TransFBP on the FBP5500 dataset, where our method achieves a state-of-the-art Pearson Correlation Coefficient (PCC) of 0.9291, surpassing existing approaches. The strong empirical results validate the effectiveness of our cross-attention mechanism and attention-guided augmentation strategy. Moreover, the interpretability of our attention maps provides valuable transparency into the model’s decision process, paving the way for more explainable, reliable, and ethically aligned AI systems in aesthetic perception tasks. The code will be available at https://github.com/DjameleddineBoukhari/transFBP.

We introduced TransFBP, a novel Transformer-based framework that advances both the performance and interpretability of facial beauty prediction. The proposed approach is characterized by two key innovations: a Cross-Attention Head, which enables the model to dynamically integrate the most salient and contextually relevant facial features, and an Attention-Guided TransMix augmentation strategy, which enhances regularization by generating semantically consistent and challenging training samples.
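To make the second augmentation stage and the reported metric concrete, the following is a minimal sketch and not the authors' released implementation. It assumes that the supervisory score of a mixed image is an attention-weighted blend of the two source scores, where the weight of the second image is the share of attention mass falling on the patches taken from it. The function names `attention_guided_score` and `pearson_cc`, the flat per-patch list layout, and the binary patch mask are all hypothetical choices made for illustration.

```python
from math import sqrt


def attention_guided_score(attn, mask, score_a, score_b):
    """Blend two beauty scores using attention mass (illustrative assumption).

    attn    : non-negative attention weight per patch of the mixed image
    mask    : 1 where the patch comes from image B, 0 where it comes from image A
    score_a : ground-truth score of image A
    score_b : ground-truth score of image B
    """
    if len(attn) != len(mask):
        raise ValueError("attn and mask must have the same length")
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    # Fraction of the attention mass that lands on patches from image B.
    lam_b = sum(a for a, m in zip(attn, mask) if m) / total
    return (1.0 - lam_b) * score_a + lam_b * score_b


def pearson_cc(pred, target):
    """Pearson Correlation Coefficient between predicted and reference scores."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("PCC is undefined for constant sequences")
    return cov / sqrt(var_p * var_t)


if __name__ == "__main__":
    # Four patches; the last two come from image B and receive 70% of the attention.
    mixed = attention_guided_score([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1], 1.5, 4.5)
    print(f"mixed score: {mixed:.2f}")
    print(f"PCC: {pearson_cc([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]):.2f}")
```

Under these assumptions, a mixed image whose attended regions come mostly from the higher-scored face receives a target closer to that face's score, which is one way the augmented labels could stay consistent with what the model actually looks at.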
Boukhari et al. studied this question.