What question did this study set out to answer?

This research aims to improve the accuracy and explainability of facial beauty prediction using a novel vision transformer framework.

May 7, 2026 | Open Access

Enhancing Facial Beauty Prediction with a Cross-Attention Vision Transformer and Attention-Guided Augmentation

Key Points

Developed a Cross-Attention Head to enhance focus on meaningful facial features.
Proposed Attention-Guided TransMix for data augmentation to create semantically coherent training samples.
Evaluated on the FBP5500 dataset, achieving a Pearson Correlation Coefficient (PCC) of 0.9291 and outperforming existing models.
Demonstrated enhanced model interpretability through attention maps.
Validated the effectiveness of both the cross-attention mechanism and attention-guided data augmentation.

Abstract

Facial Beauty Prediction (FBP) is a long-standing and inherently challenging task in computer vision, primarily because human aesthetic judgment is subjective and multifaceted. Perceived facial beauty is influenced by a combination of global facial harmony, local feature attractiveness, and cultural or individual biases, all of which are difficult to quantify or model explicitly. While deep learning has significantly advanced the field, existing approaches frequently rely on convolutional backbones that emphasize local texture details but fail to capture the subtle, holistic relationships among facial components that define aesthetic perception. To address these limitations, we introduce TransFBP, a novel Vision Transformer (ViT)-based framework designed for interpretable and human-aligned facial beauty assessment. The proposed model incorporates two major innovations. First, we develop a Cross-Attention Head that acts as a dynamic filter, enabling the network to focus adaptively on the most semantically meaningful facial areas (such as the eyes, lips, and overall symmetry) while suppressing irrelevant background information, thereby offering interpretable insights into what drives the aesthetic predictions. Second, to mitigate overfitting and enhance generalization, we propose Attention-Guided TransMix, a two-stage semantic data augmentation strategy. In the first stage, the method generates challenging hybrid samples by conditionally mixing images from opposite ends of the beauty score distribution, encouraging the model to learn discriminative features across a wide aesthetic spectrum. In the second stage, the model's own attention maps are used to generate a semantically grounded supervisory score for each mixed image, ensuring that the augmented samples remain perceptually meaningful and score-consistent. We comprehensively evaluate TransFBP on the FBP5500 dataset, where our method achieves a state-of-the-art Pearson Correlation Coefficient (PCC) of 0.9291, surpassing existing approaches. The strong empirical results validate the effectiveness of our cross-attention mechanism and attention-guided augmentation strategy. Moreover, the interpretability of our attention maps provides valuable transparency into the model's decision process, paving the way for more explainable, reliable, and ethically aligned AI systems in aesthetic perception tasks. The code will be available at https://github.com/DjameleddineBoukhari/transFBP.

We introduced TransFBP, a novel Transformer-based framework that advances both the performance and interpretability of facial beauty prediction. The proposed approach is characterized by two key innovations: a Cross-Attention Head, which enables the model to dynamically integrate the most salient and contextually relevant facial features, and an Attention-Guided TransMix augmentation strategy, which enhances regularization by generating semantically consistent and challenging training samples.
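As a rough illustration of the two ideas summarized in the abstract, the following NumPy sketch shows (a) a single query attending over patch tokens, and (b) a two-stage mix in which samples from opposite ends of the score range are paired and the mixed label is weighted by attention mass. It is not the authors' implementation: the function names (`cross_attention_pool`, `pair_opposite_ends`, `mix_patches`, `attention_guided_score`), the omission of learned projections, the sort-based pairing rule, and the linear attention-weighted label are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_attention_pool(query, tokens):
    """One query vector attends over patch tokens (single head).

    Assumption: no learned key/value projections, to keep the sketch short.
    Returns the attention-weighted feature and the per-patch weights.
    """
    d = tokens.shape[1]
    attn = softmax(tokens @ query / np.sqrt(d))
    return attn @ tokens, attn

def pair_opposite_ends(scores):
    """Stage 1 (assumed rule): pair the lowest-scored samples with the highest."""
    order = np.argsort(scores)
    half = len(order) // 2
    lows, highs = order[:half], order[::-1][:half]
    return [(int(lo), int(hi)) for lo, hi in zip(lows, highs)]

def mix_patches(patches_a, patches_b, mask):
    """Build a hybrid sample: take patch i from B where mask[i] is True, else from A."""
    mask = np.asarray(mask, dtype=bool)
    return np.where(mask[:, None], patches_b, patches_a)

def attention_guided_score(score_a, score_b, attn, mask):
    """Stage 2 (assumed rule): weight each source score by the attention
    mass that falls on the patches contributed by that source."""
    attn = np.asarray(attn, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    attn = attn / attn.sum()
    weight_b = attn[mask].sum()
    return (1.0 - weight_b) * score_a + weight_b * score_b
```

In this sketch the supervisory score of a mixed image depends on where the attention falls, not only on how many patches each source contributes, which is the intuition the abstract describes for keeping augmented samples score-consistent.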
