Convolutional neural networks (CNNs) are widely used for acne image classification because they capture the local texture of skin lesions effectively. However, the locality of the convolution operation limits their ability to model long-range dependencies. Vision Transformer (ViT) methods partly address this limitation, but their high computational complexity and reliance on large-scale pre-training present challenges of their own. Although hybrid CNN–Transformer architectures alleviate this trade-off to some extent, acne images pose task-specific challenges, including indistinct lesion boundaries, subtle inter-class variations, and various facial interference factors. In this paper, we propose AcneFormer, a lesion-aware and noise-robust CNN–Transformer architecture for acne image classification. We introduce three modules designed specifically for acne tasks: a Lesion Cue Enhancement (LCE) module that highlights discriminative multi-scale spatial patterns, a Cross-Layer Feature Transmission (CLFT) module that enhances cross-layer information flow in the Transformer, and a Differential Semantic Denoising (DSD) module that suppresses irrelevant responses during deep feature interaction. Extensive experiments show that AcneFormer outperforms several strong baselines. Ablation studies and external lesion-annotated analyses further reveal a consistent pattern: LCE mainly improves lesion-sensitive localization and class-balanced recognition, CLFT expands valid cross-depth lesion evidence, and DSD suppresses off-lesion semantic responses.
Zhou et al. (Mon,) studied this question.