Traditional logits-based knowledge distillation methods typically apply a temperature-scaled softmax function to the logits, aiming for smooth matching. However, few have questioned whether this operation is necessary or reasonable. In this paper, through theoretical analysis and motivational experiments, we show that softmax causes the student’s features to diverge from the teacher’s representation space. To address this issue, we propose PurE logit distillation via multi-grAnularity Knowledge transfer (PEAK), a simple but effective approach to knowledge distillation. Instead of relying on softmax, PEAK directly aligns the original logits (termed pure logits) of the teacher and student models through a scale-invariant normalization module. By leveraging the softmax-free “PEAK” operator, our method achieves pure matching, naturally aligning the mean and standard deviation of the teacher and student logits. Furthermore, our consistency augmentation mechanisms preserve relative scales at multiple granularities, including inter-class, intra-class, and global-class relationships. Extensive experiments across various tasks (image classification, object detection, and semantic segmentation) and architectures (Convolutional Neural Networks and Vision Transformers) demonstrate that our PEAK attains PEAK performance. For instance, on the CIFAR-100 and ImageNet classification tasks, our approach achieves average performance improvements of 0.24 and 0.27 percentage points, respectively, over state-of-the-art methods, highlighting its stability and superiority. Comprehensive ablation and extension studies further validate the effectiveness of the proposed schemes.
Xia et al. (Wed,) studied this question.
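The abstract describes a softmax-free operator that aligns the mean and standard deviation of teacher and student logits through scale-invariant normalization, but it does not give the operator's exact form. The following is only a minimal sketch under stated assumptions: per-sample z-score standardization followed by a mean squared error, with the hypothetical names `standardize` and `pure_logit_loss`. It illustrates the scale-invariance idea and is not the authors' PEAK operator, and it omits the multi-granularity consistency terms.

```python
import math


def standardize(logits, eps=1e-7):
    """Z-score one sample's logits: zero mean, (approximately) unit std.

    Assumption: z-scoring stands in for the scale-invariant normalization
    named in the abstract; the paper's actual module may differ.
    """
    n = len(logits)
    mean = sum(logits) / n
    var = sum((z - mean) ** 2 for z in logits) / n
    std = math.sqrt(var)
    return [(z - mean) / (std + eps) for z in logits]


def pure_logit_loss(student_logits, teacher_logits):
    """Mean squared error between standardized logits, with no softmax.

    Assumption: MSE is an illustrative choice of matching loss.
    """
    s = standardize(student_logits)
    t = standardize(teacher_logits)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


teacher = [2.0, 0.5, -1.0, 0.1]

# Same relative structure as the teacher, but rescaled and shifted.
student_scaled = [3.0 * z + 5.0 for z in teacher]

# Same values in a different class order, so the relationships differ.
student_other = [0.1, 2.0, 0.5, -1.0]

print(pure_logit_loss(student_scaled, teacher))  # close to zero
print(pure_logit_loss(student_other, teacher))   # clearly larger
```

In this sketch, a student whose logits are an affine rescaling of the teacher's incurs almost no loss, because only the standardized shape of the logit vector is compared. A student that orders the classes differently is penalized even though its raw logits have the same mean and standard deviation.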