What question did this study set out to answer?

The aim is to re-evaluate the necessity of softmax in knowledge distillation and propose a more effective alternative.

April 12, 2026 · Open Access

PEAK: Pure Logit Distillation via Multi-granularity Knowledge Transfer

Key Points

The aim is to re-evaluate the necessity of softmax in knowledge distillation and propose a more effective alternative.
Theoretical analysis and motivational experiments were conducted to assess the softmax function in logits-based frameworks.
A novel method, PEAK, directly aligns teacher and student model logits without standard softmax.
Consistency augmentation mechanisms preserve multi-granularity relative scales (inter-class, intra-class, and global-class relationships) in the distilled knowledge.
PEAK improved average performance over state-of-the-art methods by 0.24 and 0.27 percentage points on CIFAR-100 and ImageNet classification, respectively.
Extensive evaluations across tasks and architectures showed improved stability and effectiveness over existing state-of-the-art methods.
Ablation studies confirmed the robustness of the proposed normalization approach.

Abstract

Traditional logits-based knowledge distillation methods typically apply a temperature-scaled softmax function to the logits, aiming for smooth matching. However, few have questioned whether this step is necessary or reasonable. In this paper, through theoretical analysis and motivational experiments, we show that softmax causes the student’s features to diverge from the teacher’s representation space. To address this issue, we propose PurE logit distillation via multi-grAnularity Knowledge transfer (PEAK), a simple but effective approach to knowledge distillation. Instead of relying on softmax, PEAK directly aligns the original logits (termed pure logits) of the teacher and student models through a scale-invariant normalization module. By leveraging the softmax-free “PEAK” operator, our method achieves pure matching, naturally aligning the mean and standard deviation of the teacher and student logits. Furthermore, our consistency augmentation mechanisms preserve multi-granularity relative scales, including inter-class, intra-class, and global-class relationships. Extensive experiments across various tasks (image classification, object detection, and semantic segmentation) and architectures (Convolutional Neural Networks and Vision Transformers) demonstrate that PEAK achieves strong performance. For instance, on CIFAR-100 and ImageNet classification, our approach achieves average improvements of 0.24 and 0.27 percentage points, respectively, over state-of-the-art methods, highlighting its stability and superiority. Comprehensive ablation and extension studies further validate the effectiveness of the proposed schemes.
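
The contrast the abstract draws between softmax-based matching and scale-invariant matching of pure logits can be illustrated with a small sketch. The abstract does not specify the PEAK operator, so the code below is only an assumption: it uses per-sample z-score standardization (zero mean, unit standard deviation) followed by a mean squared error as a stand-in, and it omits the consistency augmentation mechanisms entirely. The function names `standardize`, `pure_logit_loss`, and `softmax_kd_loss` are hypothetical and chosen for illustration.

```python
import math


def standardize(logits, eps=1e-8):
    """Shift logits to zero mean and scale them to unit standard deviation."""
    mean = sum(logits) / len(logits)
    var = sum((z - mean) ** 2 for z in logits) / len(logits)
    std = math.sqrt(var) + eps
    return [(z - mean) / std for z in logits]


def pure_logit_loss(student_logits, teacher_logits):
    """Assumed softmax-free loss: MSE between standardized logits.

    This is a stand-in for the paper's operator, not its definition.
    """
    s = standardize(student_logits)
    t = standardize(teacher_logits)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


def softmax(logits, temperature=1.0):
    """Numerically stable temperature-scaled softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Conventional baseline: T^2 * KL(teacher || student) on softened outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl


# A student whose logits are an affine transform of the teacher's:
# same class ranking and relative spacing, different scale and offset.
teacher = [2.0, 1.0, -1.0, 0.5]
student = [2.0 * z + 3.0 for z in teacher]

print("standardized-logit loss:", pure_logit_loss(student, teacher))
print("softmax KD loss:        ", softmax_kd_loss(student, teacher))
```

In this toy case the standardized-logit loss is essentially zero, because standardization removes the scale and offset, while the softmax-based loss stays positive because softmax is invariant to the offset but not to the scale. This only illustrates what scale invariance means; it says nothing about the accuracy gains reported above.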
