Accurate 3D object detection from images can be hindered by inherent depth ambiguity. While knowledge distillation (KD) from privileged sensors such as LiDAR offers a promising direction, it often suffers from a critical cross-sensor domain gap. To address this, we introduce DK3D, a novel depth-aware knowledge distillation framework for 3D detection. Our core strategy involves providing the teacher with privileged ground-truth depth during training. This directly avoids the feature representation mismatch and subsequent inefficient knowledge transfer required when distilling from a LiDAR teacher (sparse, geometric) to a camera-based student (dense, semantic). DK3D introduces specialized modules tailored for two primary student paradigms. For depth-assisted models, we employ a channel-wise projection layer (CPL) and an adversarial scoring block (ASB) to align intermediate features at both the pixel and distribution levels. For depth-independent models, a novel vision-depth association module allows the student to implicitly reason about geometry by fusing depth cues with visual features. Both approaches are further enhanced by target-aware spatial response distillation, which captures complex inter-object spatial relationships. Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that DK3D significantly improves performance for both monocular and multi-view 3D detection, outperforming state-of-the-art methods. As a versatile, plug-and-play framework, DK3D boosts existing models without requiring additional training data or increasing the computational cost at inference.
Wu et al. (Wed,) studied this question.