Few-shot learning (FSL) and data-efficient learning paradigms enable object detection models to recognize novel classes from minimally annotated examples, addressing expensive data-labeling challenges. This systematic survey examines recent advances in few-shot, semi-supervised, sparsely-supervised, and weakly-supervised approaches for video and 3D object detection, focusing on developments through foundation models and vision-language model integration. For video object detection, techniques including tube proposals, temporal matching networks, motion-guided approaches, and temporal consistency-based semi-supervised methods utilize spatiotemporal relationships for efficient novel class adaptation, with recent architectures achieving substantial gains from 33 to 48 average precision in few-shot scenarios. For 3D object detection, specialized approaches address point cloud sparsity and texture limitations through uncertainty-aware methods, geometric learning, and multimodal fusion, with sparsely-supervised techniques achieving competitive performance using only 2% of annotations, enabling practical deployment in autonomous driving and robotics. The survey analyzes methodological advances including meta-learning, transfer learning, pseudo-label generation, contrastive instance mining, and foundation model integration across applications spanning autonomous driving, surveillance, robotics, industrial control, and medical imaging. By examining developments across multiple supervision paradigms, this work highlights data-efficient learning’s potential for minimizing annotation requirements and enabling robust real-world deployment across temporal, spatial, and multimodal domains.
Ferdaus et al. (Mon,) studied this question.