Estimating 6D object pose from monocular RGB images remains an important yet data-intensive problem in computer vision. In this work, we propose a few-shot 6D pose estimation framework that explicitly decouples rotation and translation estimation, significantly reducing dependence on large-scale annotated real-world data. For rotation, our method employs a viewpoint encoder trained solely on synthetic data to generate a codebook for rotation retrieval, complemented by an in-plane rotation regression module. For translation, we adopt a geometry-aware regression network based on dense 2D–3D correspondences. Experimental results on the LINEMOD, LM-O, and YCB-V datasets show that our approach achieves state-of-the-art performance (97.6%, 65.3%, and 65.9% ADD(-S), respectively) using only 600 real images per object, cutting real data requirements by 80% compared to typical fully supervised 6D pose estimation methods. These findings highlight the effectiveness and generalization ability of our method under limited supervision.
Lu et al. (Wed,) studied this question.
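
As a rough illustration of the rotation branch described in the abstract, the sketch below retrieves the nearest viewpoint from a codebook and then applies a regressed in-plane angle. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`retrieve_viewpoint`, `compose_rotation`), the use of cosine similarity for retrieval, the 64-dimensional embeddings, and the composition order `R = R_z(theta) @ R_view` are all hypothetical choices made here for illustration.

```python
import numpy as np


def rot_x(angle):
    """Rotation by `angle` radians about the x-axis (stand-in for an out-of-plane viewpoint)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


def rot_z(angle):
    """Rotation by `angle` radians about the z-axis (in-plane rotation, assuming z is the optical axis)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def retrieve_viewpoint(query, codebook):
    """Return the index of the codebook row with the highest cosine similarity to `query`.

    Cosine similarity is an assumption; the abstract does not name the retrieval metric.
    """
    q = query / np.linalg.norm(query)
    cb = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    return int(np.argmax(cb @ q))


def compose_rotation(viewpoint_rotations, idx, in_plane_angle):
    """Combine the retrieved viewpoint rotation with an in-plane angle (assumed order: R_z @ R_view)."""
    return rot_z(in_plane_angle) @ viewpoint_rotations[idx]


# Toy data: random embeddings stand in for the output of a viewpoint encoder.
rng = np.random.default_rng(0)
num_views, dim = 100, 64
codebook = rng.normal(size=(num_views, dim))
viewpoint_rotations = np.stack(
    [rot_x(a) for a in np.linspace(0.0, np.pi, num_views)]
)

# A query embedding close to entry 7, as if encoded from a test image.
query = codebook[7] + 0.01 * rng.normal(size=dim)
idx = retrieve_viewpoint(query, codebook)
R = compose_rotation(viewpoint_rotations, idx, in_plane_angle=np.deg2rad(30.0))
print("retrieved viewpoint index:", idx)
print("estimated rotation:\n", R)
```

Separating the discrete viewpoint lookup from the continuous in-plane angle keeps the codebook small in this sketch, since it only needs to cover out-of-plane viewpoints rather than the full rotation space.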