March 3, 2026 | Open Access

Few-Shot 6D Object Pose Estimation via Decoupled Rotation and Translation with Viewpoint Encoding

Key Points

The proposed few-shot 6D pose estimation framework cuts real-data requirements by 80% compared with typical fully supervised methods while achieving state-of-the-art scores on several datasets.
The method reaches ADD(-S) scores of 97.6% on LINEMOD, 65.3% on LM-O, and 65.9% on YCB-V using only 600 real images per object.
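A viewpoint encoder trained solely on synthetic data builds a codebook for rotation retrieval, an in-plane rotation regression module completes the rotation estimate, and a separate regression network based on dense 2D–3D correspondences estimates translation.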
Findings suggest this decoupling approach enhances model generalization, making it effective under limited supervision.
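
The rotation branch described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the codebook stores one embedding per sampled viewpoint, that retrieval uses cosine similarity, and that the in-plane angle rotates about the camera's optical axis. The function names (`build_codebook`, `retrieve_viewpoint`, `compose_rotation`) are hypothetical.

```python
import numpy as np

def build_codebook(embeddings):
    """L2-normalize each row so a dot product equals cosine similarity.

    `embeddings` stands in for encoder outputs on synthetic renderings
    (assumed shape: num_viewpoints x dim).
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

def retrieve_viewpoint(query, codebook, rotations):
    """Return the stored rotation whose embedding best matches the query."""
    q = query / np.linalg.norm(query)
    scores = codebook @ q              # cosine similarity to every entry
    best = int(np.argmax(scores))
    return rotations[best], best

def in_plane_rotation(theta):
    """Rotation by `theta` radians about the camera z (optical) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def compose_rotation(r_view, theta):
    """Combine the retrieved viewpoint rotation with the regressed
    in-plane angle (assumed convention: in-plane rotation applied last)."""
    return in_plane_rotation(theta) @ r_view
```

Under these assumptions, the full rotation is the retrieved viewpoint rotation followed by the regressed in-plane rotation, while translation is estimated by a separate network and does not appear here.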

Abstract

Estimating 6D object pose from monocular RGB images remains a critical yet data-intensive challenge in computer vision. In this work, we propose a novel few-shot 6D pose estimation framework that explicitly decouples rotation and translation estimation, significantly reducing dependence on large-scale annotated real-world data. Our method employs a viewpoint encoder trained solely on synthetic data to generate a codebook for rotation retrieval, complemented by an in-plane rotation regression module. For translation, we adopt a geometry-aware regression network based on dense 2D–3D correspondences. Experimental results on LINEMOD, LM-O, and YCB-V datasets demonstrate that our approach achieves state-of-the-art performance (97.6%, 65.3%, and 65.9% ADD(-S), respectively), using only 600 real images per object—cutting real data requirements by 80% compared to typical fully-supervised 6D pose estimation methods. These findings highlight the effectiveness and generalization ability of our method under limited supervision.
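
For readers unfamiliar with the ADD(-S) metric quoted above, the following sketch shows how it is commonly computed. It assumes the usual convention: ADD averages the distance between corresponding model points under the ground-truth and predicted poses, ADD-S uses the nearest-point distance for symmetric objects, and a pose counts as correct when the error is below 10% of the object diameter. The summary does not state the exact evaluation protocol, so the threshold and function names here are assumptions.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform to an (N, 3) array of model points."""
    return points @ R.T + t

def add_error(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding transformed points."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    return float(np.mean(np.linalg.norm(gt - pred, axis=1)))

def add_s_error(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each ground-truth point to its nearest
    predicted point, used for symmetric objects."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    # Dense N x N distance matrix; acceptable for small point sets only.
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return float(np.mean(dists.min(axis=1)))

def is_correct(error, diameter, threshold=0.1):
    """Assumed acceptance rule: error below 10% of the object diameter."""
    return error < threshold * diameter
```

The reported percentage would then be the fraction of test images for which `is_correct` holds, with ADD-S substituted for ADD on symmetric objects.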
