This paper evaluates DINOv3, a recent large-scale self-supervised vision backbone, for visuomotor diffusion policy learning in robotic manipulation. We investigate whether a purely self-supervised encoder can match or surpass conventional supervised ImageNet-pretrained backbones (e.g., ResNet-18) under three regimes: training from scratch, frozen, and finetuned. Across four benchmark tasks (Push-T, Lift, Can, Square) using a unified FiLM-conditioned diffusion policy, we find that (i) finetuned DINOv3 matches or exceeds ResNet-18 on several tasks, (ii) frozen DINOv3 remains competitive, indicating strong transferable priors, and (iii) self-supervised features improve sample efficiency and robustness. These results support self-supervised large visual models as effective, generalizable perceptual front-ends for action diffusion policies, motivating further exploration of scalable label-free pretraining in robotic manipulation. Compared with a ResNet-18 backbone, DINOv3 achieves up to a 10% absolute increase in test-time success rate on challenging tasks such as Can, and on-par performance on tasks such as Lift, Push-T, and Square.
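As a rough illustration of what "FiLM-conditioned" means here, the sketch below applies feature-wise linear modulation: a conditioning vector (for example, visual features from the encoder) is mapped to a per-channel scale and shift that modulate the hidden features of the policy network. This is a minimal pure-Python sketch of generic FiLM, not the paper's implementation; the function names `matvec` and `film`, the weight shapes, and the toy values are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (cols)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def film(
    features: Vector,
    cond: Vector,
    w_gamma: Matrix,
    b_gamma: Vector,
    w_beta: Matrix,
    b_beta: Vector,
) -> Vector:
    """Generic FiLM: out = gamma(cond) * features + beta(cond).

    gamma and beta are linear functions of the conditioning vector
    (hypothetical parameterization, one scale and shift per channel).
    """
    gamma = [g + b for g, b in zip(matvec(w_gamma, cond), b_gamma)]
    beta = [s + b for s, b in zip(matvec(w_beta, cond), b_beta)]
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


# Toy example: 2 feature channels, 3-dimensional conditioning vector.
features = [1.0, 2.0]
cond = [0.5, -1.0, 2.0]

# With zero weights, gamma = 1 and beta = 0, FiLM is the identity.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
identity_out = film(features, cond, zeros, [1.0, 1.0], zeros, [0.0, 0.0])

# Non-trivial modulation driven by the conditioning vector.
w_gamma = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_beta = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
modulated_out = film(features, cond, w_gamma, [0.0, 0.0], w_beta, [0.0, 0.5])
```

Under this reading, the three regimes in the abstract differ only in how the encoder that produces `cond` is initialized and whether its weights are updated during policy training; the modulation step itself is unchanged.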
T. I. Egbe
Peng Wang
Zhihao Guo
Egbe et al. (Mon,) studied this question.
www.synapsesocial.com/papers/68e0450fa99c246f578b408b — DOI: https://doi.org/10.48550/arxiv.2509.17684