Open-vocabulary 3D object detection (OV-3DOD) is crucial for real-world perception, yet existing monocular methods are often limited by predefined categories or heavy reliance on external 2D detectors. In this paper, we propose CLIP-Mono3D, an end-to-end one-stage transformer framework that directly integrates vision–language semantics into monocular 3D detection. By leveraging CLIP-derived semantic priors and grounding object queries in semantically salient regions, our model achieves robust zero-shot generalization to novel categories without requiring auxiliary 2D detectors. Furthermore, we introduce OV-KITTI, a large-scale benchmark extending KITTI with 40 new categories and over 7000 annotated 3D bounding boxes. Extensive experiments on OV-KITTI, KITTI, and Argoverse demonstrate that CLIP-Mono3D achieves competitive performance in open-vocabulary scenarios.
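As a rough illustration of what zero-shot generalization with CLIP-derived semantics can look like, the sketch below labels one object query by comparing its embedding with text embeddings of category names. It is a generic CLIP-style similarity classifier, not the CLIP-Mono3D architecture, which the abstract does not specify. The function names, the three-dimensional toy embeddings, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_query(query_embedding, text_embeddings, temperature=0.07):
    """Pick the category whose text embedding is closest to the query.

    `text_embeddings` maps category name -> embedding (hypothetical input;
    in practice these would come from a text encoder such as CLIP's).
    The temperature value is an assumption, not taken from the paper.
    """
    names = list(text_embeddings)
    sims = [cosine(query_embedding, text_embeddings[n]) for n in names]
    probs = softmax(sims, temperature)
    best = max(range(len(names)), key=probs.__getitem__)
    return names[best], dict(zip(names, probs))


# Toy stand-ins for text embeddings; adding a novel category only requires
# adding another entry here, with no retraining of a classifier head.
text_embeddings = {
    "car": [1.0, 0.0, 0.0],
    "stroller": [0.0, 1.0, 0.0],
    "traffic cone": [0.0, 0.0, 1.0],
}
query = [0.1, 0.9, 0.2]  # toy object-query embedding

label, probs = classify_query(query, text_embeddings)
print(label)
```

Because the label set is just a dictionary of text embeddings, the vocabulary can be changed at inference time, which is the property that makes this style of classifier open-vocabulary.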
Zichong Gu
Shiyi Mu
Hanqi Lyu
Sensors
Shanghai University
Xi’an Jiaotong-Liverpool University
Gu et al. studied this question.
www.synapsesocial.com/papers/69df2c2fe4eeef8a2a6b1338 — DOI: https://doi.org/10.3390/s26082380