What question did this study set out to answer?

The study aims to develop an approach to monocular 3D object detection that does not rely on predefined categories or external 2D detectors.

April 15, 2026 (Open Access)

CLIP-Mono3D: End-to-End Open-Vocabulary Monocular 3D Object Detection via Semantic–Geometric Similarity

Key Points

Proposed CLIP-Mono3D, an end-to-end one-stage transformer framework
Integrated vision–language semantics directly into monocular 3D detection
Introduced OV-KITTI, a benchmark extending KITTI with 40 new categories and over 7000 annotated 3D bounding boxes
Conducted experiments on the OV-KITTI, KITTI, and Argoverse datasets
Achieved robust zero-shot generalization to novel categories without auxiliary 2D detectors
Showed competitive performance in open-vocabulary scenarios

Abstract

Open-vocabulary 3D object detection (OV-3DOD) is crucial for real-world perception, yet existing monocular methods are often limited by predefined categories or heavy reliance on external 2D detectors. In this paper, we propose CLIP-Mono3D, an end-to-end one-stage transformer framework that directly integrates vision–language semantics into monocular 3D detection. By leveraging CLIP-derived semantic priors and grounding object queries in semantically salient regions, our model achieves robust zero-shot generalization to novel categories without requiring auxiliary 2D detectors. Furthermore, we introduce OV-KITTI, a large-scale benchmark extending KITTI with 40 new categories and over 7000 annotated 3D bounding boxes. Extensive experiments on OV-KITTI, KITTI, and Argoverse demonstrate that CLIP-Mono3D achieves competitive performance in open-vocabulary scenarios.
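
The abstract does not spell out how CLIP-derived semantics are turned into category predictions. As a rough illustration only, and not the authors' implementation, the sketch below shows one common way open-vocabulary classification is done: each object query embedding is compared with text embeddings of the candidate category names by cosine similarity, and the best match becomes the label. The function names, the temperature value, and the toy 3-dimensional embeddings are all assumptions made for this example; in practice the embeddings would come from a vision–language model such as CLIP.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def classify_open_vocab(query_emb, text_emb, class_names, temperature=0.01):
    """Assign each object query the category whose text embedding is closest.

    query_emb: (num_queries, dim) object query embeddings (hypothetical input).
    text_emb:  (num_classes, dim) text embeddings of the category names.
    temperature: softmax temperature; 0.01 is an assumed value, not from the paper.
    """
    q = l2_normalize(query_emb)
    t = l2_normalize(text_emb)
    logits = q @ t.T / temperature               # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)    # softmax over categories
    best = probs.argmax(axis=1)
    return [class_names[i] for i in best], probs


# Toy stand-in embeddings; real ones would come from a text and image encoder.
class_names = ["car", "pedestrian", "traffic cone"]
text_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
query_emb = np.array([[0.9, 0.1, 0.0],
                      [0.1, 0.2, 0.95]])

labels, probs = classify_open_vocab(query_emb, text_emb, class_names)
print(labels)
```

Because the category list is just a set of text embeddings, new categories can be added at inference time by embedding their names, which is the property that makes zero-shot generalization to novel categories possible in this style of classifier.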

Authors

Zichong Gu

Shiyi Mu

Hanqi Lyu

Journals

Sensors

Institutions

Shanghai University

Xi’an Jiaotong-Liverpool University
