What question did this study set out to answer?

The aim is to improve cross-modal place recognition by addressing the challenges of domain gaps and scale variations in feature learning.

March 4, 2026 · Open Access

MS2-CL: Multi-Scale Self-Supervised Learning for Camera to LiDAR Cross-Modal Place Recognition

Key Points

Developed a framework that uses a hierarchical Swin Transformer to extract multi-scale features from unified 2D representations of both modalities.
Implemented multi-scale self-distillation for intra-modal feature learning, in which coarse-scale “teacher” features supervise fine-scale “student” features.
Achieved cross-modal alignment through a global contrastive loss computed only on the teacher embeddings.
Achieved state-of-the-art performance on the KITTI and KITTI-360 datasets.
With the KITTI-trained model and no fine-tuning, Recall@1 exceeded 60% on all evaluable sequences of KITTI-360 at a 10 m threshold.
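
For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@1 at a distance threshold is commonly computed. It is not the authors' evaluation code. The function name `recall_at_1`, its arguments, and the choice to count every query (rather than only queries that have a true match in the database) are assumptions made for illustration.

```python
import math


def recall_at_1(query_emb, query_pos, db_emb, db_pos, threshold_m=10.0):
    """Fraction of queries whose top-1 retrieved place lies within threshold_m metres.

    Hypothetical helper: embeddings and positions are plain lists of floats.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    hits = 0
    for q_emb, q_pos in zip(query_emb, query_pos):
        # Top-1 retrieval: the database entry closest in embedding space.
        best = min(range(len(db_emb)), key=lambda i: sq_dist(q_emb, db_emb[i]))
        # The retrieval counts as correct if it is metrically close to the query.
        if math.sqrt(sq_dist(q_pos, db_pos[best])) <= threshold_m:
            hits += 1
    return hits / len(query_emb)
```

Under this definition, a Recall@1 above 60% means that for more than 60% of camera queries the single best-matching LiDAR map entry lies within 10 m of the query's true position.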

Abstract

Place recognition is a fundamental challenge for robotics and autonomous vehicles. While visual place recognition has achieved high precision, cross-modal place recognition (specifically, visual localization within large-scale point cloud maps) remains a formidable problem. Existing methods often struggle with the significant domain gap between modalities and can be computationally prohibitive, especially those that process raw 3D point clouds. Furthermore, they frequently fail to learn features that are invariant to viewpoint and scale variations, which limits generalization to unseen environments. In this paper, we formulate cross-modal recognition as a problem of learning a scale-invariant, unified embedding space. Our framework employs a hierarchical Swin Transformer to extract multi-scale features from unified 2D representations of both modalities. The central principle of our method is a multi-scale self-distillation paradigm, which recasts feature learning as an intra-modal knowledge transfer task. Specifically, the coarse-scale “teacher” features provide supervision for the fine-scale “student” features. The final inter-modal alignment is then achieved via a global contrastive loss that uses only the semantically rich “teacher” embeddings, to ensure reliable and discriminative matching. Extensive experiments on the KITTI and KITTI-360 datasets demonstrate that our method achieves state-of-the-art performance. Notably, using only the KITTI-trained model without fine-tuning, Recall@1 exceeds 60% on all evaluable sequences of KITTI-360 at a 10 m threshold. Code and pre-trained models will be made publicly available upon acceptance.
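
The abstract describes two training signals: an intra-modal term in which coarse-scale teacher features supervise fine-scale student features, and an inter-modal contrastive term applied only to the teacher embeddings. The sketch below shows one plausible way such an objective could be composed. It is not the authors' implementation: the cosine-distance form of the distillation term, the symmetric InfoNCE form of the contrastive term, the temperature of 0.07, the `weight` parameter, and all function names are assumptions made for illustration.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def distill_loss(student, teacher):
    """Assumed intra-modal term: mean cosine distance between student and teacher.

    In a real training framework the teacher would typically be treated as a
    constant (no gradient); plain Python cannot express that here.
    """
    pairs = zip(student, teacher)
    return sum(1.0 - dot(l2_normalize(s), l2_normalize(t)) for s, t in pairs) / len(student)


def info_nce(anchors, targets, temperature=0.07):
    """One direction of an InfoNCE loss; anchors[i] is the positive for targets[i]."""
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, tj) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum - logits[i]
    return total / len(a)


def total_loss(img_student, img_teacher, pc_student, pc_teacher, weight=1.0):
    """Assumed combination: per-modality distillation plus teacher-only contrast."""
    distill = distill_loss(img_student, img_teacher) + distill_loss(pc_student, pc_teacher)
    # Only the teacher embeddings enter the cross-modal term, as the abstract states.
    contrast = 0.5 * (info_nce(img_teacher, pc_teacher) + info_nce(pc_teacher, img_teacher))
    return contrast + weight * distill
```

Here each argument is a batch of embedding vectors in which row i of the image batch and row i of the point cloud batch describe the same place. Student and teacher embeddings are assumed to share the same dimensionality, which in practice would require a projection between scales.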
