LiDAR-based 3D object detection has witnessed significant progress with the introduction of Transformer architectures. Currently, Bird’s-Eye-View (BEV) based methods, such as TransFusion, dominate the field by flattening 3D voxel features into 2D representations for efficient query processing. However, this projection inevitably leads to the loss of crucial vertical geometric information, resulting in suboptimal performance for objects with complex height profiles or in occluded scenarios. In this paper, we present SV-TransFusion, a novel framework designed to mitigate this limitation by re-establishing the connection between object queries and raw 3D structural data. Our approach incorporates two primary innovations. First, we propose the Sparse Voxel-Query Interaction (SVQI) module. Instead of relying solely on compressed BEV features, SVQI allows learnable queries to directly attend to the sparse, non-empty 3D voxels from the backbone, effectively retrieving fine-grained height and structural information. Second, to accelerate convergence and enhance training stability, we introduce a Query-based Contrastive Denoising (QCD) strategy. This mechanism aids the bipartite matching process by introducing noise-corrupted queries during training, thereby enabling the model to learn more robust feature representations. Extensive experiments on the nuScenes dataset demonstrate that SV-TransFusion achieves state-of-the-art performance, significantly outperforming baseline methods in detection accuracy with a moderate computational overhead.
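The two mechanisms described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no architectural details, so the single-head attention, the optional distance mask (`radius`), the residual update, the uniform noise model, and all function and parameter names here are assumptions chosen for illustration only.

```python
import numpy as np


def nonempty_voxels(grid):
    """Extract sparse voxels from a dense (X, Y, Z, C) feature volume.

    Returns integer-valued centre coordinates (N, 3) and features (N, C)
    for voxels with at least one non-zero feature channel.
    """
    mask = np.any(grid != 0, axis=-1)           # (X, Y, Z) occupancy
    coords = np.argwhere(mask).astype(np.float64)
    return coords, grid[mask]                   # same row-major order


def sparse_voxel_cross_attention(queries, voxel_feats, voxel_coords,
                                 query_pos, radius=None):
    """Single-head cross-attention from object queries to non-empty voxels.

    queries: (Q, C), voxel_feats: (N, C), voxel_coords: (N, 3),
    query_pos: (Q, 3). `radius` (hypothetical) masks out voxels that lie
    farther than this distance from a query's position.
    """
    scale = 1.0 / np.sqrt(queries.shape[1])
    logits = (queries @ voxel_feats.T) * scale  # (Q, N)
    if radius is not None:
        diff = query_pos[:, None, :] - voxel_coords[None, :, :]
        dist = np.linalg.norm(diff, axis=-1)
        logits = np.where(dist <= radius, logits, -np.inf)

    # Numerically stable softmax that tolerates fully masked rows.
    row_max = np.max(logits, axis=1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    weights = np.exp(logits - row_max)
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom,
                        out=np.zeros_like(weights), where=denom > 0)

    # Residual update: each query absorbs height-aware voxel features.
    return queries + weights @ voxel_feats, weights


def noisy_query_centres(gt_centres, noise_scale, rng):
    """Noise-corrupted query positions for denoising-style training.

    Assumes uniform noise in [-noise_scale, noise_scale] per coordinate.
    """
    noise = rng.uniform(-noise_scale, noise_scale, size=gt_centres.shape)
    return gt_centres + noise
```

Because attention runs only over the N occupied voxels rather than the full X·Y·Z grid, the cost scales with scene occupancy, which is consistent with the moderate overhead the abstract reports, though the actual module may differ substantially.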
Tianli Shi
Scientific Reports
China Electronics Technology Group Corporation
www.synapsesocial.com/papers/69b606ea83145bc643d1d573 — DOI: https://doi.org/10.1038/s41598-026-42093-y