Abstract

Denoising diffusion models have demonstrated remarkable success in image generation, with numerous approaches achieving state-of-the-art synthesis quality. Autonomous driving applications critically need these capabilities extended to multi-view image generation. However, precise multi-view-consistent generation with 3D geometric awareness, which is critical for 3D perception tasks, remains challenging. Current approaches predominantly rely on overhead layout guidance, yet they frequently fail to maintain cross-view geometric coherence. This limitation manifests as misaligned object structures, discontinuous occlusions, and inconsistent depth relationships when scenes are synthesized from multiple angles. In this paper, we propose MvDeDiffusion, a diffusion-based framework for 3D-consistent multi-view image synthesis, which introduces two key innovations: (1) a cross-view deformable attention mechanism that explicitly enforces geometric and appearance consistency between adjacent viewpoints by adaptively aligning their features during the denoising process, and (2) a 3D-aware conditioning pipeline that integrates camera poses, foreground positional information, and adjacent-view overlap to enable fine-grained control over scene structure while preserving photorealistic details. Our framework ensures view-consistent generation by explicitly modeling inter-perspective correlations during the diffusion process, overcoming the inherent limitations of independent per-view synthesis. Comprehensive experiments demonstrate that our model achieves (1) superior multi-view continuity through geometrically coherent image synthesis and (2) high controllability while preserving the richness of generated scenes. Quantitative evaluation shows that the model significantly outperforms existing approaches in both cross-view alignment fidelity and scene variation richness.
Lu et al. (Wed,) studied this question.