Recent advances in Large Vision–Language Models (LVLMs) have demonstrated strong cross-modal reasoning capabilities, offering new opportunities for decision-making in autonomous driving. However, existing end-to-end approaches still suffer from limited semantic consistency, weak task controllability, and insufficient interpretability. To address these challenges, we propose SemAlign-E2E (Semantic-Aligned End-to-End), a semantic-aligned multimodal LVLM framework that unifies visual, LiDAR, and task-oriented textual inputs through cross-modal attention. This design enables end-to-end reasoning from scene understanding to high-level driving command generation. Beyond producing structured control instructions, the framework also provides natural-language explanations to enhance interpretability. We conduct extensive evaluations on the nuScenes dataset and the CARLA simulation platform. Experimental results show that SemAlign-E2E achieves substantial improvements in driving stability, safety, multi-task generalization, and semantic comprehension, consistently outperforming state-of-the-art baselines. Notably, the framework exhibits superior behavioral consistency and risk-aware decision-making in complex traffic scenarios. These findings highlight the potential of LVLM-driven semantic reasoning for autonomous driving and provide a scalable pathway toward future semantic-enhanced end-to-end driving systems.
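As a rough illustration of what fusing visual, LiDAR, and textual inputs through cross-modal attention can look like, the sketch below lets text tokens act as queries over a combined set of camera and LiDAR tokens using plain scaled dot-product attention. This is not the SemAlign-E2E implementation: the abstract does not specify the architecture, so the token counts, the embedding size `DIM`, the random features, and the helper names `cross_attention` and `random_tokens` are all assumptions made for illustration, and the learned query/key/value projections of a real model are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all key/value rows.

    Hypothetical helper; no learned projections are applied.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value rows, one output channel at a time.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def random_tokens(n, d, rng):
    """Stand-in for encoder outputs: n random d-dimensional token embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
DIM = 8  # assumed embedding size, chosen only for the example
text_tokens = random_tokens(3, DIM, rng)   # assumed task-instruction embeddings
image_tokens = random_tokens(5, DIM, rng)  # assumed camera features
lidar_tokens = random_tokens(4, DIM, rng)  # assumed LiDAR features

# Text queries attend over the concatenated sensor tokens.
sensor_tokens = image_tokens + lidar_tokens
fused, attn = cross_attention(text_tokens, sensor_tokens, sensor_tokens)
print(len(fused), len(fused[0]), len(attn[0]))
```

Each row of `attn` shows how strongly one text token weights every camera and LiDAR token, which is the kind of signal a model could use both to condition driving commands on the task instruction and to ground a natural-language explanation.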
Peng et al. studied this question.
www.synapsesocial.com/papers/69730f78c8125b09b0d1f3d4 — DOI: https://doi.org/10.3390/machines14010125
Feng Peng
Shangju She
Zejian Deng
Machines
University of Hong Kong
Chinese University of Hong Kong
Wuhan University of Technology