What question did this study set out to answer?

The study aims to improve decision-making in autonomous driving through a semantic-aligned multimodal framework.

January 23, 2026 · Open Access

Semantic-Aligned Multimodal Vision–Language Framework for Autonomous Driving Decision-Making

Key Points

Proposes SemAlign-E2E, a framework that integrates visual, LiDAR, and task-oriented textual inputs.
Uses cross-modal attention to reason from scene understanding to high-level driving command generation.
Evaluated on the nuScenes dataset and the CARLA simulation platform.
Reports improvements in driving stability and safety.
Demonstrates multi-task generalization and semantic comprehension.
Reports outperforming state-of-the-art baselines, notably in complex traffic scenarios.

Abstract

Recent advances in Large Vision–Language Models (LVLMs) have demonstrated strong cross-modal reasoning capabilities, offering new opportunities for decision-making in autonomous driving. However, existing end-to-end approaches still suffer from limited semantic consistency, weak task controllability, and insufficient interpretability. To address these challenges, we propose SemAlign-E2E (Semantic-Aligned End-to-End), a semantic-aligned multimodal LVLM framework that unifies visual, LiDAR, and task-oriented textual inputs through cross-modal attention. This design enables end-to-end reasoning from scene understanding to high-level driving command generation. Beyond producing structured control instructions, the framework also provides natural-language explanations to enhance interpretability. We conduct extensive evaluations on the nuScenes dataset and CARLA simulation platform. Experimental results show that SemAlign-E2E achieves substantial improvements in driving stability, safety, multi-task generalization, and semantic comprehension, consistently outperforming state-of-the-art baselines. Notably, the framework exhibits superior behavioral consistency and risk-aware decision-making in complex traffic scenarios. These findings highlight the potential of LVLM-driven semantic reasoning for autonomous driving and provide a scalable pathway toward future semantic-enhanced end-to-end driving systems.
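The abstract names cross-modal attention as the mechanism that unifies the three input streams but gives no implementation details. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention in which instruction (text) tokens act as queries over concatenated camera and LiDAR tokens. The token values, the 4-dimensional embeddings, and the function names are all hypothetical, and this is not the SemAlign-E2E architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product attention: each query attends over all keys.

    Returns the fused vectors and the attention weights per query.
    """
    dim = len(keys[0])
    fused, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(wj * v[i] for wj, v in zip(w, values))
            for i in range(len(values[0]))
        ]
        fused.append(out)
        weights.append(w)
    return fused, weights


# Hypothetical toy embeddings (assumed for illustration, not from the paper).
camera_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
lidar_tokens = [[0.0, 0.0, 1.0, 0.0]]
text_tokens = [[4.0, 0.0, 0.0, 0.0]]  # instruction token aligned with camera token 0

# Text queries attend over the concatenated sensor tokens.
sensor_tokens = camera_tokens + lidar_tokens
fused, attn = cross_attention(text_tokens, sensor_tokens, sensor_tokens)
```

In this toy setup the instruction token places most of its attention weight on the first camera token, so the fused vector is dominated by that token. A real system would use learned projections for queries, keys, and values, multiple heads, and far larger embeddings.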

Cite this study

Peng et al. (2026) studied this question.

www.synapsesocial.com/papers/69730f78c8125b09b0d1f3d4 — DOI: https://doi.org/10.3390/machines14010125

Also consider

Synapse has enriched 5 closely related papers on similar research questions. Consider them for comparative context:

VAD: Vectorized Scene Representation for Efficient Autonomous Driving · 2023 · 203 citations
Microscopic Traffic Simulation by Cooperative Multi-agent Deep Reinforcement Learning · 2019 · 17 citations
Optimization‐based autonomous racing of 1:43 scale RC cars · 2014 · 518 citations
VisioPath: Vision-Language Enhanced Model Predictive Control for Safe Autonomous Navigation in Mixed Traffic · 2025 · 1 citation
DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving

Authors

Feng Peng

Shangju She

Zejian Deng

Journals

Machines

Institutions

University of Hong Kong

Chinese University of Hong Kong

Wuhan University of Technology
