September 28, 2025 · Open Access

LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving

Key Points

The proposed framework significantly boosts performance on driving reasoning tasks, establishing a new standard in the field.
Experiments on the DriveLM and nuScenes-QA datasets showed marked improvements in scene recognition capabilities.
The approach integrates advanced scene understanding with task-specialized structures tailored for autonomous driving.
Results suggest that integrating vision-language models with comprehensive spatial awareness is crucial for complex driving scenarios.

Abstract

Large vision-language models (VLMs) have shown promising capabilities in scene understanding, enhancing the explainability of driving behaviors and interactivity with users. Existing methods primarily fine-tune VLMs on on-board multi-view images and scene reasoning text, but this approach often lacks the holistic and nuanced scene recognition and powerful spatial awareness required for autonomous driving, especially in complex situations. To address this gap, we propose a novel vision-language framework tailored for autonomous driving, called LMAD. Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs. In particular, we introduce preliminary scene interaction and specialized expert adapters within the same driving task structure, which better align VLMs with autonomous driving scenarios. Furthermore, our approach is designed to be fully compatible with existing VLMs while seamlessly integrating with planning-oriented driving systems. Extensive experiments on the DriveLM and nuScenes-QA datasets demonstrate that LMAD significantly boosts the performance of existing VLMs on driving reasoning tasks, setting a new standard in explainable autonomous driving.

Cite this study

Song et al. studied this question.

www.synapsesocial.com/papers/68d913a34ddcf71ba560b8ee — DOI: https://doi.org/10.48550/arxiv.2508.12404

Authors

Nan Song

Bozhou Zhang

Xiatian Zhu
