Bibliographic Details
Main Authors: Ma, Yingzi, Cao, Yulong, Ding, Wenhao, Zhang, Shuibai, Wang, Yan, Ivanovic, Boris, Jiang, Ming, Pavone, Marco, Xiao, Chaowei
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2512.04459
author Ma, Yingzi
Cao, Yulong
Ding, Wenhao
Zhang, Shuibai
Wang, Yan
Ivanovic, Boris
Jiang, Ming
Pavone, Marco
Xiao, Chaowei
contents The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs -- limited by causal attention and sequential token generation -- often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving.
format Preprint
id arxiv_https___arxiv_org_abs_2512_04459
institution arXiv
publishDate 2025
record_format arxiv
title dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2512.04459