Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Ma, Yingzi; Cao, Yulong; Ding, Wenhao; Zhang, Shuibai; Wang, Yan; Ivanovic, Boris; Jiang, Ming; Pavone, Marco; Xiao, Chaowei
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2512.04459

_version_	1866909943348592640
author	Ma, Yingzi; Cao, Yulong; Ding, Wenhao; Zhang, Shuibai; Wang, Yan; Ivanovic, Boris; Jiang, Ming; Pavone, Marco; Xiao, Chaowei
author_facet	Ma, Yingzi; Cao, Yulong; Ding, Wenhao; Zhang, Shuibai; Wang, Yan; Ivanovic, Boris; Jiang, Ming; Pavone, Marco; Xiao, Chaowei
contents	The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs -- limited by causal attention and sequential token generation -- often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_04459
institution	arXiv
publishDate	2025
record_format	arxiv
title	dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2512.04459