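| Main Authors: | Ma, Yingzi; Cao, Yulong; Ding, Wenhao; Zhang, Shuibai; Wang, Yan; Ivanovic, Boris; Jiang, Ming; Pavone, Marco; Xiao, Chaowei |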
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2512.04459 |
| _version_ | 1866909943348592640 |
|---|---|
| author | Ma, Yingzi; Cao, Yulong; Ding, Wenhao; Zhang, Shuibai; Wang, Yan; Ivanovic, Boris; Jiang, Ming; Pavone, Marco; Xiao, Chaowei |
| contents | The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs -- limited by causal attention and sequential token generation -- often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2512_04459 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2512.04459 |