| Main Authors: | Xu, Zhefan; Jerfel, Ghassen; Haliem, Marina; Zhao, Qi; Kang, Jeonhyung; Refaat, Khaled S. |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2605.20082 |
| _version_ | 1866918511783182336 |
|---|---|
| author | Xu, Zhefan; Jerfel, Ghassen; Haliem, Marina; Zhao, Qi; Kang, Jeonhyung; Refaat, Khaled S. |
| author_facet | Xu, Zhefan; Jerfel, Ghassen; Haliem, Marina; Zhao, Qi; Kang, Jeonhyung; Refaat, Khaled S. |
| contents | The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2605_20082 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving |
| topic | Computer Vision and Pattern Recognition; Artificial Intelligence |
| url | https://arxiv.org/abs/2605.20082 |
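
The abstract names two quantities without defining them: the Direct Preference Optimization (DPO) objective used for finetuning and the average displacement error (ADE) used for evaluation. The sketch below shows the commonly used per-pair DPO loss and a plain ADE computation. It is an illustration under stated assumptions, not the authors' implementation: the record does not say how trajectory log-probabilities are obtained or what value of `beta` is used, and the function and parameter names here are hypothetical.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-pair DPO loss: -log(sigmoid(beta * margin)).

    Inputs are log-probabilities of the chosen and rejected trajectories
    under the model being finetuned (policy) and the frozen pretrained
    model (reference). beta=0.1 is an assumed default, not a value taken
    from the paper.
    """
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def average_displacement_error(predicted, target):
    """Mean Euclidean distance between matching waypoints.

    Each trajectory is a sequence of (x, y) points with equal length.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(predicted, target)) / len(predicted)


if __name__ == "__main__":
    # The policy raises the chosen trajectory's log-probability relative
    # to the reference, so the loss drops below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
    print(average_displacement_error([(0.0, 0.0), (3.0, 4.0)],
                                     [(0.0, 0.0), (0.0, 0.0)]))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls as the policy favors the VLM-selected trajectory more strongly than the reference does.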