Staff View: VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving :: Library Catalog

Bibliographic Details
Main Authors:	Xu, Zhefan; Jerfel, Ghassen; Haliem, Marina; Zhao, Qi; Kang, Jeonhyung; Refaat, Khaled S.
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.20082

_version_	1866918511783182336
author	Xu, Zhefan; Jerfel, Ghassen; Haliem, Marina; Zhao, Qi; Kang, Jeonhyung; Refaat, Khaled S.
author_facet	Xu, Zhefan; Jerfel, Ghassen; Haliem, Marina; Zhao, Qi; Kang, Jeonhyung; Refaat, Khaled S.
contents	The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_20082
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving Xu, Zhefan Jerfel, Ghassen Haliem, Marina Zhao, Qi Kang, Jeonhyung Refaat, Khaled S. Computer Vision and Pattern Recognition Artificial Intelligence The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
title	VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2605.20082

Similar Items