Bibliographic Details
Main Authors: Xu, Zhefan, Jerfel, Ghassen, Haliem, Marina, Zhao, Qi, Kang, Jeonhyung, Refaat, Khaled S.
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence
Online Access: https://arxiv.org/abs/2605.20082
contents The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
format Preprint
id arxiv_https___arxiv_org_abs_2605_20082
institution arXiv
publishDate 2026
record_format arxiv
title VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2605.20082
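
Illustrative note: the abstract names two quantities, average displacement error (ADE) and the Direct Preference Optimization (DPO) objective, without spelling them out. The sketch below shows the textbook form of each in plain Python. It is not taken from the paper: the function names, the per-pair log-probability inputs, and the `beta` value are assumptions made for illustration, and the record does not say how VL-DPO parameterizes trajectories or computes likelihoods.

```python
import math


def average_displacement_error(pred, target):
    """Mean Euclidean distance between matched (x, y) waypoints."""
    if len(pred) != len(target) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are log-probabilities of the preferred ("chosen") and the
    dispreferred ("rejected") trajectory under the model being finetuned
    and under a frozen reference model. beta=0.1 is an assumed default.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

In this reading, the "chosen" trajectory of each pair would be the rollout the VLM selects and the "rejected" one a rollout it passes over. The loss falls below log 2 once the finetuned model favors the chosen trajectory more strongly than the reference model does.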