Bibliographic Details
Main Authors: Song, Yuehao, Chen, Shaoyu, Gao, Hao, Zhu, Yifan, Yue, Weixiang, Zou, Jialv, Jiang, Bo, Lu, Zihao, Wang, Yu, Zhang, Qian, Wang, Xinggang
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2603.11219
_version_ 1866918383734226944
author Song, Yuehao
Chen, Shaoyu
Gao, Hao
Zhu, Yifan
Yue, Weixiang
Zou, Jialv
Jiang, Bo
Lu, Zihao
Wang, Yu
Zhang, Qian
Wang, Xinggang
contents Vision-language models (VLMs) enhance the planning capability of end-to-end (E2E) driving policy by leveraging high-level semantic reasoning. However, existing approaches often overlook the dual-system consistency between VLM's high-level decision and E2E's low-level planning. As a result, the generated trajectories may misalign with the intended driving decisions, leading to weakened top-down guidance and decision-following ability of the system. To address this issue, we propose Senna-2, an advanced VLM-E2E driving policy that explicitly aligns the two systems for consistent decision-making and planning. Our method follows a consistency-oriented three-stage training paradigm. In the first stage, we conduct driving pre-training to achieve preliminary decision-making and planning, with a decision adapter transmitting VLM decisions to E2E policy in the form of implicit embeddings. In the second stage, we align the VLM and the E2E policy in an open-loop setting. In the third stage, we perform closed-loop alignment via bottom-up Hierarchical Reinforcement Learning in 3DGS environments to reinforce the safety and efficiency. Extensive experiments demonstrate that Senna-2 achieves superior dual-system consistency (19.3% F1 score improvement) and significantly enhances driving safety in both open-loop (5.7% FDE reduction) and closed-loop settings (30.6% AF-CR reduction).
format Preprint
id arxiv_https___arxiv_org_abs_2603_11219
institution arXiv
publishDate 2026
record_format arxiv
title Senna-2: Aligning VLM and End-to-End Driving Policy for Consistent Decision Making and Planning
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.11219