Bibliographic Details
Main Authors: Xu, Jiawei; Zhong, Zhizhou; Shu, Zhijian; Jia, Mingkai; Li, Mingxiao; Bian, Jia-Wang; Zhang, Qian; Zhang, Kaicheng; Xie, Jin; Yang, Jian; Yin, Wei
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2605.14696
_version_ 1866914567265714176
author Xu, Jiawei
Zhong, Zhizhou
Shu, Zhijian
Jia, Mingkai
Li, Mingxiao
Bian, Jia-Wang
Zhang, Qian
Zhang, Kaicheng
Xie, Jin
Yang, Jian
Yin, Wei
contents Data scaling plays a pivotal role in the pursuit of general intelligence. However, the prevailing perception-planning paradigm in autonomous driving relies heavily on expensive manual annotations to supervise trajectory planning, which severely limits its scalability. Conversely, although existing perception-free driving world models achieve impressive driving performance, their real-world reasoning ability for planning is built solely on next-frame image forecasting. Lacking sufficient supervision, these models often struggle with comprehensive scene understanding, resulting in unsatisfactory trajectory planning. In this paper, we propose EponaV2, a novel paradigm of driving world models that achieves high-quality planning with comprehensive future reasoning. Inspired by how human drivers anticipate 3D geometry and semantics, we train our model to forecast more comprehensive future representations, which can additionally be decoded into future geometry and semantic maps. Extracting the 3D and semantic modalities enables our model to deeply understand the surrounding environment, and the future prediction task significantly enhances the real-world reasoning capabilities of EponaV2, ultimately leading to improved trajectory planning. Moreover, inspired by the training recipe of Large Language Models (LLMs), we introduce a flow-matching group relative policy optimization mechanism to further improve planning accuracy. The state-of-the-art (SOTA) performance of EponaV2 among perception-free models on three NAVSIM benchmarks (+1.3 PDMS, +5.5 EPDMS) demonstrates the effectiveness of our methods.
format Preprint
id arxiv_https___arxiv_org_abs_2605_14696
institution arXiv
publishDate 2026
record_format arxiv
title EponaV2: Driving World Model with Comprehensive Future Reasoning
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2605.14696