Bibliographic Details
Main Authors: Zhang, Haiming; Zhou, Junfei; Jiang, Feng; Li, Jingzhong; Guo, Zhenglong; Dai, Penglin; Dai, Jifeng; Xie, Yan; Zhu, Benjin
Format: Preprint
Published: 2026
Subjects: Robotics; Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2605.26113
author Zhang, Haiming
Zhou, Junfei
Jiang, Feng
Li, Jingzhong
Guo, Zhenglong
Dai, Penglin
Dai, Jifeng
Xie, Yan
Zhu, Benjin
contents Generating high-fidelity and controllable synthetic data is critical for advancing end-to-end autonomous driving, particularly for addressing the long tail of rare safety-critical scenarios. Existing occupancy-guided methods typically rely on shallow conditioning mechanisms and reference-frame-dependent video synthesis, which limits fine-grained controllability from arbitrary BEV layouts and restricts their applicability for scalable simulation. In this paper, we propose AnyScene, a unified occupancy-centric framework for driving scene generation. AnyScene generates semantic occupancy sequences from BEV layouts through a Spatial-Temporal Occupancy Diffusion Transformer that jointly tokenizes BEV and occupancy features in an autoregressive manner. This design enables precise controllability from cross-dataset and user-defined BEV inputs while naturally supporting long-horizon generation. Building upon the generated occupancy, a Geometry-Grounded View Expansion module treats occupancy as the canonical spatial representation and synthesizes temporally consistent multi-view driving videos in a reference-free and autoregressive fashion, supporting flexible camera configurations at inference time. Extensive experiments demonstrate that AnyScene achieves state-of-the-art performance in both occupancy and video generation. It exhibits strong generalization to unseen and customized layouts, and provides measurable benefits for downstream tasks such as sparse-view 3D reconstruction.
format Preprint
id arxiv_https___arxiv_org_abs_2605_26113
institution arXiv
publishDate 2026
record_format arxiv
title AnyScene: Towards Highly Controllable Driving Scene Generation at Anywhere and Beyond
topic Robotics
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2605.26113