Staff View: AnyScene: Towards Highly Controllable Driving Scene Generation at Anywhere and Beyond :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Haiming; Zhou, Junfei; Jiang, Feng; Li, Jingzhong; Guo, Zhenglong; Dai, Penglin; Dai, Jifeng; Xie, Yan; Zhu, Benjin
Format:	Preprint
Published:	2026
Subjects:	Robotics; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.26113

_version_	1866913161883418624
author	Zhang, Haiming; Zhou, Junfei; Jiang, Feng; Li, Jingzhong; Guo, Zhenglong; Dai, Penglin; Dai, Jifeng; Xie, Yan; Zhu, Benjin
author_facet	Zhang, Haiming; Zhou, Junfei; Jiang, Feng; Li, Jingzhong; Guo, Zhenglong; Dai, Penglin; Dai, Jifeng; Xie, Yan; Zhu, Benjin
contents	Generating high-fidelity and controllable synthetic data is critical for advancing end-to-end autonomous driving, particularly for addressing the long tail of rare safety-critical scenarios. Existing occupancy-guided methods typically rely on shallow conditioning mechanisms and reference-frame-dependent video synthesis, which limits fine-grained controllability from arbitrary BEV layouts and restricts their applicability for scalable simulation. In this paper, we propose AnyScene, a unified occupancy-centric framework for driving scene generation. AnyScene generates semantic occupancy sequences from BEV layouts through a Spatial-Temporal Occupancy Diffusion Transformer that jointly tokenizes BEV and occupancy features in an autoregressive manner. This design enables precise controllability from cross-dataset and user-defined BEV inputs while naturally supporting long-horizon generation. Building upon the generated occupancy, a Geometry-Grounded View Expansion module treats occupancy as the canonical spatial representation and synthesizes temporally consistent multi-view driving videos in a reference-free and autoregressive fashion, supporting flexible camera configurations at inference time. Extensive experiments demonstrate that AnyScene achieves state-of-the-art performance in both occupancy and video generation. It exhibits strong generalization to unseen and customized layouts, and provides measurable benefits for downstream tasks such as sparse-view 3D reconstruction.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_26113
institution	arXiv
publishDate	2026
record_format	arxiv
title	AnyScene: Towards Highly Controllable Driving Scene Generation at Anywhere and Beyond
topic	Robotics; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2605.26113
