Staff View: SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation :: Library Catalog

Bibliographic Details
Main Authors:	Dahmani, Hiba; Piasco, Nathan; Bennehar, Moussab; Roldão, Luis; Tsishkou, Dzmitry; Caraffa, Laurent; Tarel, Jean-Philippe; Brémond, Roland
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.06113

_version_	1866908944359751680
author	Dahmani, Hiba; Piasco, Nathan; Bennehar, Moussab; Roldão, Luis; Tsishkou, Dzmitry; Caraffa, Laurent; Tarel, Jean-Philippe; Brémond, Roland
author_facet	Dahmani, Hiba; Piasco, Nathan; Bennehar, Moussab; Roldão, Luis; Tsishkou, Dzmitry; Caraffa, Laurent; Tarel, Jean-Philippe; Brémond, Roland
contents	Scalable generation of outdoor driving scenes requires 3D representations that remain consistent across multiple viewpoints and scale to large areas. Existing solutions either rely on image or video generative models distilled to 3D space, harming the geometric coherence and restricting the rendering to training views, or are limited to small-scale 3D scene or object-centric generation. In this work, we propose a 3D generative framework based on $Σ$-Voxfield grid, a discrete representation where each occupied voxel stores a fixed number of colorized surface samples. To generate this representation, we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally, we render the generated $Σ$-Voxfield grid with a deferred rendering module to obtain photorealistic images, enabling large-scale multiview-consistent 3D scene generation without per-scene optimization. Extensive experiments show that our approach can generate diverse large-scale urban outdoor scenes, renderable into photorealistic images with various sensor configurations and camera trajectories while maintaining moderate computation cost compared to existing approaches.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_06113
institution	arXiv
publishDate	2026
record_format	arxiv
title	SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2604.06113
