Staff View: DualDiff+: Dual-Branch Diffusion for High-Fidelity Video Generation with Reward Guidance :: Library Catalog

Bibliographic Details
Main Authors:	Yang, Zhao; Qian, Zezhong; Li, Xiaofan; Xu, Weixiang; Zhao, Gongpeng; Yu, Ruohong; Zhu, Lingsi; Liu, Longjun
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.03689

_version_	1866910859736907776
author	Yang, Zhao; Qian, Zezhong; Li, Xiaofan; Xu, Weixiang; Zhao, Gongpeng; Yu, Ruohong; Zhu, Lingsi; Liu, Longjun
author_facet	Yang, Zhao; Qian, Zezhong; Li, Xiaofan; Xu, Weixiang; Zhao, Gongpeng; Yu, Ruohong; Zhu, Lingsi; Liu, Longjun
contents	Accurate and high-fidelity driving scene reconstruction demands the effective utilization of comprehensive scene information as conditional inputs. Existing methods predominantly rely on 3D bounding boxes and BEV road maps for foreground and background control, which fail to capture the full complexity of driving scenes and adequately integrate multimodal information. In this work, we present DualDiff, a dual-branch conditional diffusion model designed to enhance driving scene generation across multiple views and video sequences. Specifically, we introduce Occupancy Ray-shape Sampling (ORS) as a conditional input, offering rich foreground and background semantics alongside 3D spatial geometry to precisely control the generation of both elements. To improve the synthesis of fine-grained foreground objects, particularly complex and distant ones, we propose a Foreground-Aware Mask (FGM) denoising loss function. Additionally, we develop the Semantic Fusion Attention (SFA) mechanism to dynamically prioritize relevant information and suppress noise, enabling more effective multimodal fusion. Finally, to ensure high-quality image-to-video generation, we introduce the Reward-Guided Diffusion (RGD) framework, which maintains global consistency and semantic coherence in generated videos. Extensive experiments demonstrate that DualDiff achieves state-of-the-art (SOTA) performance across multiple datasets. On the NuScenes dataset, DualDiff reduces the FID score by 4.09% compared to the best baseline. In downstream tasks, such as BEV segmentation, our method improves vehicle mIoU by 4.50% and road mIoU by 1.70%, while in BEV 3D object detection, the foreground mAP increases by 1.46%. Code will be made available at https://github.com/yangzhaojason/DualDiff.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_03689
institution	arXiv
publishDate	2025
record_format	arxiv
title	DualDiff+: Dual-Branch Diffusion for High-Fidelity Video Generation with Reward Guidance
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.03689
