Staff View: Vivid-ZOO: Multi-View Video Generation with Diffusion Model


Bibliographic Details
Main Authors:	Li, Bing; Zheng, Cheng; Zhu, Wenxuan; Mai, Jinjie; Zhang, Biao; Wonka, Peter; Ghanem, Bernard
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2406.08659

_version_	1866910484924465152
author	Li, Bing; Zheng, Cheng; Zhu, Wenxuan; Mai, Jinjie; Zhang, Biao; Wonka, Peter; Ghanem, Bernard
author_facet	Li, Bing; Zheng, Cheng; Zhu, Wenxuan; Mai, Jinjie; Zhang, Biao; Wonka, Peter; Ghanem, Bernard
contents	While diffusion models have shown impressive performance in 2D image/video generation, diffusion-based Text-to-Multi-view-Video (T2MVid) generation remains underexplored. The new challenges posed by T2MVid generation lie in the lack of massive captioned multi-view videos and the complexity of modeling such a multi-dimensional distribution. To this end, we propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text. Specifically, we factor the T2MVid problem into viewpoint-space and time components. Such factorization allows us to combine and reuse layers of advanced pre-trained multi-view image and 2D video diffusion models to ensure multi-view consistency as well as temporal coherence for the generated multi-view videos, largely reducing the training cost. We further introduce alignment modules to align the latent spaces of layers from the pre-trained multi-view and the 2D video diffusion models, addressing the reused layers' incompatibility that arises from the domain gap between 2D and multi-view data. In support of this and future research, we further contribute a captioned multi-view video dataset. Experimental results demonstrate that our method generates high-quality multi-view videos, exhibiting vivid motions, temporal coherence, and multi-view consistency, given a variety of text prompts.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_08659
institution	arXiv
publishDate	2024
record_format	arxiv
title	Vivid-ZOO: Multi-View Video Generation with Diffusion Model
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2406.08659
