Bibliographic Details
Main Authors: Li, Bing; Zheng, Cheng; Zhu, Wenxuan; Mai, Jinjie; Zhang, Biao; Wonka, Peter; Ghanem, Bernard
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2406.08659
author Li, Bing
Zheng, Cheng
Zhu, Wenxuan
Mai, Jinjie
Zhang, Biao
Wonka, Peter
Ghanem, Bernard
contents While diffusion models have shown impressive performance in 2D image/video generation, diffusion-based Text-to-Multi-view-Video (T2MVid) generation remains underexplored. The new challenges posed by T2MVid generation lie in the lack of massive captioned multi-view videos and the complexity of modeling such a multi-dimensional distribution. To this end, we propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text. Specifically, we factor the T2MVid problem into viewpoint-space and time components. Such factorization allows us to combine and reuse layers of advanced pre-trained multi-view image and 2D video diffusion models to ensure multi-view consistency as well as temporal coherence for the generated multi-view videos, largely reducing the training cost. We further introduce alignment modules to align the latent spaces of layers from the pre-trained multi-view and the 2D video diffusion models, addressing the reused layers' incompatibility that arises from the domain gap between 2D and multi-view data. In support of this and future research, we further contribute a captioned multi-view video dataset. Experimental results demonstrate that our method generates high-quality multi-view videos, exhibiting vivid motions, temporal coherence, and multi-view consistency, given a variety of text prompts.
format Preprint
id arxiv_https___arxiv_org_abs_2406_08659
institution arXiv
publishDate 2024
record_format arxiv
title Vivid-ZOO: Multi-View Video Generation with Diffusion Model
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2406.08659