Staff View: Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

Bibliographic Details
Main Authors:	Zhang, David Junhao; Li, Dongxu; Le, Hung; Shou, Mike Zheng; Xiong, Caiming; Sahoo, Doyen
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2401.01827

_version_	1866909060279828480
author	Zhang, David Junhao; Li, Dongxu; Le, Hung; Shou, Mike Zheng; Xiong, Caiming; Sahoo, Doyen
author_facet	Zhang, David Junhao; Li, Dongxu; Le, Hung; Shou, Mike Zheng; Xiong, Caiming; Sahoo, Doyen
contents	Most existing video diffusion models (VDMs) are limited to text conditions alone. As a result, they usually lack control over the visual appearance and geometric structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatial-temporal layers for representing video features and a decoupled cross-attention layer that handles image and text inputs for appearance conditioning. In addition, we carefully design the model architecture so that it can optionally integrate with pre-trained image ControlNet modules for geometric visual conditions, without the extra training overhead that prior methods require. Experiments show that, with versatile multimodal conditioning mechanisms, Moonshot demonstrates significant improvement in visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation, and video editing, unveiling its potential to serve as a fundamental architecture for controllable video generation. Models will be made public on https://github.com/salesforce/LAVIS.
format	Preprint
id	arxiv_https___arxiv_org_abs_2401_01827
institution	arXiv
publishDate	2024
record_format	arxiv
title	Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2401.01827
