Staff View: MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion :: Library Catalog

Bibliographic Details
Main Authors:	Xu, Xunnong; Cao, Mengying
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2412.09828

_version_	1866916521615294464
author	Xu, Xunnong; Cao, Mengying
author_facet	Xu, Xunnong; Cao, Mengying
contents	Diffusion transformers enable flexible generative modeling for video. However, it remains technically challenging and computationally expensive to generate high-resolution videos with rich semantics and complex motion. Like language, video data are auto-regressive by nature, so it is counter-intuitive to use an attention mechanism with bi-directional dependency in the model. Here we propose a Multi-Scale Causal (MSC) framework to address these problems. Specifically, we introduce multiple resolutions in the spatial dimension and high and low frequencies in the temporal dimension to make attention calculation efficient. Furthermore, attention blocks at multiple scales are combined in a controlled way to allow causal conditioning on noisy image frames during diffusion training, based on the idea that noise destroys information at different rates at different resolutions. We show theoretically that our approach can greatly reduce computational complexity and improve training efficiency. The causal attention diffusion framework can also be used for auto-regressive long video generation without violating the natural order of frame sequences.
format	Preprint
id	arxiv_https___arxiv_org_abs_2412_09828
institution	arXiv
publishDate	2024
record_format	arxiv
title	MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2412.09828