Staff View: FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing :: Library Catalog

Bibliographic Details
Main Authors:	Cong, Yuren; Xu, Mengmeng; Simon, Christian; Chen, Shoufa; Ren, Jiawei; Xie, Yanping; Perez-Rua, Juan-Manuel; Rosenhahn, Bodo; Xiang, Tao; He, Sen
Format:	Preprint
Published:	2023
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2310.05922

_version_	1866911786656071680
author	Cong, Yuren; Xu, Mengmeng; Simon, Christian; Chen, Shoufa; Ren, Jiawei; Xie, Yanping; Perez-Rua, Juan-Manuel; Rosenhahn, Bodo; Xiang, Tao; He, Sen
author_facet	Cong, Yuren; Xu, Mengmeng; Simon, Christian; Chen, Shoufa; Ren, Jiawei; Xie, Yanping; Perez-Rua, Juan-Manuel; Rosenhahn, Bodo; Xiang, Tao; He, Sen
contents	Text-to-video editing aims to edit the visual appearance of a source video conditioned on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating the 2D spatial attention in the U-Net into spatio-temporal attention. Although spatio-temporal attention adds temporal context, it may also introduce irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module of the diffusion model's U-Net to address the inconsistency issue in text-to-video editing. Our method, FLATTEN, enforces that patches on the same flow path across different frames attend to each other in the attention module, thus improving the visual consistency of the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing method to improve its visual consistency. Experimental results on existing text-to-video editing benchmarks show that our proposed method achieves new state-of-the-art performance. In particular, our method excels at maintaining visual consistency in the edited videos.
format	Preprint
id	arxiv_https___arxiv_org_abs_2310_05922
institution	arXiv
publishDate	2023
record_format	arxiv
title	FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2310.05922
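
The abstract in the contents field describes restricting attention so that patches on the same optical-flow path attend to each other across frames. The following is a minimal NumPy sketch of that idea, not the authors' implementation. It assumes the flow paths are already available as integer patch indices (the `paths` array is hypothetical, standing in for trajectories derived from an optical flow estimator), that paths do not share patches, and that queries, keys, and values are the raw patch features with no learned projections.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def flow_guided_attention(feats, paths):
    """Self-attention restricted to patches that lie on the same flow path.

    feats: (T, N, D) array, D-dimensional features for N patches in T frames.
    paths: (P, T) integer array; paths[p, t] is the index of the patch that
           path p passes through in frame t (hypothetical input, assumed to
           come from a separate optical-flow step).
    Returns an array shaped like feats. Patches on no path are left unchanged.
    """
    T, N, D = feats.shape
    frame_idx = np.arange(T)[None, :]            # (1, T), broadcasts with paths

    # Gather the T tokens along each path: (P, T, D).
    tokens = feats[frame_idx, paths]

    # Scaled dot-product attention among tokens of the same path only.
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(D)   # (P, T, T)
    weights = softmax(scores, axis=-1)
    attended = weights @ tokens                                 # (P, T, D)

    # Scatter the attended tokens back to their frame/patch positions.
    out = feats.copy()
    out[frame_idx, paths] = attended
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(3, 4, 2))   # 3 frames, 4 patches, 2-dim features
    paths = np.array([[0, 1, 2]])        # one path: patch 0 -> 1 -> 2
    print(flow_guided_attention(feats, paths).shape)
```

Because each token only sees the other tokens on its own path, the attention matrix per path is T x T rather than spanning every patch in every frame, which is one way to read the abstract's point about excluding irrelevant patches.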
