Staff View: DynVFX: Augmenting Real Videos with Dynamic Content :: Library Catalog

Bibliographic Details
Main Authors:	Yatim, Danah; Fridman, Rafail; Bar-Tal, Omer; Dekel, Tali
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2502.03621

_version_	1866908600952160256
author	Yatim, Danah; Fridman, Rafail; Bar-Tal, Omer; Dekel, Tali
author_facet	Yatim, Danah; Fridman, Rafail; Bar-Tal, Omer; Dekel, Tali
contents	We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained vision-language model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_03621
institution	arXiv
publishDate	2025
record_format	arxiv
title	DynVFX: Augmenting Real Videos with Dynamic Content
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2502.03621

Similar Items