Staff View: Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling :: Library Catalog

Bibliographic Details
Main Authors:	Kang, Taewon; Kothandaraman, Divya; Lin, Ming C.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.06310

_version_	1866911697677058048
author	Kang, Taewon Kothandaraman, Divya Lin, Ming C.
author_facet	Kang, Taewon Kothandaraman, Divya Lin, Ming C.
contents	Generating coherent long-form video sequences from discrete text prompts remains challenging due to difficulties in maintaining temporal coherence, semantic consistency, and scene-action continuity across segments. We propose a novel storytelling framework that integrates scene and action prompts through dynamics-inspired prompt mixing. Our approach combines three key components: (i) a bidirectional time-weighted latent blending strategy that enforces temporal consistency between consecutive video segments, (ii) a dynamics-informed prompt weighting (DIPW) mechanism that adaptively balances scene and action prompts at each diffusion timestep based on CLIP-based alignment, narrative progression, and temporal smoothness, and (iii) a semantic action representation that encodes high-level action semantics to modulate transitions according to action similarity. Latent-space blending preserves spatial coherence within scenes, while time-weighted blending introduces bidirectional temporal constraints to prevent abrupt transitions. Together, these components enable fluid and coherent video narratives that faithfully reflect both scene context and action dynamics. Extensive experiments demonstrate that our method significantly outperforms baselines, producing temporally consistent and visually compelling long-form videos without any additional training, thereby bridging the gap between short clips and extended text-driven video storytelling.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_06310
institution	arXiv
publishDate	2025
record_format	arxiv
title	Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.06310

Similar Items