Bibliographic Details
Main Authors: Kang, Taewon, Kothandaraman, Divya, Lin, Ming C.
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2503.06310
_version_ 1866911697677058048
author Kang, Taewon
Kothandaraman, Divya
Lin, Ming C.
author_facet Kang, Taewon
Kothandaraman, Divya
Lin, Ming C.
contents Generating coherent long-form video sequences from discrete text prompts remains challenging due to difficulties in maintaining temporal coherence, semantic consistency, and scene-action continuity across segments. We propose a novel storytelling framework that integrates scene and action prompts through dynamics-inspired prompt mixing. Our approach combines three key components: (i) a bidirectional time-weighted latent blending strategy that enforces temporal consistency between consecutive video segments, (ii) a dynamics-informed prompt weighting (DIPW) mechanism that adaptively balances scene and action prompts at each diffusion timestep based on CLIP-based alignment, narrative progression, and temporal smoothness, and (iii) a semantic action representation that encodes high-level action semantics to modulate transitions according to action similarity. Latent-space blending preserves spatial coherence within scenes, while time-weighted blending introduces bidirectional temporal constraints to prevent abrupt transitions. Together, these components enable fluid and coherent video narratives that faithfully reflect both scene context and action dynamics. Extensive experiments demonstrate that our method significantly outperforms baselines, producing temporally consistent and visually compelling long-form videos without any additional training, thereby bridging the gap between short clips and extended text-driven video storytelling.
format Preprint
id arxiv_https___arxiv_org_abs_2503_06310
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling
Kang, Taewon
Kothandaraman, Divya
Lin, Ming C.
Computer Vision and Pattern Recognition
title Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2503.06310