Bibliographic Details
Main Authors: Sarkar, Ayushman; Yu, Zhenyu; Chen, Chu; Tang, Wei; Cui, Kangning; Idris, Mohd Yamani Idna
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2602.01303
author Sarkar, Ayushman
Yu, Zhenyu
Chen, Chu
Tang, Wei
Cui, Kangning
Idris, Mohd Yamani Idna
contents Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while maintaining prompt fidelity. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics. Code is available at: https://github.com/YuZhenyuLindy/ReDiStory
format Preprint
id arxiv_https___arxiv_org_abs_2602_01303
institution arXiv
publishDate 2026
record_format arxiv
title ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2602.01303
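
The abstract describes two steps: splitting text embeddings into identity-related and frame-specific components, then decorrelating the frame embeddings by suppressing directions they share. The following is a minimal sketch of that idea on toy vectors, not the released ReDiStory implementation. It assumes the identity component is the projection onto a single identity vector and that the "shared direction" is the mean of the frame-specific residuals; the function names, the `strength` parameter, and the example values are all hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def remove_direction(v, u, strength=1.0):
    """Subtract strength * (projection of v onto the unit vector u)."""
    c = dot(v, u)
    return [x - strength * c * y for x, y in zip(v, u)]


def decompose(frame_embs, identity_emb):
    """Split each embedding into a part along the identity direction and a residual.

    Assumption: identity is modelled as one direction in embedding space.
    """
    u = unit(identity_emb)
    identity_parts = [[dot(f, u) * y for y in u] for f in frame_embs]
    frame_parts = [remove_direction(f, u) for f in frame_embs]
    return identity_parts, frame_parts


def suppress_shared(frame_parts, strength=1.0):
    """Damp the component every frame residual shares.

    Assumption: the shared direction is the mean of the residuals.
    """
    n = len(frame_parts)
    mean = [sum(col) / n for col in zip(*frame_parts)]
    if math.sqrt(dot(mean, mean)) < 1e-12:
        # No shared direction to remove.
        return [list(f) for f in frame_parts]
    s = unit(mean)
    return [remove_direction(f, s, strength) for f in frame_parts]


# Toy example: one identity vector and three frame embeddings (made-up values).
identity = [1.0, 0.0, 0.0, 0.0]
frames = [
    [0.9, 0.5, 0.2, 0.0],
    [0.8, 0.4, -0.3, 0.1],
    [1.1, 0.6, 0.0, -0.2],
]
identity_parts, frame_parts = decompose(frames, identity)
decorrelated = suppress_shared(frame_parts)
```

With `strength=1.0` the residuals end up with zero mean, so no direction is common to all frames. A smaller value would keep part of the shared component. How the actual method chooses and weights shared directions is not stated in this record.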