| Main Authors: | Sarkar, Ayushman; Yu, Zhenyu; Chen, Chu; Tang, Wei; Cui, Kangning; Idris, Mohd Yamani Idna |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2602.01303 |
| _version_ | 1866911414278422528 |
|---|---|
| author | Sarkar, Ayushman; Yu, Zhenyu; Chen, Chu; Tang, Wei; Cui, Kangning; Idris, Mohd Yamani Idna |
| author_facet | Sarkar, Ayushman; Yu, Zhenyu; Chen, Chu; Tang, Wei; Cui, Kangning; Idris, Mohd Yamani Idna |
| contents | Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while maintaining prompt fidelity. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics. Code is available at: https://github.com/YuZhenyuLindy/ReDiStory |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_01303 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2602.01303 |
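
The abstract describes two steps: decomposing text embeddings into identity-related and frame-specific components, then decorrelating frame embeddings by suppressing directions shared across frames. The NumPy sketch below illustrates that general idea only. It is not the authors' implementation (see the linked repository for that), and it makes several simplifying assumptions: each frame prompt is a single pooled vector rather than a token sequence, the identity component is a projection onto one identity vector, and the "shared direction" is taken to be the mean of the frame-specific residuals. The function names and the `strength` parameter are hypothetical.

```python
import numpy as np


def split_identity(frame_emb: np.ndarray, identity_emb: np.ndarray):
    """Split frame embeddings (n_frames, dim) into a component along the
    identity direction and a frame-specific residual orthogonal to it.
    Assumption: identity is modeled as a single direction."""
    u = identity_emb / np.linalg.norm(identity_emb)
    coeff = frame_emb @ u                 # projection coefficients, (n_frames,)
    identity_part = np.outer(coeff, u)    # (n_frames, dim)
    frame_part = frame_emb - identity_part
    return identity_part, frame_part


def suppress_shared(frame_part: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Damp the direction the frame residuals have in common.
    Assumption: the shared direction is the normalized mean residual;
    strength=1.0 removes it fully, 0.0 leaves the input unchanged."""
    mean = frame_part.mean(axis=0)
    norm = np.linalg.norm(mean)
    if norm < 1e-12:                      # nothing shared to remove
        return frame_part.copy()
    d = mean / norm
    return frame_part - strength * np.outer(frame_part @ d, d)


def mean_offdiag_cosine(x: np.ndarray) -> float:
    """Average pairwise cosine similarity between distinct rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = x.shape[0]
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    identity = rng.normal(size=16)
    shared = rng.normal(size=16)          # synthetic cross-frame overlap
    frames = identity + 3.0 * shared + 0.5 * rng.normal(size=(4, 16))

    identity_part, frame_part = split_identity(frames, identity)
    decorrelated = suppress_shared(frame_part)
    reorganized = identity_part + decorrelated   # embeddings passed downstream

    print("mean cosine before:", mean_offdiag_cosine(frame_part))
    print("mean cosine after: ", mean_offdiag_cosine(decorrelated))
```

Because the residuals are orthogonal to the identity direction, removing their shared direction leaves the identity component untouched in this toy setting. Whether the same holds for real text-encoder outputs depends on how the identity and frame components are actually defined in the paper.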