Bibliographic Details
Main Authors: Kwon, Patrick; Chen, Chen
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2512.01686
_version_ 1866915647070404608
author Kwon, Patrick
Chen, Chen
author_facet Kwon, Patrick
Chen, Chen
contents Current story visualization methods tend to position subjects solely by text and face challenges in maintaining artistic consistency. To address these limitations, we introduce DreamingComics, a layout-aware story visualization framework. We build upon a pretrained video diffusion-transformer (DiT) model, leveraging its spatiotemporal priors to enhance identity and style consistency. For layout-based position control, we propose RegionalRoPE, a region-aware positional encoding scheme that re-indexes embeddings based on the target layout. Additionally, we introduce a masked condition loss to further constrain each subject's visual features to their designated region. To infer layouts from natural language scripts, we integrate an LLM-based layout generator trained to produce comic-style layouts, enabling flexible and controllable layout conditioning. We present a comprehensive evaluation of our approach, showing a 29.2% increase in character consistency and a 36.2% increase in style similarity compared to previous methods, while displaying high spatial accuracy. Our project page is available at https://yj7082126.github.io/dreamingcomics/
format Preprint
id arxiv_https___arxiv_org_abs_2512_01686
institution arXiv
publishDate 2025
record_format arxiv
title DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2512.01686