Bibliographic Details
Main Authors: Chen, Haoyu; Liu, Qing; Zhou, Yuqian; Zhang, He; Wang, Zhaowen; Ren, Mengwei; Ren, Jingjing; Wang, Xiang; Lin, Zhe; Zhu, Lei
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2603.07540
_version_ 1866908871840235520
author Chen, Haoyu
Liu, Qing
Zhou, Yuqian
Zhang, He
Wang, Zhaowen
Ren, Mengwei
Ren, Jingjing
Wang, Xiang
Lin, Zhe
Zhu, Lei
contents Unified multimodal models hold the promise of generating extensive, interleaved narratives, weaving text and imagery into coherent long-form stories. However, current systems suffer from a critical reliability gap: as sequences grow, generation quality rapidly collapses. In this work, we investigate the mechanism behind this failure and argue that it is distinct from standard long-context challenges. We reveal that in generation, accumulated visual history acts as a source of active pollution, a decay governed specifically by the number of image events rather than raw token count. We identify a structural vulnerability where dense visual tokens overwhelm the attention mechanism, creating noise that distorts future synthesis. Guided by these mechanistic insights, we propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall. Instead of retaining all history, UniLongGen dynamically curates the model's memory, identifying and discarding interfering visual signals based on the model's own internal relevance rankings. Extensive experiments demonstrate that this active forgetting approach is essential for stability: UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency, while simultaneously reducing memory footprint and inference time.
format Preprint
id arxiv_https___arxiv_org_abs_2603_07540
institution arXiv
publishDate 2026
record_format arxiv
title How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2603.07540