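Staff View: How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation :: Library Catalog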

Bibliographic Details
Main Authors:	Chen, Haoyu; Liu, Qing; Zhou, Yuqian; Zhang, He; Wang, Zhaowen; Ren, Mengwei; Ren, Jingjing; Wang, Xiang; Lin, Zhe; Zhu, Lei
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.07540

_version_	1866908871840235520
author	Chen, Haoyu; Liu, Qing; Zhou, Yuqian; Zhang, He; Wang, Zhaowen; Ren, Mengwei; Ren, Jingjing; Wang, Xiang; Lin, Zhe; Zhu, Lei
author_facet	Chen, Haoyu; Liu, Qing; Zhou, Yuqian; Zhang, He; Wang, Zhaowen; Ren, Mengwei; Ren, Jingjing; Wang, Xiang; Lin, Zhe; Zhu, Lei
contents	Unified multimodal models hold the promise of generating extensive, interleaved narratives, weaving text and imagery into coherent long-form stories. However, current systems suffer from a critical reliability gap: as sequences grow, generation quality rapidly collapses. In this work, we investigate the mechanism behind this failure and argue that it is distinct from standard long-context challenges. We reveal that in generation, accumulated visual history acts as a source of active pollution, a decay governed specifically by the number of image events rather than raw token count. We identify a structural vulnerability where dense visual tokens overwhelm the attention mechanism, creating noise that distorts future synthesis. Guided by these mechanistic insights, we propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall. Instead of retaining all history, UniLongGen dynamically curates the model's memory, identifying and discarding interfering visual signals based on the model's own internal relevance rankings. Extensive experiments demonstrate that this active forgetting approach is essential for stability: UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency, while simultaneously reducing memory footprint and inference time.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_07540
institution	arXiv
publishDate	2026
record_format	arxiv
title	How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2603.07540
