:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Yabo; Li, Kunchang; Zhou, Dewei; Huang, Xinyu; Wang, Xun
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.12305

Similar Items

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
by: Liao, Chao, et al.
Published: (2025)

UniFormer: Unifying Convolution and Self-attention for Visual Recognition
by: Li, Kunchang, et al.
Published: (2022)

UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation
by: Liu, Jie, et al.
Published: (2026)

Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment
by: Chen, Dongping, et al.
Published: (2024)

Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling
by: Zhou, Chao, et al.
Published: (2025)

VINO: A Unified Visual Generator with Interleaved OmniModal Context
by: Chen, Junyi, et al.
Published: (2026)

OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions
by: Bu, Wendong, et al.
Published: (2025)

Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners
by: Liu, Qingyang, et al.
Published: (2026)

DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
by: Zhou, Dewei, et al.
Published: (2025)

Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization
by: Nie, Ming, et al.
Published: (2026)

MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis
by: Zhou, Dewei, et al.
Published: (2024)

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
by: Tian, Changyao, et al.
Published: (2024)

Multi-Sentence Grounding for Long-term Instructional Video
by: Li, Zeqian, et al.
Published: (2023)

MANTIS: Interleaved Multi-Image Instruction Tuning
by: Jiang, Dongfu, et al.
Published: (2024)

MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis
by: Zhou, Dewei, et al.
Published: (2024)

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
by: Lin, Weifeng, et al.
Published: (2024)

How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation
by: Chen, Haoyu, et al.
Published: (2026)

VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model
by: Li, Xinhao, et al.
Published: (2024)

3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation
by: Zhou, Dewei, et al.
Published: (2024)

UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
by: Li, Yanlin, et al.
Published: (2026)

SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL
by: Wang, Junke, et al.
Published: (2025)

BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment
by: Zhou, Dewei, et al.
Published: (2025)

Holistic Evaluation for Interleaved Text-and-Image Generation
by: Liu, Minqian, et al.
Published: (2024)

Show Me: Unifying Instructional Image and Video Generation with Diffusion Models
by: Pu, Yujiang, et al.
Published: (2025)

Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
by: Guo, Ziyu, et al.
Published: (2025)

A Unified and Controllable Framework for Layered Image Generation with Visual Effects
by: Yang, Jinrui, et al.
Published: (2026)

Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
by: Zhang, Lei, et al.
Published: (2026)

MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration
by: Ding, Yanbo, et al.
Published: (2024)

MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
by: Chen, Xinyan, et al.
Published: (2025)

Causal Diffusion Transformers for Generative Modeling
by: Deng, Chaorui, et al.
Published: (2024)

A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
by: Feng, Yukang, et al.
Published: (2025)

Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval
by: Du, Yongchao, et al.
Published: (2024)

ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
by: Wang, Lihong, et al.
Published: (2025)

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
by: Li, Qingyun, et al.
Published: (2024)

ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors
by: Yu, Haodong, et al.
Published: (2026)

LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
by: Wang, Zeyu, et al.
Published: (2025)

WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
by: Chow, Wei, et al.
Published: (2025)

LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
by: Li, Zhang, et al.
Published: (2025)

Interleaved Latent Visual Reasoning with Selective Perceptual Modeling
by: Dong, Shuai, et al.
Published: (2025)

Unified Multimodal Visual Tracking with Dual Mixture-of-Experts
by: Hong, Lingyi, et al.
Published: (2026)