| Main Authors: | Zhang, Yabo, Li, Kunchang, Zhou, Dewei, Huang, Xinyu, Wang, Xun |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.12305 |
Similar Items
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
by: Liao, Chao, et al.
Published: (2025)
UniFormer: Unifying Convolution and Self-attention for Visual Recognition
by: Li, Kunchang, et al.
Published: (2022)
UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation
by: Liu, Jie, et al.
Published: (2026)
Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment
by: Chen, Dongping, et al.
Published: (2024)
Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling
by: Zhou, Chao, et al.
Published: (2025)
VINO: A Unified Visual Generator with Interleaved OmniModal Context
by: Chen, Junyi, et al.
Published: (2026)
OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions
by: Bu, Wendong, et al.
Published: (2025)
Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners
by: Liu, Qingyang, et al.
Published: (2026)
DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
by: Zhou, Dewei, et al.
Published: (2025)
Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization
by: Nie, Ming, et al.
Published: (2026)
MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis
by: Zhou, Dewei, et al.
Published: (2024)
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
by: Tian, Changyao, et al.
Published: (2024)
Multi-Sentence Grounding for Long-term Instructional Video
by: Li, Zeqian, et al.
Published: (2023)
MANTIS: Interleaved Multi-Image Instruction Tuning
by: Jiang, Dongfu, et al.
Published: (2024)
MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis
by: Zhou, Dewei, et al.
Published: (2024)
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
by: Lin, Weifeng, et al.
Published: (2024)
How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation
by: Chen, Haoyu, et al.
Published: (2026)
VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model
by: Li, Xinhao, et al.
Published: (2024)
3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation
by: Zhou, Dewei, et al.
Published: (2024)
UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
by: Li, Yanlin, et al.
Published: (2026)
SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL
by: Wang, Junke, et al.
Published: (2025)
BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment
by: Zhou, Dewei, et al.
Published: (2025)
Holistic Evaluation for Interleaved Text-and-Image Generation
by: Liu, Minqian, et al.
Published: (2024)
Show Me: Unifying Instructional Image and Video Generation with Diffusion Models
by: Pu, Yujiang, et al.
Published: (2025)
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
by: Guo, Ziyu, et al.
Published: (2025)
A Unified and Controllable Framework for Layered Image Generation with Visual Effects
by: Yang, Jinrui, et al.
Published: (2026)
Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
by: Zhang, Lei, et al.
Published: (2026)
MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration
by: Ding, Yanbo, et al.
Published: (2024)
MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
by: Chen, Xinyan, et al.
Published: (2025)
Causal Diffusion Transformers for Generative Modeling
by: Deng, Chaorui, et al.
Published: (2024)
A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
by: Feng, Yukang, et al.
Published: (2025)
Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval
by: Du, Yongchao, et al.
Published: (2024)
ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
by: Wang, Lihong, et al.
Published: (2025)
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
by: Li, Qingyun, et al.
Published: (2024)
ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors
by: Yu, Haodong, et al.
Published: (2026)
LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
by: Wang, Zeyu, et al.
Published: (2025)
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
by: Chow, Wei, et al.
Published: (2025)
LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
by: Li, Zhang, et al.
Published: (2025)
Interleaved Latent Visual Reasoning with Selective Perceptual Modeling
by: Dong, Shuai, et al.
Published: (2025)
Unified Multimodal Visual Tracking with Dual Mixture-of-Experts
by: Hong, Lingyi, et al.
Published: (2026)