| Main Authors: | Liu, Zeyu, Ni, Zanlin, Yue, Yang, Da, Cheng, Yang, Huan, Zhang, Di, Gai, Kun, Huang, Gao |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.05781 |
Similar Items
ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation
by: Zhang, Xu, et al.
Published: (2026)
CODA: Repurposing Continuous VAEs for Discrete Tokenization
by: Liu, Zeyu, et al.
Published: (2025)
Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model
by: Zhou, Renping, et al.
Published: (2025)
InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
by: Yue, Yang, et al.
Published: (2026)
UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation
by: Zhang, Chi, et al.
Published: (2025)
UniVideo: Unified Understanding, Generation, and Editing for Videos
by: Wei, Cong, et al.
Published: (2025)
Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models
by: Guo, Jiayi, et al.
Published: (2026)
Unified Reward Model for Multimodal Understanding and Generation
by: Wang, Yibin, et al.
Published: (2025)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
by: Jin, Yang, et al.
Published: (2024)
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
by: Wu, Size, et al.
Published: (2025)
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
by: Jin, Yang, et al.
Published: (2023)
VINO: A Unified Visual Generator with Interleaved OmniModal Context
by: Chen, Junyi, et al.
Published: (2026)
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
by: Zhao, Yue, et al.
Published: (2025)
Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors
by: Dong, Shiyin, et al.
Published: (2024)
Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models
by: Ye, Zixuan, et al.
Published: (2025)
AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation
by: Ni, Zanlin, et al.
Published: (2024)
Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
by: Zhang, Tao, et al.
Published: (2025)
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
by: Xie, Wulin, et al.
Published: (2025)
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
by: Xu, Yiyan, et al.
Published: (2026)
UniMesh: Unifying 3D Mesh Understanding and Generation
by: Huang, Peng, et al.
Published: (2026)
UM-Text: A Unified Multimodal Model for Image Understanding and Visual Text Editing
by: Ma, Lichen, et al.
Published: (2026)
UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
by: Jiao, Yang, et al.
Published: (2025)
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
by: Zhuang, Xianwei, et al.
Published: (2025)
PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth
by: Jin, Bu, et al.
Published: (2025)
TexEditor: Structure-Preserving Text-Driven Texture Editing
by: Zhao, Bo, et al.
Published: (2026)
Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation
by: Wang, Peiyu, et al.
Published: (2025)
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
by: Liu, Zhiheng, et al.
Published: (2025)
HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation
by: Xiao, Yicheng, et al.
Published: (2025)
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
by: Tian, Changyao, et al.
Published: (2026)
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
by: Zhang, Huichao, et al.
Published: (2026)
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
by: Diao, Haiwen, et al.
Published: (2026)
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
by: Du, Sinan, et al.
Published: (2025)
Unified Multimodal Understanding via Byte-Pair Visual Encoding
by: Zhang, Wanpeng, et al.
Published: (2025)
Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models
by: Pan, Jiadong, et al.
Published: (2026)
MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale
by: Gai, Xiaotang, et al.
Published: (2024)
UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
by: Tian, Rui, et al.
Published: (2025)
AdaGen: Learning Adaptive Policy for Image Synthesis
by: Ni, Zanlin, et al.
Published: (2026)
AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation
by: Wang, Le, et al.
Published: (2025)
Unsafe by Reciprocity: How Generation-Understanding Coupling Undermines Safety in Unified Multimodal Models
by: Wang, Kaishen, et al.
Published: (2026)
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
by: Wu, Chengyue, et al.
Published: (2024)