Library Catalog

Bibliographic Details
Main Authors:	Liu, Zeyu; Ni, Zanlin; Yue, Yang; Da, Cheng; Yang, Huan; Zhang, Di; Gai, Kun; Huang, Gao
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.05781

Similar Items

ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation
by: Zhang, Xu, et al.
Published: (2026)

CODA: Repurposing Continuous VAEs for Discrete Tokenization
by: Liu, Zeyu, et al.
Published: (2025)

Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model
by: Zhou, Renping, et al.
Published: (2025)

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
by: Yue, Yang, et al.
Published: (2026)

UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation
by: Zhang, Chi, et al.
Published: (2025)

UniVideo: Unified Understanding, Generation, and Editing for Videos
by: Wei, Cong, et al.
Published: (2025)

Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models
by: Guo, Jiayi, et al.
Published: (2026)

Unified Reward Model for Multimodal Understanding and Generation
by: Wang, Yibin, et al.
Published: (2025)

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
by: Jin, Yang, et al.
Published: (2024)

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
by: Wu, Size, et al.
Published: (2025)

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
by: Jin, Yang, et al.
Published: (2023)

VINO: A Unified Visual Generator with Interleaved OmniModal Context
by: Chen, Junyi, et al.
Published: (2026)

QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
by: Zhao, Yue, et al.
Published: (2025)

Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors
by: Dong, Shiyin, et al.
Published: (2024)

Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models
by: Ye, Zixuan, et al.
Published: (2025)

AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation
by: Ni, Zanlin, et al.
Published: (2024)

Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
by: Zhang, Tao, et al.
Published: (2025)

MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
by: Xie, Wulin, et al.
Published: (2025)

UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
by: Xu, Yiyan, et al.
Published: (2026)

UniMesh: Unifying 3D Mesh Understanding and Generation
by: Huang, Peng, et al.
Published: (2026)

UM-Text: A Unified Multimodal Model for Image Understanding and Visual Text Editing
by: Ma, Lichen, et al.
Published: (2026)

UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
by: Jiao, Yang, et al.
Published: (2025)

VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
by: Zhuang, Xianwei, et al.
Published: (2025)

PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth
by: Jin, Bu, et al.
Published: (2025)

TexEditor: Structure-Preserving Text-Driven Texture Editing
by: Zhao, Bo, et al.
Published: (2026)

Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation
by: Wang, Peiyu, et al.
Published: (2025)

TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
by: Liu, Zhiheng, et al.
Published: (2025)

HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation
by: Xiao, Yicheng, et al.
Published: (2025)

InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
by: Tian, Changyao, et al.
Published: (2026)

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
by: Zhang, Huichao, et al.
Published: (2026)

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
by: Diao, Haiwen, et al.
Published: (2026)

VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
by: Du, Sinan, et al.
Published: (2025)

Unified Multimodal Understanding via Byte-Pair Visual Encoding
by: Zhang, Wanpeng, et al.
Published: (2025)

Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models
by: Pan, Jiadong, et al.
Published: (2026)

MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale
by: Gai, Xiaotang, et al.
Published: (2024)

UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
by: Tian, Rui, et al.
Published: (2025)

AdaGen: Learning Adaptive Policy for Image Synthesis
by: Ni, Zanlin, et al.
Published: (2026)

AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation
by: Wang, Le, et al.
Published: (2025)

Unsafe by Reciprocity: How Generation-Understanding Coupling Undermines Safety in Unified Multimodal Models
by: Wang, Kaishen, et al.
Published: (2026)

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
by: Wu, Chengyue, et al.
Published: (2024)