:: Library Catalog

Bibliographic Details
Main Authors:	Li, Bo; Yin, Yida; Chai, Wenhao; Fu, Xingyu; Liu, Zhuang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2601.22155

Similar Items

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images
by: Zhou, Guanyu, et al.
Published: (2026)

From Reasoning to Pixels: Benchmarking the Alignment Gap in Unified Multimodal Models
by: Yang, Cheng, et al.
Published: (2026)

Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
by: Chen, Liang, et al.
Published: (2025)

Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
by: Li, Chenxu, et al.
Published: (2025)

UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
by: Peng, Xiangyu, et al.
Published: (2025)

MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning
by: Jiang, Yulun, et al.
Published: (2025)

MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
by: Gan, Ziliang, et al.
Published: (2024)

MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
by: Ruan, Jiacheng, et al.
Published: (2025)

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
by: Jiang, Houcheng, et al.
Published: (2026)

Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
by: Niu, Yuwei, et al.
Published: (2025)

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
by: Ma, David, et al.
Published: (2025)

GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling
by: Li, Siqi, et al.
Published: (2025)

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
by: Zhang, Ge, et al.
Published: (2024)

DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
by: Zhou, Yu, et al.
Published: (2025)

VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing
by: Su, Xiaoyan, et al.
Published: (2026)

Generative Modeling of Weights: Generalization or Memorization?
by: Zeng, Boya, et al.
Published: (2025)

Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
by: Jiang, Jingjing, et al.
Published: (2025)

UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models
by: Lee, Segyu, et al.
Published: (2026)

READoc: A Unified Benchmark for Realistic Document Structured Extraction
by: Li, Zichao, et al.
Published: (2024)

MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs
by: Liu, Xuannan, et al.
Published: (2024)

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
by: Hu, Yushi, et al.
Published: (2024)

UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning
by: Tang, Hongxuan, et al.
Published: (2025)

Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering
by: Si, Chenglei, et al.
Published: (2024)

MER-Bench: A Comprehensive Benchmark for Multimodal Meme Reappraisal
by: Nie, Yiqi, et al.
Published: (2026)

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
by: Wu, Chengyue, et al.
Published: (2024)

MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency
by: Zhang, Junzhe, et al.
Published: (2024)

UniAIDet: A Unified and Universal Benchmark for AI-Generated Image Content Detection and Localization
by: Zhang, Huixuan, et al.
Published: (2025)

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
by: Yue, Xiang, et al.
Published: (2023)

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
by: Chen, Xiaokang, et al.
Published: (2025)

UniChange: Unifying Change Detection with Multimodal Large Language Model
by: Zhang, Xu, et al.
Published: (2025)

Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation
by: Zhou, Li, et al.
Published: (2025)

MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
by: Li, Shilong, et al.
Published: (2025)

TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity
by: Yang, Zheyuan, et al.
Published: (2026)

MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos
by: Zhu, Kejian, et al.
Published: (2025)

RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
by: Li, Jiaang, et al.
Published: (2025)

Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text
by: Rahman, Mizanur, et al.
Published: (2025)

Train a Unified Multimodal Data Quality Classifier with Synthetic Data
by: Wang, Weizhi, et al.
Published: (2025)

Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries
by: Wu, Yin, et al.
Published: (2025)

Insight-A: Attribution-aware for Multimodal Misinformation Detection
by: Wu, Junjie, et al.
Published: (2025)

JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
by: Ma, Yiyang, et al.
Published: (2024)