| Main Authors: | Li, Bo, Yin, Yida, Chai, Wenhao, Fu, Xingyu, Liu, Zhuang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.22155 |
Similar Items
VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images
by: Zhou, Guanyu, et al.
Published: (2026)
From Reasoning to Pixels: Benchmarking the Alignment Gap in Unified Multimodal Models
by: Yang, Cheng, et al.
Published: (2026)
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
by: Chen, Liang, et al.
Published: (2025)
Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input
by: Li, Chenxu, et al.
Published: (2025)
UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
by: Peng, Xiangyu, et al.
Published: (2025)
MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning
by: Jiang, Yulun, et al.
Published: (2025)
MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
by: Gan, Ziliang, et al.
Published: (2024)
MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
by: Ruan, Jiacheng, et al.
Published: (2025)
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
by: Jiang, Houcheng, et al.
Published: (2026)
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
by: Niu, Yuwei, et al.
Published: (2025)
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
by: Ma, David, et al.
Published: (2025)
GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling
by: Li, Siqi, et al.
Published: (2025)
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
by: Zhang, Ge, et al.
Published: (2024)
DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
by: Zhou, Yu, et al.
Published: (2025)
VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing
by: Su, Xiaoyan, et al.
Published: (2026)
Generative Modeling of Weights: Generalization or Memorization?
by: Zeng, Boya, et al.
Published: (2025)
Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
by: Jiang, Jingjing, et al.
Published: (2025)
UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models
by: Lee, Segyu, et al.
Published: (2026)
READoc: A Unified Benchmark for Realistic Document Structured Extraction
by: Li, Zichao, et al.
Published: (2024)
MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs
by: Liu, Xuannan, et al.
Published: (2024)
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
by: Hu, Yushi, et al.
Published: (2024)
UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning
by: Tang, Hongxuan, et al.
Published: (2025)
Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering
by: Si, Chenglei, et al.
Published: (2024)
MER-Bench: A Comprehensive Benchmark for Multimodal Meme Reappraisal
by: Nie, Yiqi, et al.
Published: (2026)
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
by: Wu, Chengyue, et al.
Published: (2024)
MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency
by: Zhang, Junzhe, et al.
Published: (2024)
UniAIDet: A Unified and Universal Benchmark for AI-Generated Image Content Detection and Localization
by: Zhang, Huixuan, et al.
Published: (2025)
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
by: Yue, Xiang, et al.
Published: (2023)
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
by: Chen, Xiaokang, et al.
Published: (2025)
UniChange: Unifying Change Detection with Multimodal Large Language Model
by: Zhang, Xu, et al.
Published: (2025)
Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation
by: Zhou, Li, et al.
Published: (2025)
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
by: Li, Shilong, et al.
Published: (2025)
TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity
by: Yang, Zheyuan, et al.
Published: (2026)
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos
by: Zhu, Kejian, et al.
Published: (2025)
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
by: Li, Jiaang, et al.
Published: (2025)
Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text
by: Rahman, Mizanur, et al.
Published: (2025)
Train a Unified Multimodal Data Quality Classifier with Synthetic Data
by: Wang, Weizhi, et al.
Published: (2025)
Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries
by: Wu, Yin, et al.
Published: (2025)
Insight-A: Attribution-aware for Multimodal Misinformation Detection
by: Wu, Junjie, et al.
Published: (2025)
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
by: Ma, Yiyang, et al.
Published: (2024)