Bibliographic Details
Main Authors: Li, Bo; Yin, Yida; Chai, Wenhao; Fu, Xingyu; Liu, Zhuang
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition; Computation and Language
Online Access: https://arxiv.org/abs/2601.22155
author Li, Bo
Yin, Yida
Chai, Wenhao
Fu, Xingyu
Liu, Zhuang
contents We introduce UEval, a benchmark to evaluate unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. The curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, because simple LLM-as-a-judge methods can miss subtleties. Unlike previous work that relies on multimodal large language models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system in UEval. For each question, reference images and text answers are provided to an MLLM, which generates an initial rubric consisting of multiple evaluation criteria; human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and that transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.
format Preprint
id arxiv_https___arxiv_org_abs_2601_22155
institution arXiv
publishDate 2026
record_format arxiv
title UEval: A Benchmark for Unified Multimodal Generation
topic Computer Vision and Pattern Recognition
Computation and Language
url https://arxiv.org/abs/2601.22155
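
The rubric-based scoring described in the abstract can be illustrated with a minimal sketch. The record does not say how criteria are weighted or how per-question scores are combined, so everything below is an assumption made for illustration: equal weighting of criteria, a simple mean across questions, and the names `Criterion`, `score_answer`, `benchmark_score`, and `keyword_judge` are all hypothetical. The toy judge only inspects text; in UEval the judgment would come from an MLLM that also sees the generated image.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric criterion. The wording used below is illustrative, not from UEval."""
    description: str


def score_answer(criteria: List[Criterion], answer: str,
                 judge: Callable[[Criterion, str], bool]) -> float:
    """Return the share of criteria the judge marks as met, on a 0-100 scale.

    Equal weighting of criteria is an assumption, not a detail from the paper.
    """
    if not criteria:
        raise ValueError("a rubric needs at least one criterion")
    met = sum(1 for criterion in criteria if judge(criterion, answer))
    return 100.0 * met / len(criteria)


def benchmark_score(per_question_scores: List[float]) -> float:
    """Average per-question scores (assumed aggregation: an unweighted mean)."""
    if not per_question_scores:
        raise ValueError("no question scores to aggregate")
    return sum(per_question_scores) / len(per_question_scores)


def keyword_judge(criterion: Criterion, answer: str) -> bool:
    """Toy stand-in for an MLLM judge: a criterion counts as met when its
    description appears verbatim (case-insensitively) in the answer text."""
    return criterion.description.lower() in answer.lower()


if __name__ == "__main__":
    rubric = [Criterion("preheat"), Criterion("diagram"), Criterion("timer")]
    answer = "Preheat the oven, then set a timer."
    # Two of the three criteria are met by this answer.
    print(round(score_answer(rubric, answer, keyword_judge), 1))
```

Passing the judge in as a callable keeps the scoring arithmetic separate from the model that decides whether a criterion is met, so the keyword stub could be swapped for a real MLLM call without changing the aggregation code.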