Staff View: UEval: A Benchmark for Unified Multimodal Generation :: Library Catalog

Bibliographic Details
Main Authors:	Li, Bo; Yin, Yida; Chai, Wenhao; Fu, Xingyu; Liu, Zhuang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2601.22155

_version_	1866915761379868672
author	Li, Bo; Yin, Yida; Chai, Wenhao; Fu, Xingyu; Liu, Zhuang
author_facet	Li, Bo; Yin, Yida; Chai, Wenhao; Fu, Xingyu; Liu, Zhuang
contents	We introduce UEval, a benchmark to evaluate unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss the subtleties. Unlike previous work that relies on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system in UEval. For each question, reference images and text answers are provided to an MLLM to generate an initial rubric consisting of multiple evaluation criteria, and human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_22155
institution	arXiv
publishDate	2026
record_format	arxiv
title	UEval: A Benchmark for Unified Multimodal Generation
topic	Computer Vision and Pattern Recognition; Computation and Language
url	https://arxiv.org/abs/2601.22155

Similar Items