MARC21 :: Library Catalog

Bibliographic Details
Main Authors:	Niu, Yuwei; Ning, Munan; Zheng, Mengren; Jin, Weiyang; Lin, Bin; Jin, Peng; Liao, Jiaqi; Feng, Chaoran; Ning, Kunpeng; Zhu, Bin; Yuan, Li
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; I.2.7; I.2.10; I.4.9
Online Access:	https://arxiv.org/abs/2503.07265

_version_	1866909912659918848
author	Niu, Yuwei; Ning, Munan; Zheng, Mengren; Jin, Weiyang; Lin, Bin; Jin, Peng; Liao, Jiaqi; Feng, Chaoran; Ning, Kunpeng; Zhu, Bin; Yuan, Li
contents	Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) on these prompts, our findings reveal significant limitations in the models' ability to integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_07265
institution	arXiv
publishDate	2025
record_format	arxiv
title	WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; I.2.7; I.2.10; I.4.9
url	https://arxiv.org/abs/2503.07265

Similar Items