Staff View: World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge :: Library Catalog

Bibliographic Details
Main Authors:	Son, Moo Hyun; Oh, Jintaek; Mun, Sun Bin; Roh, Jaechul; Choi, Sehyun
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2510.04201

_version_	1866909825596653568
author	Son, Moo Hyun; Oh, Jintaek; Mun, Sun Bin; Roh, Jaechul; Choi, Sehyun
author_facet	Son, Moo Hyun; Oh, Jintaek; Mun, Sun Bin; Roh, Jaechul; Choi, Sehyun
contents	While text-to-image (T2I) models can synthesize high-quality images, their performance degrades significantly when prompted with novel or out-of-distribution (OOD) entities due to inherent knowledge cutoffs. We introduce World-To-Image, a novel framework that bridges this gap by empowering T2I generation with agent-driven world knowledge. We design an agent that dynamically searches the web to retrieve images for concepts unknown to the base model. This information is then used to perform multimodal prompt optimization, steering powerful generative backbones toward an accurate synthesis. Critically, our evaluation goes beyond traditional metrics, utilizing modern assessments like LLMGrader and ImageReward to measure true semantic fidelity. Our experiments show that World-To-Image substantially outperforms state-of-the-art methods in both semantic alignment and visual aesthetics, achieving a +8.1% improvement in accuracy-to-prompt on our curated NICE benchmark. Our framework achieves these results with high efficiency, in fewer than three iterations, paving the way for T2I systems that can better reflect the ever-changing real world. Our demo code is available at https://github.com/mhson-kyle/World-To-Image.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_04201
institution	arXiv
publishDate	2025
record_format	arxiv
title	World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2510.04201

Similar Items