Staff View: Test-time Prompt Refinement for Text-to-Image Models :: Library Catalog

Bibliographic Details
Main Authors:	Khan, Mohammad Abdul Hafeez; Jain, Yash; Bhattacharyya, Siddhartha; Vineet, Vibhav
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2507.22076

_version_	1866908471257989120
author	Khan, Mohammad Abdul Hafeez; Jain, Yash; Bhattacharyya, Siddhartha; Vineet, Vibhav
author_facet	Khan, Mohammad Abdul Hafeez; Jain, Yash; Bhattacharyya, Siddhartha; Vineet, Vibhav
contents	Text-to-image (T2I) generation models have made significant strides but still struggle with prompt sensitivity: even minor changes in prompt wording can yield inconsistent or inaccurate outputs. To address this challenge, we introduce a closed-loop, test-time prompt refinement framework that requires no additional training of the underlying T2I model, termed TIR. In our approach, each generation step is followed by a refinement step, where a pretrained multimodal large language model (MLLM) analyzes the output image and the user's prompt. The MLLM detects misalignments (e.g., missing objects, incorrect attributes) and produces a refined and physically grounded prompt for the next round of image generation. By iteratively refining the prompt and verifying alignment between the prompt and the image, TIR corrects errors, mirroring the iterative refinement process of human artists. We demonstrate that this closed-loop strategy improves alignment and visual coherence across multiple benchmark datasets, all while maintaining plug-and-play integration with black-box T2I models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_22076
institution	arXiv
publishDate	2025
record_format	arxiv
title	Test-time Prompt Refinement for Text-to-Image Models
topic	Machine Learning
url	https://arxiv.org/abs/2507.22076
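
The closed-loop procedure summarized in the contents field (generate an image, have an MLLM compare it with the user's prompt, rewrite the prompt, repeat) could be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the names `Feedback`, `refine_loop`, `generate`, `critique`, and `max_rounds` are hypothetical, the stopping rule is an assumption, and the toy generator and critic merely stand in for a black-box T2I model and a pretrained MLLM.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple


@dataclass
class Feedback:
    """Hypothetical critic output: alignment verdict plus a rewritten prompt."""
    aligned: bool
    refined_prompt: str


def refine_loop(
    user_prompt: str,
    generate: Callable[[str], FrozenSet[str]],
    critique: Callable[[str, FrozenSet[str]], Feedback],
    max_rounds: int = 3,  # assumed budget; the abstract does not state one
) -> Tuple[FrozenSet[str], List[Tuple[str, FrozenSet[str]]]]:
    """Alternate generation and refinement until the critic reports alignment."""
    prompt = user_prompt
    history: List[Tuple[str, FrozenSet[str]]] = []
    image: FrozenSet[str] = frozenset()
    for _ in range(max_rounds):
        image = generate(prompt)              # black-box T2I call
        history.append((prompt, image))
        feedback = critique(user_prompt, image)  # always judged against the user's prompt
        if feedback.aligned:
            break
        prompt = feedback.refined_prompt      # next round uses the refined prompt
    return image, history


def toy_generate(prompt: str) -> FrozenSet[str]:
    """Toy stand-in for a T2I model: the 'image' is the set of depicted words.
    It omits 'cat' unless the prompt mentions it at least twice."""
    words = prompt.lower().replace(",", " ").split()
    return frozenset(w for w in set(words) if w != "cat" or words.count("cat") >= 2)


def toy_critique(user_prompt: str, image: FrozenSet[str]) -> Feedback:
    """Toy stand-in for an MLLM: flags prompt words missing from the image."""
    missing = [w for w in user_prompt.lower().split() if w not in image]
    if not missing:
        return Feedback(True, user_prompt)
    refined = user_prompt + ", with a clearly visible " + " and ".join(missing)
    return Feedback(False, refined)


if __name__ == "__main__":
    final_image, rounds = refine_loop("cat on sofa", toy_generate, toy_critique)
    for used_prompt, produced in rounds:
        print(used_prompt, "->", sorted(produced))
```

Passing the generator and critic in as plain callables reflects the plug-and-play, training-free property claimed in the abstract: the loop never inspects or updates either model.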

Similar Items