MARC21: :: Library Catalog

Salvato in:

Dettagli Bibliografici
Autori principali:	Goswami, Ashish, Modi, Satyam Kumar, Deshineni, Santhosh Rishi, Singh, Harman, P, Prathosh A., Singla, Parag
Natura:	Preprint
Pubblicazione:	2024
Soggetti:	Computer Vision and Pattern Recognition
Accesso online:	https://arxiv.org/abs/2412.06089
Tags:	Aggiungi Tag Nessun Tag, puoi essere il primo ad aggiungerne!!

_version_	1866910869196111872
author	Goswami, Ashish Modi, Satyam Kumar Deshineni, Santhosh Rishi Singh, Harman P, Prathosh A. Singla, Parag
author_facet	Goswami, Ashish Modi, Satyam Kumar Deshineni, Santhosh Rishi Singh, Harman P, Prathosh A. Singla, Parag
contents	Text-to-image (T2I) generation has seen significant progress with diffusion models, enabling generation of photo-realistic images from text prompts. Despite this progress, existing methods still face challenges in following complex text prompts, especially those requiring compositional and multi-step reasoning. Given such complex instructions, SOTA models often make mistakes in faithfully modeling object attributes, and relationships among them. In this work, we present an alternate paradigm for T2I synthesis, decomposing the task of complex multi-step generation into three steps, (a) Generate: we first generate an image using existing diffusion models (b) Plan: we make use of Multi-Modal LLMs (MLLMs) to identify the mistakes in the generated image expressed in terms of individual objects and their properties, and produce a sequence of corrective steps required in the form of an edit-plan. (c) Edit: we make use of an existing text-guided image editing models to sequentially execute our edit-plan over the generated image to get the desired image which is faithful to the original instruction. Our approach derives its strength from the fact that it is modular in nature, is training free, and can be applied over any combination of image generation and editing models. As an added contribution, we also develop a model capable of compositional editing, which further helps improve the overall accuracy of our proposed approach. Our method flexibly trades inference time compute with performance on compositional text prompts. We perform extensive experimental evaluation across 3 benchmarks and 10 T2I models including DALLE-3 and the latest -- SD-3.5-Large. Our approach not only improves the performance of the SOTA models, by upto 3 points, it also reduces the performance gap between weaker and stronger models. $\href{https://dair-iitd.github.io/GraPE/}{https://dair-iitd.github.io/GraPE/}$
format	Preprint
id	arxiv_https___arxiv_org_abs_2412_06089
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis Goswami, Ashish Modi, Satyam Kumar Deshineni, Santhosh Rishi Singh, Harman P, Prathosh A. Singla, Parag Computer Vision and Pattern Recognition Text-to-image (T2I) generation has seen significant progress with diffusion models, enabling generation of photo-realistic images from text prompts. Despite this progress, existing methods still face challenges in following complex text prompts, especially those requiring compositional and multi-step reasoning. Given such complex instructions, SOTA models often make mistakes in faithfully modeling object attributes, and relationships among them. In this work, we present an alternate paradigm for T2I synthesis, decomposing the task of complex multi-step generation into three steps, (a) Generate: we first generate an image using existing diffusion models (b) Plan: we make use of Multi-Modal LLMs (MLLMs) to identify the mistakes in the generated image expressed in terms of individual objects and their properties, and produce a sequence of corrective steps required in the form of an edit-plan. (c) Edit: we make use of an existing text-guided image editing models to sequentially execute our edit-plan over the generated image to get the desired image which is faithful to the original instruction. Our approach derives its strength from the fact that it is modular in nature, is training free, and can be applied over any combination of image generation and editing models. As an added contribution, we also develop a model capable of compositional editing, which further helps improve the overall accuracy of our proposed approach. Our method flexibly trades inference time compute with performance on compositional text prompts. We perform extensive experimental evaluation across 3 benchmarks and 10 T2I models including DALLE-3 and the latest -- SD-3.5-Large. Our approach not only improves the performance of the SOTA models, by upto 3 points, it also reduces the performance gap between weaker and stronger models. $\href{https://dair-iitd.github.io/GraPE/}{https://dair-iitd.github.io/GraPE/}$
title	GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2412.06089

Documenti analoghi