Staff View: Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs :: Library Catalog

Bibliographic Details
Main Authors:	Yang, Ling; Yu, Zhaochen; Meng, Chenlin; Xu, Minkai; Ermon, Stefano; Cui, Bin
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2401.11708

_version_	1866910471433486336
author	Yang, Ling; Yu, Zhaochen; Meng, Chenlin; Xu, Minkai; Ermon, Stefano; Cui, Bin
author_facet	Yang, Ling; Yu, Zhaochen; Meng, Chenlin; Xu, Minkai; Ermon, Stefano; Cui, Bin
contents	Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a brand new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: https://github.com/YangLing0818/RPG-DiffusionMaster
format	Preprint
id	arxiv_https___arxiv_org_abs_2401_11708
institution	arXiv
publishDate	2024
record_format	arxiv
title	Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2401.11708

Similar Items