Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Dong, Ziyi, Xiao, Yao, Wei, Pengxu, Lin, Liang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition Artificial Intelligence Computation and Language
Online Access:	https://arxiv.org/abs/2502.04412
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866909482193256448
author	Dong, Ziyi Xiao, Yao Wei, Pengxu Lin, Liang
author_facet	Dong, Ziyi Xiao, Yao Wei, Pengxu Lin, Liang
contents	Groundbreaking advancements in text-to-image generation have recently been achieved with the emergence of diffusion models. These models exhibit a remarkable ability to generate highly artistic and intricately detailed images based on textual prompts. However, obtaining desired generation outcomes often necessitates repetitive trials of manipulating text prompts just like casting spells on a magic mirror, and the reason behind that is the limited capability of semantic understanding inherent in current image generation models. Specifically, existing diffusion models encode the text prompt input with a pre-trained encoder structure, which is usually trained on a limited number of image-caption pairs. The state-of-the-art large language models (LLMs) based on the decoder-only structure have shown a powerful semantic understanding capability as their architectures are more suitable for training on very large-scale unlabeled data. In this work, we propose to enhance text-to-image diffusion models by borrowing the strength of semantic understanding from large language models, and devise a simple yet effective adapter to allow the diffusion models to be compatible with the decoder-only structure. Meanwhile, we also provide a supporting theoretical analysis with various architectures (e.g., encoder-only, encoder-decoder, and decoder-only), and conduct extensive empirical evaluations to verify its effectiveness. The experimental results show that the enhanced models with our adapter module are superior to the stat-of-the-art models in terms of text-to-image generation quality and reliability.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_04412
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Decoder-Only LLMs are Better Controllers for Diffusion Models Dong, Ziyi Xiao, Yao Wei, Pengxu Lin, Liang Computer Vision and Pattern Recognition Artificial Intelligence Computation and Language Groundbreaking advancements in text-to-image generation have recently been achieved with the emergence of diffusion models. These models exhibit a remarkable ability to generate highly artistic and intricately detailed images based on textual prompts. However, obtaining desired generation outcomes often necessitates repetitive trials of manipulating text prompts just like casting spells on a magic mirror, and the reason behind that is the limited capability of semantic understanding inherent in current image generation models. Specifically, existing diffusion models encode the text prompt input with a pre-trained encoder structure, which is usually trained on a limited number of image-caption pairs. The state-of-the-art large language models (LLMs) based on the decoder-only structure have shown a powerful semantic understanding capability as their architectures are more suitable for training on very large-scale unlabeled data. In this work, we propose to enhance text-to-image diffusion models by borrowing the strength of semantic understanding from large language models, and devise a simple yet effective adapter to allow the diffusion models to be compatible with the decoder-only structure. Meanwhile, we also provide a supporting theoretical analysis with various architectures (e.g., encoder-only, encoder-decoder, and decoder-only), and conduct extensive empirical evaluations to verify its effectiveness. The experimental results show that the enhanced models with our adapter module are superior to the stat-of-the-art models in terms of text-to-image generation quality and reliability.
title	Decoder-Only LLMs are Better Controllers for Diffusion Models
topic	Computer Vision and Pattern Recognition Artificial Intelligence Computation and Language
url	https://arxiv.org/abs/2502.04412

Similar Items