Staff View: Enhancing Image Generation Fidelity via Progressive Prompts :: Library Catalog

Bibliographic Details
Main Authors:	Xiong, Zhen; Li, Yuqi; Yang, Chuanguang; Tan, Tiao; Zhu, Zhihong; Li, Siyuan; Ma, Yue
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2501.07070

_version_	1866917891268411392
author	Xiong, Zhen; Li, Yuqi; Yang, Chuanguang; Tan, Tiao; Zhu, Zhihong; Li, Siyuan; Ma, Yue
author_facet	Xiong, Zhen; Li, Yuqi; Yang, Chuanguang; Tan, Tiao; Zhu, Zhihong; Li, Siyuan; Ma, Yue
contents	The diffusion transformer (DiT) architecture has attracted significant attention in image generation, achieving better fidelity, performance, and diversity. However, most existing DiT-based image generation methods focus on global-aware synthesis, and regional prompt control has been less explored. In this paper, we propose a coarse-to-fine generation pipeline for regional prompt-following generation. Specifically, we first utilize the powerful large language model (LLM) to generate both high-level descriptions of the image (such as content, topic, and objects) and low-level descriptions (such as details and style). Then, we explore the influence of cross-attention layers at different depths. We find that deeper layers are always responsible for high-level content control, while shallow layers handle low-level content control. Various prompts are injected into the proposed regional cross-attention control for coarse-to-fine generation. By using the proposed pipeline, we enhance the controllability of DiT-based image generation. Extensive quantitative and qualitative results show that our pipeline can improve the performance of the generated images.
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_07070
institution	arXiv
publishDate	2025
record_format	arxiv
title	Enhancing Image Generation Fidelity via Progressive Prompts
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2501.07070

Similar Items