Staff View: Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models :: Library Catalog

Bibliographic Details
Main Authors:	Pernias, Pablo; Rampas, Dominic; Richter, Mats L.; Pal, Christopher J.; Aubreville, Marc
Format:	Preprint
Published:	2023
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2306.00637

_version_	1866929364721991680
author	Pernias, Pablo; Rampas, Dominic; Richter, Mats L.; Pal, Christopher J.; Aubreville, Marc
author_facet	Pernias, Pablo; Rampas, Dominic; Richter, Mats L.; Pal, Christopher J.; Aubreville, Marc
contents	We introduce Würstchen, a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models. A key contribution of our work is a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation that guides the diffusion process. This highly compressed representation of an image provides much more detailed guidance than latent representations of language, which significantly reduces the computational requirements for achieving state-of-the-art results. According to our user preference study, our approach also improves the quality of text-conditioned image generation. Training our approach requires 24,602 A100-GPU hours, compared to 200,000 GPU hours for Stable Diffusion 2.1. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allow us to perform inference more than twice as fast, significantly reducing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model without compromising end performance. In a broader comparison against SOTA models, our approach is substantially more efficient and compares favorably in terms of image quality. We believe that this work motivates more emphasis on prioritizing both performance and computational accessibility.
format	Preprint
id	arxiv_https___arxiv_org_abs_2306_00637
institution	arXiv
publishDate	2023
record_format	arxiv
title	Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2306.00637

Similar Items