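Staff View: A Cat Is A Cat (Not A Dog!): Unraveling Information Mix-ups in Text-to-Image Encoders through Causal Analysis and Embedding Optimization :: Library Catalog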

Bibliographic Details
Main Authors:	Chen, Chieh-Yun; Tseng, Chiang; Tsao, Li-Wu; Shuai, Hong-Han
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2410.00321

_version_	1866909380225531904
author	Chen, Chieh-Yun; Tseng, Chiang; Tsao, Li-Wu; Shuai, Hong-Han
author_facet	Chen, Chieh-Yun; Tseng, Chiang; Tsao, Li-Wu; Shuai, Hong-Han
contents	This paper analyzes the impact of the causal manner of the text encoder in text-to-image (T2I) diffusion models, which can lead to information bias and loss. Previous work has focused on addressing these issues through the denoising process. However, no research has examined how the text embedding contributes to T2I models, especially when generating more than one object. In this paper, we share a comprehensive analysis of text embedding: i) how the text embedding contributes to the generated images and ii) why information gets lost and becomes biased towards the first-mentioned object. Accordingly, we propose a simple but effective, training-free text embedding balance optimization method, which improves information balance in Stable Diffusion by 125.42%. Furthermore, we propose a new automatic evaluation metric that quantifies information loss more accurately than existing methods, achieving 81% concordance with human assessments. This metric effectively measures the presence and accuracy of objects, addressing the limitations of current distribution scores such as CLIP's text-image similarities.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_00321
institution	arXiv
publishDate	2024
record_format	arxiv
title	A Cat Is A Cat (Not A Dog!): Unraveling Information Mix-ups in Text-to-Image Encoders through Causal Analysis and Embedding Optimization
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2410.00321

Similar Items