Staff View: SHYI: Action Support for Contrastive Learning in High-Fidelity Text-to-Image Generation :: Library Catalog

Bibliographic Details
Main Authors:	Xia, Tianxiang; Xiao, Lin; Montorfani, Yannick; Pavia, Francesco; Simsar, Enis; Hofmann, Thomas
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2501.09055

_version_	1866916567526146048
author	Xia, Tianxiang; Xiao, Lin; Montorfani, Yannick; Pavia, Francesco; Simsar, Enis; Hofmann, Thomas
author_facet	Xia, Tianxiang; Xiao, Lin; Montorfani, Yannick; Pavia, Francesco; Simsar, Enis; Hofmann, Thomas
contents	In this project, we address the issue of infidelity in text-to-image generation, particularly for actions involving multiple objects. We build on the CONFORM framework, which uses contrastive learning to improve the accuracy of generated images containing multiple objects. However, the depiction of actions involving multiple different objects still has considerable room for improvement. To address this, we employ semantically hypergraphic contrastive adjacency learning, which comprises an enhanced contrastive structure and a "contrast but link" technique. We further improve Stable Diffusion's understanding of actions using InteractDiffusion. As evaluation metrics, we use CLIP image-text similarity and TIFA. In addition, we conducted a user study. Our method shows promising results even for verbs that Stable Diffusion understands only moderately well. We then outline future directions based on an analysis of the results. Our codebase is available on polybox at: https://polybox.ethz.ch/index.php/s/dJm3SWyRohUrFxn
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_09055
institution	arXiv
publishDate	2025
record_format	arxiv
title	SHYI: Action Support for Contrastive Learning in High-Fidelity Text-to-Image Generation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2501.09055

Similar Items