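Staff View: Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion :: Library Catalog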

Bibliographic Details
Main Authors:	Dell'Erba, Samuele; Bagdanov, Andrew D.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; Machine Learning; I.4.9; I.2.10; I.2.7
Online Access:	https://arxiv.org/abs/2511.20821

_version_	1866911340374786048
author	Dell'Erba, Samuele; Bagdanov, Andrew D.
author_facet	Dell'Erba, Samuele; Bagdanov, Andrew D.
contents	Diffusion models have established the state of the art in text-to-image generation, but their performance often relies on a diffusion prior network that translates text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior by employing Optimization-based Visual Inversion (OVI), a training-free, zero-shot alternative. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize its cosine similarity with the embedding of the input text prompt. We further propose two novel constraints, a Mahalanobis-based loss and a Nearest-Neighbor loss, to regularize the OVI optimization toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks such as T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective: it achieves quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, underscoring the potential of optimization-based strategies as viable, training-free alternatives to traditional priors. The code will be publicly available upon acceptance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_20821
institution	arXiv
publishDate	2025
record_format	arxiv
title	Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; Machine Learning; I.4.9; I.2.10; I.2.7
url	https://arxiv.org/abs/2511.20821