Staff View: How far can we go with ImageNet for Text-to-Image generation? :: Library Catalog

Bibliographic Details
Main Authors:	Degeorge, L., Ghosh, A., Dufour, N., Picard, D., Kalogeiton, V.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2502.21318

_version_	1866918152728739840
author	Degeorge, L.; Ghosh, A.; Dufour, N.; Picard, D.; Kalogeiton, V.
author_facet	Degeorge, L.; Ghosh, A.; Dufour, N.; Picard, D.; Kalogeiton, V.
contents	Recent text-to-image (T2I) generation models have achieved remarkable success by training on billion-scale datasets, following a "bigger is better" paradigm that prioritizes data quantity over availability (closed vs. open source) and reproducibility (data decay vs. established collections). We challenge this established paradigm by demonstrating that one can match the capabilities of models trained on massive web-scraped collections using only ImageNet enhanced with well-designed text and image augmentations. With this much simpler setup, we achieve a +6% overall score over SD-XL on GenEval and +5% on DPGBench while using just 1/10th the parameters and 1/1000th the training images. We also show that ImageNet-pretrained models can be finetuned on task-specific datasets (such as those for high-resolution aesthetic applications) with good results, indicating that ImageNet is sufficient for acquiring general capabilities. This opens the way for more reproducible research, as ImageNet is widely available and the proposed standardized training setup requires only 500 H100 hours to train a text-to-image model.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_21318
institution	arXiv
publishDate	2025
record_format	arxiv
title	How far can we go with ImageNet for Text-to-Image generation?
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2502.21318

Similar Items