Staff View: Concept-Aware Batch Sampling Improves Language-Image Pretraining :: Library Catalog

Bibliographic Details
Main Authors:	Ghosh, Adhiraj; Udandarao, Vishaal; Nguyen, Thao; Farina, Matteo; Cherti, Mehdi; Jitsev, Jenia; Oh, Sewoong; Ricci, Elisa; Schmidt, Ludwig; Bethge, Matthias
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2511.20643

_version_	1866918218010984448
author	Ghosh, Adhiraj; Udandarao, Vishaal; Nguyen, Thao; Farina, Matteo; Cherti, Mehdi; Jitsev, Jenia; Oh, Sewoong; Ricci, Elisa; Schmidt, Ludwig; Bethge, Matthias
author_facet	Ghosh, Adhiraj; Udandarao, Vishaal; Nguyen, Thao; Farina, Matteo; Cherti, Mehdi; Jitsev, Jenia; Oh, Sewoong; Ricci, Elisa; Schmidt, Ludwig; Bethge, Matthias
contents	What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_20643
institution	arXiv
publishDate	2025
record_format	arxiv
title	Concept-Aware Batch Sampling Improves Language-Image Pretraining
topic	Computer Vision and Pattern Recognition; Machine Learning
url	https://arxiv.org/abs/2511.20643
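
Illustrative note on the abstract: the two batch-sampling variants it names can be pictured with a minimal sketch. This is not the authors' implementation. The `Sample` type, both function names, the use of annotation count as a stand-in for "object multiplicity", and the greedy least-seen-concept rule for diversity are all assumptions made here for illustration; the record does not describe the actual selection rules.

```python
from collections import Counter
from typing import List, NamedTuple, Sequence, Tuple


class Sample(NamedTuple):
    """Hypothetical record: an image-text pair ID plus its concept annotations."""
    sample_id: str
    concepts: Tuple[str, ...]


def select_frequency_max(pool: Sequence[Sample], batch_size: int) -> List[Sample]:
    """Assumed FM-style rule: keep the samples with the most concept annotations."""
    # sorted() is stable, so ties keep their original pool order.
    ranked = sorted(pool, key=lambda s: len(s.concepts), reverse=True)
    return ranked[:batch_size]


def select_diversity_max(pool: Sequence[Sample], batch_size: int) -> List[Sample]:
    """Assumed DM-style rule: greedily add the sample whose concepts are,
    on average, least represented in the batch built so far."""
    seen: Counter = Counter()
    remaining = list(pool)
    batch: List[Sample] = []

    def score(sample: Sample) -> float:
        concepts = set(sample.concepts)
        if not concepts:
            return float("inf")  # unannotated samples are picked last
        return sum(seen[c] for c in concepts) / len(concepts)

    while remaining and len(batch) < batch_size:
        best = min(remaining, key=score)  # first minimum wins on ties
        remaining.remove(best)
        batch.append(best)
        seen.update(set(best.concepts))
    return batch
```

In this sketch a larger candidate pool is drawn first and a smaller training batch is selected from it at each step, which is one way to read "constructs batches on-the-fly based on specific target distributions".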

Similar Items