Bibliographic Details
Main Authors: Chen, Tong; Selvan, Raghavendra
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2509.10367
_version_ 1866918139988541440
author Chen, Tong
Selvan, Raghavendra
author_facet Chen, Tong
Selvan, Raghavendra
contents Given a dataset of finitely many elements $\mathcal{T} = \{\mathbf{x}_i\}_{i = 1}^N$, the goal of dataset condensation (DC) is to construct a synthetic dataset $\mathcal{S} = \{\tilde{\mathbf{x}}_j\}_{j = 1}^M$ which is significantly smaller ($M \ll N$) such that a model trained from scratch on $\mathcal{S}$ achieves comparable or even superior generalization performance to a model trained on $\mathcal{T}$. Recent advances in DC reveal a close connection to the problem of approximating the data distribution represented by $\mathcal{T}$ with a reduced set of points. In this work, we present a unified framework that encompasses existing DC methods and extend the task-specific notion of DC to a more general and formal definition using notions of discrepancy, which quantify the distance between probability distributions in different regimes. Our framework broadens the objective of DC beyond generalization, accommodating additional objectives such as robustness, privacy, and other desirable properties.
format Preprint
id arxiv_https___arxiv_org_abs_2509_10367
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle A Discrepancy-Based Perspective on Dataset Condensation
Chen, Tong
Selvan, Raghavendra
Machine Learning
title A Discrepancy-Based Perspective on Dataset Condensation
topic Machine Learning
url https://arxiv.org/abs/2509.10367