Bilateral Distribution Compression: Reducing Both Data Size and Dimensionality :: Library Catalog

Bibliographic Details
Main Authors:	Broadbent, Dominic; Whiteley, Nick; Allison, Robert; Lovett, Tom
Format:	Preprint
Published:	2025
Subjects:	Machine Learning Methodology
Online Access:	https://arxiv.org/abs/2509.17543

_version_	1866914282437869568
author	Broadbent, Dominic; Whiteley, Nick; Allison, Robert; Lovett, Tom
author_facet	Broadbent, Dominic; Whiteley, Nick; Allison, Robert; Lovett, Tom
contents	Existing distribution compression methods reduce the number of observations in a dataset by minimising the Maximum Mean Discrepancy (MMD) between original and compressed sets, but modern datasets are often large in both sample size and dimensionality. We propose Bilateral Distribution Compression (BDC), a two-stage framework that compresses along both axes while preserving the underlying distribution, with overall linear time and memory complexity in dataset size and dimension. Central to BDC is the Decoded MMD (DMMD), which we introduce to quantify the discrepancy between the original data and a compressed set decoded from a low-dimensional latent space. BDC proceeds by (i) learning a low-dimensional projection using the Reconstruction MMD (RMMD), and (ii) optimising a latent compressed set with the Encoded MMD (EMMD). We show that this procedure minimises the DMMD, guaranteeing that the compressed set faithfully represents the original distribution. Experiments show that BDC can achieve comparable or superior downstream task performance to ambient-space compression at substantially lower cost and with significantly higher rates of compression.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_17543
institution	arXiv
publishDate	2025
record_format	arxiv
title	Bilateral Distribution Compression: Reducing Both Data Size and Dimensionality
topic	Machine Learning Methodology
url	https://arxiv.org/abs/2509.17543
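
The abstract in the contents field builds on the Maximum Mean Discrepancy (MMD) between an original dataset and a compressed set. As a rough illustration of that quantity only, the sketch below computes a plain biased estimate of the squared MMD in pure Python. It assumes a Gaussian kernel with a fixed bandwidth, which this record does not specify, and the function names are hypothetical. It costs quadratic time in the number of points, so it is not the linear-time BDC procedure, and it does not implement the DMMD, RMMD, or EMMD described in the abstract.

```python
import math
import random


def gaussian_kernel(u, v, bandwidth=1.0):
    # Assumption: Gaussian (RBF) kernel; the record does not name a kernel.
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    # Average kernel value over all pairs (x, y); quadratic cost.
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    # Biased (V-statistic) estimate of the squared MMD between two samples.
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


random.seed(0)
# Synthetic stand-in for an "original" dataset: 200 points in 3 dimensions.
data = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
# A random subset plays the role of a (naive) compressed set.
subset = random.sample(data, 20)
# Shifting the subset makes it a deliberately poor summary of the data.
shifted = [[value + 2.0 for value in point] for point in subset]

print("MMD^2, data vs subset: ", mmd_squared(data, subset))
print("MMD^2, data vs shifted:", mmd_squared(data, shifted))
```

A compressed set that represents the data well should score a lower value than one that does not, which is why the shifted subset is included for comparison.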