Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Szatkownik, Antoine; Decelle, Aurélien; Seoane, Beatriz; Bereux, Nicolas; Planche, Léo; Charpiat, Guillaume; Yelmen, Burak; Jay, Flora; Furtlehner, Cyril
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2510.24233

_version_	1866908617090793472
author	Szatkownik, Antoine; Decelle, Aurélien; Seoane, Beatriz; Bereux, Nicolas; Planche, Léo; Charpiat, Guillaume; Yelmen, Burak; Jay, Flora; Furtlehner, Cyril
contents	Deep generative models are often trained on sensitive data, such as genetic sequences, health data, or, more broadly, any copyrighted, licensed, or protected content. This raises critical concerns about privacy-preserving synthetic data, and more specifically about privacy leakage, an issue closely tied to overfitting. Existing methods rely almost exclusively on global criteria to estimate the risk of privacy failure associated with a model, offering only quantitative, non-interpretable insights. The absence of rigorous sample-level evaluation methods for data privacy may hinder the practical deployment of synthetic data in real-world applications. Using extreme value statistics on nearest-neighbor distances, we propose PRIVET, a generic, sample-based, modality-agnostic algorithm that assigns an individual privacy leak score to each synthetic sample. We empirically demonstrate that PRIVET reliably detects instances of memorization and privacy leakage across diverse data modalities, including settings with very high dimensionality and limited sample sizes, such as genetic data, and even in underfitting regimes. We compare our method to existing approaches under controlled settings and show its advantage in providing both dataset-level and sample-level assessments through qualitative and quantitative outputs. Additionally, our analysis reveals that existing computer vision embeddings are limited in their ability to yield perceptually meaningful distances when identifying near-duplicate samples.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_24233
institution	arXiv
publishDate	2025
record_format	arxiv
title	PRIVET: Privacy Metric Based on Extreme Value Theory
topic	Machine Learning
url	https://arxiv.org/abs/2510.24233
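
The abstract describes scoring each synthetic sample from extreme value statistics on nearest-neighbor distances, but this record does not give the algorithm itself. The following is therefore only a minimal sketch of the general idea, not the authors' PRIVET implementation. It assumes a held-out set of real samples is available as a reference, and it swaps in a standard Hill-type estimator for the lower tail of the reference distances. All function names (`nn_distances`, `fit_lower_tail`, `leak_scores`), the tail size `k`, and the toy Gaussian data are hypothetical choices made for illustration.

```python
import numpy as np


def nn_distances(queries, train):
    """Euclidean distance from each query row to its nearest row in train."""
    d2 = (
        (queries ** 2).sum(axis=1)[:, None]
        - 2.0 * queries @ train.T
        + (train ** 2).sum(axis=1)[None, :]
    )
    return np.sqrt(np.maximum(d2, 0.0).min(axis=1))


def fit_lower_tail(ref_dist, k):
    """Hill-type fit of the lower tail: P(D <= r) ~ (k/n) * (r/u)**alpha for r <= u.

    Assumption: small nearest-neighbor distances follow a power law near zero.
    """
    d = np.sort(np.maximum(ref_dist, 1e-12))
    u = d[k]  # threshold: the (k+1)-th smallest reference distance
    log_ratios = np.log(u / d[:k])
    alpha = k / max(log_ratios.sum(), 1e-12)  # guard against ties at u
    return u, alpha, k / len(d)


def leak_scores(syn_dist, ref_dist, k=50):
    """Per-sample score: -log10 of the probability that a held-out real
    sample lies at least this close to the training set."""
    u, alpha, p_u = fit_lower_tail(ref_dist, k)
    ref_sorted = np.sort(ref_dist)
    # Empirical CDF above the threshold, fitted power law below it.
    p_emp = np.searchsorted(ref_sorted, syn_dist, side="right") / len(ref_sorted)
    p_tail = p_u * (np.maximum(syn_dist, 1e-12) / u) ** alpha
    p = np.where(syn_dist <= u, p_tail, p_emp)
    return -np.log10(np.clip(p, 1e-300, 1.0))


# Toy data (hypothetical): 200 fresh samples plus 20 near-copies of training rows.
rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 10))
held_out = rng.normal(size=(1000, 10))
fresh = rng.normal(size=(200, 10))
copies = train[:20] + 1e-3 * rng.normal(size=(20, 10))
synthetic = np.vstack([fresh, copies])

ref = nn_distances(held_out, train)
scores = leak_scores(nn_distances(synthetic, train), ref)
print("max score, fresh samples:", scores[:200].max())
print("min score, near-copies:  ", scores[200:].min())
```

In this toy setting the near-copies lie far closer to the training set than any held-out sample, so they receive much higher scores than the fresh samples. A sample-level score of this kind flags which individual synthetic samples look memorized, rather than giving a single dataset-level number.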