| Main authors: | Sidorenko, Andrey; Platzer, Michael; Scriminaci, Mario; Tiwald, Paul |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning; Artificial Intelligence |
| Online access: | https://arxiv.org/abs/2504.01908 |
| _version_ | 1866913773192740864 |
|---|---|
| author | Sidorenko, Andrey; Platzer, Michael; Scriminaci, Mario; Tiwald, Paul |
| author_facet | Sidorenko, Andrey; Platzer, Michael; Scriminaci, Mario; Tiwald, Paul |
| contents | Evaluating the quality of synthetic data remains a key challenge for ensuring privacy and utility in data-driven research. In this work, we present an evaluation framework that quantifies how well synthetic data replicates original distributional properties while ensuring privacy. The proposed approach employs a holdout-based benchmarking strategy that facilitates quantitative assessment through low- and high-dimensional distribution comparisons, embedding-based similarity measures, and nearest-neighbor distance metrics. The framework supports various data types and structures, including sequential and contextual information, and enables interpretable quality diagnostics through a set of standardized metrics. These contributions aim to support reproducibility and methodological consistency in benchmarking of synthetic data generation techniques. The code of the framework is available at https://github.com/mostly-ai/mostlyai-qa. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2504_01908 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework |
| topic | Machine Learning; Artificial Intelligence |
| url | https://arxiv.org/abs/2504.01908 |
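
The abstract in this record mentions a holdout-based strategy combined with nearest-neighbor distance metrics. As a rough illustration of that general idea only, the sketch below checks whether synthetic records sit closer to the training data than to a holdout set that the generator never saw. This is not the mostlyai-qa API: the function names, the Euclidean distance, the toy Gaussian data, and the "share closer to training" summary are all assumptions made for this example.

```python
import numpy as np


def nearest_distances(queries, reference):
    """Euclidean distance from each query row to its nearest reference row."""
    # Pairwise differences via broadcasting: (n_queries, n_reference, n_features)
    diffs = queries[:, None, :] - reference[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


def closer_to_training_share(synthetic, training, holdout):
    """Fraction of synthetic rows whose nearest training row is closer
    than their nearest holdout row (ties count as one half)."""
    d_train = nearest_distances(synthetic, training)
    d_hold = nearest_distances(synthetic, holdout)
    closer = (d_train < d_hold).astype(float) + 0.5 * (d_train == d_hold)
    return float(closer.mean())


# Hypothetical toy data: training and holdout come from the same distribution.
rng = np.random.default_rng(0)
training = rng.normal(size=(200, 4))
holdout = rng.normal(size=(200, 4))

# A "well-behaved" generator: fresh samples from the same distribution.
fresh = rng.normal(size=(200, 4))
# A "memorizing" generator: near-copies of the training rows.
copied = training + rng.normal(scale=1e-3, size=training.shape)

share_fresh = closer_to_training_share(fresh, training, holdout)
share_copied = closer_to_training_share(copied, training, holdout)
print(f"fresh samples closer to training:  {share_fresh:.2f}")
print(f"copied samples closer to training: {share_copied:.2f}")
```

Under these assumptions, a share near one half suggests the synthetic rows are no closer to the training data than to unseen data, while a share near one suggests the generator is reproducing training records. The holdout set provides the baseline that makes the training distances interpretable.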