Internformat: :: Library Catalog

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Sun, Siyi, Selby, David Antony, Huang, Yunchuan, Vollmer, Sebastian, Flaxman, Seth, Calinescu, Anisoara
Format:	Preprint
Veröffentlicht:	2025
Schlagworte:	Machine Learning Computational Engineering, Finance, and Science
Online-Zugang:	https://arxiv.org/abs/2506.08844
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

_version_	1866915335412645888
author	Sun, Siyi Selby, David Antony Huang, Yunchuan Vollmer, Sebastian Flaxman, Seth Calinescu, Anisoara
author_facet	Sun, Siyi Selby, David Antony Huang, Yunchuan Vollmer, Sebastian Flaxman, Seth Calinescu, Anisoara
contents	Missing data imputation in tabular datasets remains a pivotal challenge in data science and machine learning, particularly within socioeconomic research. However, real-world socioeconomic datasets are typically subject to strict data protection protocols, which often prohibit public sharing, even for synthetic derivatives. This severely limits the reproducibility and accessibility of benchmark studies in such settings. Further, there are very few publicly available synthetic datasets. Thus, there is limited availability of benchmarks for systematic evaluation of imputation methods on socioeconomic datasets, whether real or synthetic. In this study, we utilize the World Bank's publicly available synthetic dataset, Synthetic Data for an Imaginary Country, which closely mimics a real World Bank household survey while being fully public, enabling broad access for methodological research. With this as a starting point, we derived the IMAGIC-500 dataset: we select a subset of 500k individuals across approximately 100k households with 19 socioeconomic features, designed to reflect the hierarchical structure of real-world household surveys. This paper introduces a comprehensive missing data imputation benchmark on IMAGIC-500 under various missing mechanisms (MCAR, MAR, MNAR) and missingness ratios (10\%, 20\%, 30\%, 40\%, 50\%). Our evaluation considers the imputation accuracy for continuous and categorical variables, computational efficiency, and impact on downstream predictive tasks, such as estimating educational attainment at the individual level. The results highlight the strengths and weaknesses of statistical, traditional machine learning, and deep learning imputation techniques, including recent diffusion-based methods. The IMAGIC-500 dataset and benchmark aim to facilitate the development of robust imputation algorithms and foster reproducible social science research.
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_08844
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	IMAGIC-500: IMputation benchmark on A Generative Imaginary Country (500k samples) Sun, Siyi Selby, David Antony Huang, Yunchuan Vollmer, Sebastian Flaxman, Seth Calinescu, Anisoara Machine Learning Computational Engineering, Finance, and Science Missing data imputation in tabular datasets remains a pivotal challenge in data science and machine learning, particularly within socioeconomic research. However, real-world socioeconomic datasets are typically subject to strict data protection protocols, which often prohibit public sharing, even for synthetic derivatives. This severely limits the reproducibility and accessibility of benchmark studies in such settings. Further, there are very few publicly available synthetic datasets. Thus, there is limited availability of benchmarks for systematic evaluation of imputation methods on socioeconomic datasets, whether real or synthetic. In this study, we utilize the World Bank's publicly available synthetic dataset, Synthetic Data for an Imaginary Country, which closely mimics a real World Bank household survey while being fully public, enabling broad access for methodological research. With this as a starting point, we derived the IMAGIC-500 dataset: we select a subset of 500k individuals across approximately 100k households with 19 socioeconomic features, designed to reflect the hierarchical structure of real-world household surveys. This paper introduces a comprehensive missing data imputation benchmark on IMAGIC-500 under various missing mechanisms (MCAR, MAR, MNAR) and missingness ratios (10\%, 20\%, 30\%, 40\%, 50\%). Our evaluation considers the imputation accuracy for continuous and categorical variables, computational efficiency, and impact on downstream predictive tasks, such as estimating educational attainment at the individual level. The results highlight the strengths and weaknesses of statistical, traditional machine learning, and deep learning imputation techniques, including recent diffusion-based methods. The IMAGIC-500 dataset and benchmark aim to facilitate the development of robust imputation algorithms and foster reproducible social science research.
title	IMAGIC-500: IMputation benchmark on A Generative Imaginary Country (500k samples)
topic	Machine Learning Computational Engineering, Finance, and Science
url	https://arxiv.org/abs/2506.08844

Ähnliche Einträge