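Staff View: Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation :: Library Catalog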

Bibliographic Details
Main Authors:	Barr, Austin A.; Rozman, Robert; Guo, Eddie
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2502.14523

_version_	1866912239106129920
author	Barr, Austin A.; Rozman, Robert; Guo, Eddie
author_facet	Barr, Austin A.; Rozman, Robert; Guo, Eddie
contents	We propose a new framework for zero-shot generation of synthetic tabular data. Using the large language model (LLM) GPT-4o and plain-language prompting, we demonstrate the ability to generate high-fidelity tabular data without task-specific fine-tuning or access to real-world data (RWD) for pre-training. To benchmark GPT-4o, we compared the fidelity and privacy of LLM-generated synthetic data against data generated with the conditional tabular generative adversarial network (CTGAN), across three open-access datasets: Iris, Fish Measurements, and Real Estate Valuation. Despite the zero-shot approach, GPT-4o outperformed CTGAN in preserving means, 95% confidence intervals, bivariate correlations, and data privacy of RWD, even at amplified sample sizes. Notably, correlations between parameters were consistently preserved with appropriate direction and strength. However, refinement is necessary to better retain distributional characteristics. These findings highlight the potential of LLMs in tabular data synthesis, offering an accessible alternative to generative adversarial networks and variational autoencoders.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_14523
institution	arXiv
publishDate	2025
record_format	arxiv
title	Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2502.14523

Similar Items