Bibliographic Details
Main Authors: Jiang, Jialong, Hu, Wenkang, Huang, Jian, Jiao, Yuling, Liu, Xu
Format: Preprint
Published: 2025
Subjects: Machine Learning, Applications
Online Access: https://arxiv.org/abs/2505.04992
_version_ 1866910932670611456
author Jiang, Jialong
Hu, Wenkang
Huang, Jian
Jiao, Yuling
Liu, Xu
contents The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? Although these models can generate vast amounts of data, only a subset of it meaningfully improves performance. We propose a novel end-to-end framework that generates synthetic data and systematically filters it with domain-specific statistical methods, integrating only high-quality samples for augmentation. Our experiments demonstrate consistent improvements in predictive performance across various settings, highlighting the potential of the framework while underscoring an inherent limitation of generative models for data augmentation: despite the large volumes of synthetic data they can produce, the proportion that effectively improves model performance is limited.
format Preprint
id arxiv_https___arxiv_org_abs_2505_04992
institution arXiv
publishDate 2025
record_format arxiv
title Boosting Statistic Learning with Synthetic Data from Pretrained Large Models
topic Machine Learning
Applications
url https://arxiv.org/abs/2505.04992