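Staff View: Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence :: Library Catalog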

Bibliographic Details
Main Authors:	Yi, Bingji; Liu, Qiyuan; Cheng, Yuwei; Xu, Haifeng
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2510.16657

_version_	1866917316361453568
author	Yi, Bingji; Liu, Qiyuan; Cheng, Yuwei; Xu, Haifeng
author_facet	Yi, Bingji; Liu, Qiyuan; Cheng, Yuwei; Xu, Haifeng
contents	Synthetic data has been increasingly used to train frontier generative models. However, recent studies raise key concerns that iteratively retraining a generative model on its self-generated synthetic data may keep deteriorating model performance, a phenomenon often called model collapse. In this paper, we investigate ways to modify the synthetic retraining process to avoid model collapse, and possibly even reverse the trend from collapse to improvement. Our key finding is that by injecting information through an external synthetic data verifier, whether a human or a better model, synthetic retraining will not cause model collapse. Specifically, we situate our theoretical analysis in the fundamental linear regression setting, showing that verifier-guided retraining can yield near-term improvements, but ultimately drives the parameter estimate to the verifier's "knowledge center" in the long run. Our theory further predicts that, unless the verifier is perfectly reliable, these early gains will plateau and may even reverse. Indeed, our experiments across linear regression, Variational Autoencoders (VAEs) trained on MNIST, and SmolLM2-135M fine-tuned on the XSUM task confirm these theoretical insights.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_16657
institution	arXiv
publishDate	2025
record_format	arxiv
title	Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence
topic	Machine Learning
url	https://arxiv.org/abs/2510.16657

Similar Items