Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Bakshi, Soham; Chakraborty, Sunrit
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Statistics Theory
Online Access:	https://arxiv.org/abs/2602.10531

_version_	1866918343541260288
author	Bakshi, Soham; Chakraborty, Sunrit
author_facet	Bakshi, Soham; Chakraborty, Sunrit
contents	The problem of model collapse has presented new challenges in iterative training of generative models, where such training with synthetic data leads to an overall degradation of performance. This paper looks at the problem from a statistical viewpoint, illustrating that one can actually hope for improvement when models are trained on data contaminated with synthetic samples, as long as there is some amount of fresh information from the true target distribution. In particular, we consider iterative training on samples sourced from a mixture of the true target and synthetic distributions. We analyze the entire iterative evolution in a next-token prediction language model, capturing how the interplay between the mixture weights and the sample size controls the overall long-term performance. With a non-trivial mixture weight on the true distribution, even if it decays over time, simply training the model in a contamination-agnostic manner with appropriate sample sizes can avoid collapse and even recover the true target distribution under certain conditions. Simulation studies support our findings and also show that such behavior holds more generally for other classes of models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_10531
institution	arXiv
publishDate	2026
record_format	arxiv
title	From Collapse to Improvement: Statistical Perspectives on the Evolutionary Dynamics of Iterative Training on Contaminated Sources
topic	Machine Learning; Statistics Theory
url	https://arxiv.org/abs/2602.10531
