Saved in:
Bibliographic Details
Main Authors: Bakshi, Soham, Chakraborty, Sunrit
Format: Preprint
Published: 2026
Subjects:
Online Access:https://arxiv.org/abs/2602.10531
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866918343541260288
author Bakshi, Soham
Chakraborty, Sunrit
author_facet Bakshi, Soham
Chakraborty, Sunrit
contents The problem of model collapse has presented new challenges in iterative training of generative models, where such training with synthetic data leads to an overall degradation of performance. This paper looks at the problem from a statistical viewpoint, illustrating that one can actually hope for improvement when models are trained on data contaminated with synthetic samples, as long as there is some amount of fresh information from the true target distribution. In particular, we consider iterative training on samples sourced from a mixture of the true target and synthetic distributions. We analyze the entire iterative evolution in a next-token prediction language model, capturing how the interplay between the mixture weights and the sample size controls the overall long-term performance. With non-trivial mixture weight of the true distribution, even if it decays over time, simply training the model in a contamination-agnostic manner with appropriate sample sizes can avoid collapse and even recover the true target distribution under certain conditions. Simulation studies support our findings and also show that such behavior is more general for other classes of models.
format Preprint
id arxiv_https___arxiv_org_abs_2602_10531
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle From Collapse to Improvement: Statistical Perspectives on the Evolutionary Dynamics of Iterative Training on Contaminated Sources
Bakshi, Soham
Chakraborty, Sunrit
Machine Learning
Statistics Theory
The problem of model collapse has presented new challenges in iterative training of generative models, where such training with synthetic data leads to an overall degradation of performance. This paper looks at the problem from a statistical viewpoint, illustrating that one can actually hope for improvement when models are trained on data contaminated with synthetic samples, as long as there is some amount of fresh information from the true target distribution. In particular, we consider iterative training on samples sourced from a mixture of the true target and synthetic distributions. We analyze the entire iterative evolution in a next-token prediction language model, capturing how the interplay between the mixture weights and the sample size controls the overall long-term performance. With non-trivial mixture weight of the true distribution, even if it decays over time, simply training the model in a contamination-agnostic manner with appropriate sample sizes can avoid collapse and even recover the true target distribution under certain conditions. Simulation studies support our findings and also show that such behavior is more general for other classes of models.
title From Collapse to Improvement: Statistical Perspectives on the Evolutionary Dynamics of Iterative Training on Contaminated Sources
topic Machine Learning
Statistics Theory
url https://arxiv.org/abs/2602.10531