Bibliographic Details
Main Authors: Gao, Zitian, Luo, Haoming, Chen, Lynx, Liu, Jason Klein, Tao, Ran, Zhou, Joey, Dai, Bryan
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2510.04071
_version_ 1866909825524301824
author Gao, Zitian
Luo, Haoming
Chen, Lynx
Liu, Jason Klein
Tao, Ran
Zhou, Joey
Dai, Bryan
author_facet Gao, Zitian
Luo, Haoming
Chen, Lynx
Liu, Jason Klein
Tao, Ran
Zhou, Joey
Dai, Bryan
contents Recent studies have shown that diffusion language models achieve remarkable data efficiency under limited-data constraints, yet the underlying mechanisms remain unclear. In this work, we perform extensive ablation experiments to disentangle the sources of this efficiency. Our results show that random masking of input tokens plays the dominant role. We further show that similar gains can be obtained through MLP dropout and weight decay, indicating that stochastic regularization broadly enhances data efficiency in multi-epoch training. Our code is available at https://github.com/zitian-gao/data-efficiency.
format Preprint
id arxiv_https___arxiv_org_abs_2510_04071
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle What Makes Diffusion Language Models Super Data Learners?
Gao, Zitian
Luo, Haoming
Chen, Lynx
Liu, Jason Klein
Tao, Ran
Zhou, Joey
Dai, Bryan
Computation and Language
Recent studies have shown that diffusion language models achieve remarkable data efficiency under limited-data constraints, yet the underlying mechanisms remain unclear. In this work, we perform extensive ablation experiments to disentangle the sources of this efficiency. Our results show that random masking of input tokens plays the dominant role. We further show that similar gains can be obtained through MLP dropout and weight decay, indicating that stochastic regularization broadly enhances data efficiency in multi-epoch training. Our code is available at https://github.com/zitian-gao/data-efficiency.
title What Makes Diffusion Language Models Super Data Learners?
topic Computation and Language
url https://arxiv.org/abs/2510.04071