| Main Authors: | Ye, Mengyu; Kudo, Keito; Takahashi, Ryosuke; Suzuki, Jun |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computation and Language; Machine Learning |
| Online Access: | https://arxiv.org/abs/2601.22947 |
| author | Ye, Mengyu; Kudo, Keito; Takahashi, Ryosuke; Suzuki, Jun |
| author_facet | Ye, Mengyu; Kudo, Keito; Takahashi, Ryosuke; Suzuki, Jun |
| contents | Masked diffusion language models (MDLMs) generate text by unmasking tokens in parallel and have recently emerged as alternatives to autoregressive language models. They can be viewed as parallel decoders trained with a position-wise cross-entropy (CE) loss, the same setup as non-autoregressive translation (NAT). In NAT, CE-trained parallel decoders have been argued to be sensitive to small positional shifts, since CE penalizes them harshly. We ask whether CE-trained MDLMs are similarly sensitive to such shifts under iterative decoding. To probe this, we apply a controlled intervention that introduces such shifts during decoding. On LLaDA-8B-Instruct with Arena-Hard, displacing as little as 1% of generated tokens by one position substantially reduces win rates against the unintervened model, showing that MDLMs are sensitive to such small shifts under iterative parallel decoding. Motivated by this, we adapt connectionist temporal classification (CTC), an alignment-flexible objective known to mitigate this sensitivity in NAT, to MDLM supervised fine-tuning (SFT). By relaxing the strict position-wise match that CE imposes, CTC gives the loss room to absorb small positional shifts; concretely, we modify the CTC objective to use a special <slack> token that absorbs positional uncertainty between target tokens and output positions, and an updated collapse map that preserves target surface forms. Across four open-ended generation benchmarks, the resulting model consistently improves over both the original model and a matched cross-entropy-trained baseline, with statistically significant gains on all four. These results identify training-side alignment flexibility as a useful design dimension for MDLM SFT, complementary to the inference-time approaches explored in prior work. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2601_22947 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Reconsidering Positional Supervision in Masked Diffusion Language Model Training |
| topic | Computation and Language; Machine Learning |
| url | https://arxiv.org/abs/2601.22947 |
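
The abstract's contrast between position-wise CE and an alignment-flexible objective can be illustrated with a minimal sketch. It assumes the textbook CTC collapse rule (merge adjacent repeats, then drop the blank symbol) and treats `<slack>` as that blank. The record does not specify the paper's updated collapse map, so this is not the authors' method; the function names `ctc_collapse` and `position_wise_matches` are hypothetical.

```python
from itertools import groupby

# Assumption: "<slack>" plays the role of the CTC blank symbol here.
SLACK = "<slack>"


def ctc_collapse(path, blank=SLACK):
    """Textbook CTC collapse: merge adjacent repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != blank]


def position_wise_matches(path, target):
    """Count positions where output and target agree, as a position-wise loss sees them."""
    return sum(p == t for p, t in zip(path, target))


target = ["the", "cat", "sat"]
shifted = [SLACK, "the", "cat", "sat"]  # same words, displaced by one position

print(ctc_collapse(shifted))                   # ['the', 'cat', 'sat']
print(position_wise_matches(shifted, target))  # 0
```

A one-position shift leaves no position matching the target, yet the collapsed sequence is identical to it. Note that the textbook rule also merges genuinely repeated tokens unless a blank separates them.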