Staff View: The Efficiency Gap in Byte Modeling :: Library Catalog

Bibliographic Details
Main Authors:	Lee, Celine; Yan, Jing Nathan; Liang, Chen; Shi, Jiaxin; Zhang, Yin; Liu, Jeremiah; Yin, Pengcheng; Pereira, Fernando; Chi, Ed; Cheng, Derek; Rush, Alexander M.; Wang, Ruoxi
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2605.12928

_version_	1866909039202402304
author	Lee, Celine; Yan, Jing Nathan; Liang, Chen; Shi, Jiaxin; Zhang, Yin; Liu, Jeremiah; Yin, Pengcheng; Pereira, Fernando; Chi, Ed; Cheng, Derek; Rush, Alexander M.; Wang, Ruoxi
author_facet	Lee, Celine; Yan, Jing Nathan; Liang, Chen; Shi, Jiaxin; Zhang, Yin; Liu, Jeremiah; Yin, Pengcheng; Pereira, Fernando; Chi, Ed; Cheng, Derek; Rush, Alexander M.; Wang, Ruoxi
contents	Modern language models have historically relied on two dominant design choices: subword tokenization and autoregressive (AR) ordering. These design decisions bake in priors that dictate a model's learning. Recently, two alternative paradigms have challenged this: byte-level modeling, which bypasses static statistically-derived token vocabularies, and masked diffusion modeling (MDM), which conducts parallel, non-sequential generation. Their intersection represents a fully end-to-end modality-agnostic generative prototype; however, removing these structural priors incurs a significant computational cost. In this work, we investigate this cost through a compute-matched scaling study. Our results reveal that the performance penalty of byte modeling is not uniform; across scale, the scaling overhead of byte modeling is worse for MDM than for AR. We hypothesize that this disparity stems from context fragility: while AR's stable causal history allows models to naturally rediscover subword patterns, the MDM objective destroys the local contiguity required to efficiently resolve semantics from raw bytes. Our findings from controlled permutation experiments suggest that future modality-agnostic designs must incorporate alternative structural biases to maintain viable scaling trajectories in the byte regime.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_12928
institution	arXiv
publishDate	2026
record_format	arxiv
title	The Efficiency Gap in Byte Modeling
topic	Machine Learning
url	https://arxiv.org/abs/2605.12928
