Bibliographic Details
Main Authors: Lee, Celine; Yan, Jing Nathan; Liang, Chen; Shi, Jiaxin; Zhang, Yin; Liu, Jeremiah; Yin, Pengcheng; Pereira, Fernando; Chi, Ed; Cheng, Derek; Rush, Alexander M.; Wang, Ruoxi
Format: Preprint
Published: 2026
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2605.12928
_version_ 1866909039202402304
author Lee, Celine
Yan, Jing Nathan
Liang, Chen
Shi, Jiaxin
Zhang, Yin
Liu, Jeremiah
Yin, Pengcheng
Pereira, Fernando
Chi, Ed
Cheng, Derek
Rush, Alexander M.
Wang, Ruoxi
contents Modern language models have historically relied on two dominant design choices: subword tokenization and autoregressive (AR) ordering. These design decisions bake in priors that dictate a model's learning. Recently, two alternative paradigms have challenged these choices: byte-level modeling, which bypasses static, statistically derived token vocabularies, and masked diffusion modeling (MDM), which conducts parallel, non-sequential generation. Their intersection represents a fully end-to-end, modality-agnostic generative prototype; however, removing these structural priors incurs a significant computational cost. In this work, we investigate this cost through a compute-matched scaling study. Our results reveal that the performance penalty of byte modeling is not uniform; across scales, the overhead of byte modeling is larger for MDM than for AR. We hypothesize that this disparity stems from context fragility: while AR's stable causal history allows models to naturally rediscover subword patterns, the MDM objective destroys the local contiguity required to efficiently resolve semantics from raw bytes. Our findings from controlled permutation experiments suggest that future modality-agnostic designs must incorporate alternative structural biases to maintain viable scaling trajectories in the byte regime.
format Preprint
id arxiv_https___arxiv_org_abs_2605_12928
institution arXiv
publishDate 2026
record_format arxiv
title The Efficiency Gap in Byte Modeling
topic Machine Learning
url https://arxiv.org/abs/2605.12928