Bibliographic Details
Main Author: Ovcharov, Volodymyr
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2605.24718
author Ovcharov, Volodymyr
author_facet Ovcharov, Volodymyr
contents Tokenizer fertility, the number of tokens per word, imposes a hidden cost on non-English NLP. We measure fertility for ten foundation models across 25 European languages on parallel text, producing the first controlled tokenizer tax map for the continent. The tax spans 2.5x from English (1.2 tokens/word) to Greek/Maltese (~3.1), following a clear hierarchy: Romance (1.5-1.7), Germanic (1.7-1.9), Slavic (2.2-2.5), Uralic/Baltic (2.7-3.0). Ukrainian (2.7) pays 15-18% more than cognate Slavic languages, reflecting underrepresentation in pre-training data. Fertility rankings are domain-invariant across three text registers (rho > 0.97). A subword analysis reveals that high-fertility tokenizers fragment morphological boundaries rather than preserving them. Cross-lingual few-shot evaluation on four Slavic languages shows that few-shot effects are model-intrinsic, not language-dependent. We release all measurements as a public dataset.
format Preprint
id arxiv_https___arxiv_org_abs_2605_24718
institution arXiv
publishDate 2026
record_format arxiv
title The Tokenizer Tax Across 25 European Languages: Domain Invariance, Cross-Lingual Few-Shot Effects, and the Ukrainian Penalty
topic Computation and Language
url https://arxiv.org/abs/2605.24718
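
The abstract in this record defines tokenizer fertility as the number of tokens per word and reports the "tax" as a ratio against English. The following is a minimal Python sketch of that arithmetic, not the preprint's measurement code. It assumes words are whitespace-delimited, and `toy_chunk_tokenizer` is a hypothetical stand-in for a real model tokenizer; the names `fertility` and `tax_ratio` are likewise illustrative.

```python
from typing import Callable, List


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def toy_chunk_tokenizer(text: str, size: int = 4) -> List[str]:
    """Hypothetical tokenizer: cuts each word into chunks of at most `size` characters."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + size] for i in range(0, len(word), size))
    return tokens


def tax_ratio(fertility_lang: float, fertility_ref: float) -> float:
    """Relative token cost of one language against a reference language."""
    return fertility_lang / fertility_ref


if __name__ == "__main__":
    # Short words fit in one chunk each; long words are fragmented.
    print(fertility("the cat sat", toy_chunk_tokenizer))
    print(fertility("tokenization matters", toy_chunk_tokenizer))
    # Fertility values quoted in the abstract: Greek/Maltese (~3.1) vs English (1.2).
    print(round(tax_ratio(3.1, 1.2), 2))
```

With a real tokenizer, the `tokenize` argument would be replaced by a function that returns that model's token list for the input text; the rest of the calculation stays the same.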