:: Library Catalog

Bibliographic Details
Main Authors:	Chizhov, Pavel; Arnett, Catherine; Korotkova, Elizaveta; Yamshchikov, Ivan P.
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2409.04599

Similar Items

From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution
by: Chizhov, Pavel, et al.
Published: (2026)

Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models
by: Purason, Taido, et al.
Published: (2025)

Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models
by: Sorokovikova, Aleksandra, et al.
Published: (2025)

Toxicity of the Commons: Curating Open-Source Pre-Training Data
by: Arnett, Catherine, et al.
Published: (2024)

What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks
by: Chizhov, Pavel, et al.
Published: (2025)

BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
by: Land, Sander, et al.
Published: (2025)

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
by: Langlais, Pierre-Carl, et al.
Published: (2025)

The Company You Keep: How LLMs Respond to Dark Triad Traits
by: Lu, Zeyi, et al.
Published: (2026)

Vocabulary Transfer for Biomedical Texts: Add Tokens if You Can Not Add Data
by: Singh, Priyanka, et al.
Published: (2022)

Model in Distress: Sentiment Analysis on French Synthetic Social Media
by: Langlais, Pierre-Carl, et al.
Published: (2026)

Beyond Toxic: Toxicity Detection Datasets are Not Enough for Brand Safety
by: Korotkova, Elizaveta, et al.
Published: (2023)

BlockBPE: Parallel BPE Tokenization
by: You, Amos
Published: (2025)

Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models
by: Balde, Gunjan, et al.
Published: (2024)

Batching BPE Tokenization Merges
by: Morgan, Alexander P.
Published: (2024)

CleanComedy: Creating Friendly Humor through Generative Techniques
by: Vikhorev, Dmitry, et al.
Published: (2024)

An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
by: Cognetta, Marco, et al.
Published: (2024)

Constructing a BPE Tokenization DFA
by: Berglund, Martin, et al.
Published: (2024)

Even Small Reasoners Should Quote Their Sources: Introducing the Pleias-RAG Model Family
by: Langlais, Pierre-Carl, et al.
Published: (2025)

Byte BPE Tokenization as an Inverse string Homomorphism
by: Geng, Saibo, et al.
Published: (2024)

Evaluating Morphological Alignment of Tokenizers in 70 Languages
by: Arnett, Catherine, et al.
Published: (2025)

LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers
by: Sun, Yike, et al.
Published: (2026)

From Characters to Tokens: Dynamic Grouping with Hierarchical BPE
by: Dolga, Rares, et al.
Published: (2025)

AdaptBPE: From General Purpose to Specialized Tokenizers
by: Liyanage, Vijini, et al.
Published: (2026)

Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction
by: Michaelov, James A., et al.
Published: (2025)

MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
by: Asgari, Ehsaneddin, et al.
Published: (2025)

Fine-Tuning Transformers: Vocabulary Transfer
by: Mosin, Vladislav, et al.
Published: (2021)

Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
by: Hayase, Jonathan, et al.
Published: (2024)

Weight Tying Biases Token Embeddings Towards the Output Space
by: Lopardo, Antonio, et al.
Published: (2026)

Transfer of Structural Knowledge from Synthetic Languages
by: Budnikov, Mikhail, et al.
Published: (2025)

Explaining and Mitigating Crosslingual Tokenizer Inequities
by: Arnett, Catherine, et al.
Published: (2025)

Sui Generis: Large Language Models for Authorship Attribution and Verification in Latin
by: Schmidt, Gleb, et al.
Published: (2024)

ComicScene154: A Scene Dataset for Comic Analysis
by: Paval, Sandro, et al.
Published: (2025)

GPUTOK: GPU Accelerated Byte Level BPE Tokenization
by: Kadamba, Venu Gopal, et al.
Published: (2026)

Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement
by: Arnett, Catherine, et al.
Published: (2024)

Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE
by: Patwary, Firoj Ahmmed, et al.
Published: (2025)

What is Wrong with Language Models that Can Not Tell a Story?
by: Yamshchikov, Ivan P., et al.
Published: (2022)

Knowledge Graph Representation for Political Information Sources
by: Osmonova, Tinatin, et al.
Published: (2024)

Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
by: Vemula, Saketh Reddy, et al.
Published: (2025)

Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models
by: Sawada, Tomohiro, et al.
Published: (2025)

Do Data-based Curricula Work?
by: Surkov, Maxim K., et al.
Published: (2021)