| Main Authors: | Chizhov, Pavel, Arnett, Catherine, Korotkova, Elizaveta, Yamshchikov, Ivan P. |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2409.04599 |
Similar Items
From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution
by: Chizhov, Pavel, et al.
Published: (2026)
Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models
by: Purason, Taido, et al.
Published: (2025)
Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models
by: Sorokovikova, Aleksandra, et al.
Published: (2025)
Toxicity of the Commons: Curating Open-Source Pre-Training Data
by: Arnett, Catherine, et al.
Published: (2024)
What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks
by: Chizhov, Pavel, et al.
Published: (2025)
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
by: Land, Sander, et al.
Published: (2025)
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
by: Langlais, Pierre-Carl, et al.
Published: (2025)
The Company You Keep: How LLMs Respond to Dark Triad Traits
by: Lu, Zeyi, et al.
Published: (2026)
Vocabulary Transfer for Biomedical Texts: Add Tokens if You Can Not Add Data
by: Singh, Priyanka, et al.
Published: (2022)
Model in Distress: Sentiment Analysis on French Synthetic Social Media
by: Langlais, Pierre-Carl, et al.
Published: (2026)
Beyond Toxic: Toxicity Detection Datasets are Not Enough for Brand Safety
by: Korotkova, Elizaveta, et al.
Published: (2023)
BlockBPE: Parallel BPE Tokenization
by: You, Amos
Published: (2025)
Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models
by: Balde, Gunjan, et al.
Published: (2024)
Batching BPE Tokenization Merges
by: Morgan, Alexander P.
Published: (2024)
CleanComedy: Creating Friendly Humor through Generative Techniques
by: Vikhorev, Dmitry, et al.
Published: (2024)
An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
by: Cognetta, Marco, et al.
Published: (2024)
Constructing a BPE Tokenization DFA
by: Berglund, Martin, et al.
Published: (2024)
Even Small Reasoners Should Quote Their Sources: Introducing the Pleias-RAG Model Family
by: Langlais, Pierre-Carl, et al.
Published: (2025)
Byte BPE Tokenization as an Inverse string Homomorphism
by: Geng, Saibo, et al.
Published: (2024)
Evaluating Morphological Alignment of Tokenizers in 70 Languages
by: Arnett, Catherine, et al.
Published: (2025)
LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers
by: Sun, Yike, et al.
Published: (2026)
From Characters to Tokens: Dynamic Grouping with Hierarchical BPE
by: Dolga, Rares, et al.
Published: (2025)
AdaptBPE: From General Purpose to Specialized Tokenizers
by: Liyanage, Vijini, et al.
Published: (2026)
Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction
by: Michaelov, James A., et al.
Published: (2025)
MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
by: Asgari, Ehsaneddin, et al.
Published: (2025)
Fine-Tuning Transformers: Vocabulary Transfer
by: Mosin, Vladislav, et al.
Published: (2021)
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
by: Hayase, Jonathan, et al.
Published: (2024)
Weight Tying Biases Token Embeddings Towards the Output Space
by: Lopardo, Antonio, et al.
Published: (2026)
Transfer of Structural Knowledge from Synthetic Languages
by: Budnikov, Mikhail, et al.
Published: (2025)
Explaining and Mitigating Crosslingual Tokenizer Inequities
by: Arnett, Catherine, et al.
Published: (2025)
Sui Generis: Large Language Models for Authorship Attribution and Verification in Latin
by: Schmidt, Gleb, et al.
Published: (2024)
ComicScene154: A Scene Dataset for Comic Analysis
by: Paval, Sandro, et al.
Published: (2025)
GPUTOK: GPU Accelerated Byte Level BPE Tokenization
by: Kadamba, Venu Gopal, et al.
Published: (2026)
Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement
by: Arnett, Catherine, et al.
Published: (2024)
Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE
by: Patwary, Firoj Ahmmed, et al.
Published: (2025)
What is Wrong with Language Models that Can Not Tell a Story?
by: Yamshchikov, Ivan P., et al.
Published: (2022)
Knowledge Graph Representation for Political Information Sources
by: Osmonova, Tinatin, et al.
Published: (2024)
Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
by: Vemula, Saketh Reddy, et al.
Published: (2025)
Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models
by: Sawada, Tomohiro, et al.
Published: (2025)
Do Data-based Curricula Work?
by: Surkov, Maxim K., et al.
Published: (2021)