| Main Authors: | Hu, Yifan, Liang, Frank, Zhao, Dachuan, Geuter, Jonathan, Reddy, Varshini, Schmidt, Craig W., Tanner, Chris |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.15889 |
Similar Items
Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier
by: Schmidt, Craig W., et al.
Published: (2025)
How Much is Enough? The Diminishing Returns of Tokenization Training Data
by: Reddy, Varshini, et al.
Published: (2025)
Tokenization with Split Trees
by: Schmidt, Craig W., et al.
Published: (2026)
Tokenization Is More Than Compression
by: Schmidt, Craig W., et al.
Published: (2024)
The Effect of Scripts and Formats on LLM Numeracy
by: Reddy, Varshini, et al.
Published: (2026)
Faster Superword Tokenization
by: Schmidt, Craig W., et al.
Published: (2026)
SEC-QA: A Systematic Evaluation Corpus for Financial QA
by: Lai, Viet Dac, et al.
Published: (2024)
Greed is All You Need: An Evaluation of Tokenizer Inference Methods
by: Uzan, Omri, et al.
Published: (2024)
No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
by: Krumdick, Michael, et al.
Published: (2025)
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
by: Zhang, Wanpeng, et al.
Published: (2024)
BizBench: A Quantitative Reasoning Benchmark for Business and Finance
by: Koncel-Kedziorski, Rik, et al.
Published: (2023)
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
by: Foroutan, Negar, et al.
Published: (2025)
Theoretical Analysis of Byte-Pair Encoding
by: Kozma, László, et al.
Published: (2024)
A Formal Perspective on Byte-Pair Encoding
by: Zouhar, Vilém, et al.
Published: (2023)
Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods
by: Sapkota, Ganesh, et al.
Published: (2025)
Language Models over Canonical Byte-Pair Encodings
by: Vieira, Tim, et al.
Published: (2025)
DocFinQA: A Long-Context Financial Reasoning Dataset
by: Reddy, Varshini, et al.
Published: (2024)
Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal
by: Lian, Haoran, et al.
Published: (2024)
On Finding Inconsistencies in Documents
by: Lovering, Charles J., et al.
Published: (2025)
Byte BPE Tokenization as an Inverse string Homomorphism
by: Geng, Saibo, et al.
Published: (2024)
Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance
by: Yadav, Saumitra, et al.
Published: (2025)
Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition
by: Samin, Ahnaf Mozib
Published: (2024)
Peek2: Regex-free Byte-level Byte-Pair Encoding Pretokenizer for LLM Inference on Edge Devices
by: Zai, Liu, et al.
Published: (2026)
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
by: Krumdick, Michael, et al.
Published: (2026)
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
by: Limisiewicz, Tomasz, et al.
Published: (2024)
Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers
by: Jang, Eugene, et al.
Published: (2024)
Analyzing Byte-Pair Encoding on Monophonic and Polyphonic Symbolic Music: A Focus on Musical Phrase Segmentation
by: Le, Dinh-Viet-Toan, et al.
Published: (2024)
Back to Bytes: Revisiting Tokenization Through UTF-8
by: Moryossef, Amit, et al.
Published: (2025)
Distilling Token-Trained Models into Byte-Level Models
by: Bao, Zishuo, et al.
Published: (2026)
ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
by: Deng, Chunyuan, et al.
Published: (2026)
An Analysis of Multilingual FActScore
by: Vu, Kim Trong, et al.
Published: (2024)
Byte Latent Transformer: Patches Scale Better Than Tokens
by: Pagnoni, Artidoro, et al.
Published: (2024)
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
by: Shi, Dachuan, et al.
Published: (2023)
Cross-Tokenizer LLM Distillation through a Byte-Level Interface
by: Singh, Avyav Kumar, et al.
Published: (2026)
ByteSpan: Information-Driven Subword Tokenisation
by: Goriely, Zébulon, et al.
Published: (2025)
LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations
by: Lugoloobi, William, et al.
Published: (2026)
MambaByte: Token-free Selective State Space Model
by: Wang, Junxiong, et al.
Published: (2024)
Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding
by: Gautam, Aayush, et al.
Published: (2025)
Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention
by: Fountzoulas, George
Published: (2026)
GPUTOK: GPU Accelerated Byte Level BPE Tokenization
by: Kadamba, Venu Gopal, et al.
Published: (2026)