| Main Author: | Fountzoulas, George |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.07969 |
Similar Items
Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers
by: Jang, Eugene, et al.
Published: (2024)
Distilling Token-Trained Models into Byte-Level Models
by: Bao, Zishuo, et al.
Published: (2026)
Cross-Tokenizer LLM Distillation through a Byte-Level Interface
by: Singh, Avyav Kumar, et al.
Published: (2026)
GPUTOK: GPU Accelerated Byte Level BPE Tokenization
by: Kadamba, Venu Gopal, et al.
Published: (2026)
An Encoder-Integrated PhoBERT with Graph Attention for Vietnamese Token-Level Classification
by: Nguyen, Ba-Quang
Published: (2025)
An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification
by: Rusli, Andre, et al.
Published: (2024)
Token Masking Improves Transformer-Based Text Classification
by: Xu, Xianglong, et al.
Published: (2025)
Byte BPE Tokenization as an Inverse string Homomorphism
by: Geng, Saibo, et al.
Published: (2024)
Multi-Level Attention and Contrastive Learning for Enhanced Text Classification with an Optimized Transformer
by: Gao, Jia, et al.
Published: (2025)
Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles
by: Phan, Buu, et al.
Published: (2024)
Entropy-Driven Pre-Tokenization for Byte-Pair Encoding
by: Hu, Yifan, et al.
Published: (2025)
Back to Bytes: Revisiting Tokenization Through UTF-8
by: Moryossef, Amit, et al.
Published: (2025)
Text Classification Based on Knowledge Graphs and Improved Attention Mechanism
by: Li, Siyu, et al.
Published: (2024)
ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
by: Deng, Chunyuan, et al.
Published: (2026)
Byte Latent Transformer: Patches Scale Better Than Tokens
by: Pagnoni, Artidoro, et al.
Published: (2024)
Towards Token-Level Text Anomaly Detection
by: Cao, Yang, et al.
Published: (2026)
BanglaByT5: Byte-Level Modelling for Bangla
by: Bhattacharyya, Pramit, et al.
Published: (2025)
Learning to Explain: Supervised Token Attribution from Transformer Attention Patterns
by: Mihaila, George
Published: (2026)
Token Prediction as Implicit Classification to Identify LLM-Generated Text
by: Chen, Yutian, et al.
Published: (2023)
Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models
by: Shravan, Rohan
Published: (2026)
MambaByte: Token-free Selective State Space Model
by: Wang, Junxiong, et al.
Published: (2024)
Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data
by: Xia, Han, et al.
Published: (2024)
Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models
by: Deng, Difan, et al.
Published: (2026)
Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection
by: Wu, Chenwang, et al.
Published: (2026)
Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
by: Gigant, Théo, et al.
Published: (2026)
Multi-Token Attention
by: Golovneva, Olga, et al.
Published: (2025)
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
by: Zhang, Wanpeng, et al.
Published: (2024)
Arabic Text Diacritization In The Age Of Transfer Learning: Token Classification Is All You Need
by: Skiredj, Abderrahman, et al.
Published: (2024)
Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood
by: Lin, Xingyu, et al.
Published: (2026)
Advancing Text Classification with Large Language Models and Neural Attention Mechanisms
by: Lyu, Ning, et al.
Published: (2025)
Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification
by: Yun, Jungmin, et al.
Published: (2024)
SpaceByte: Towards Deleting Tokenization from Large Language Modeling
by: Slagle, Kevin
Published: (2024)
Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood
by: Lin, Xingyu, et al.
Published: (2025)
Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods
by: Sapkota, Ganesh, et al.
Published: (2025)
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
by: Foroutan, Negar, et al.
Published: (2025)
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
by: Kallini, Julie, et al.
Published: (2024)
Explainability-Based Token Replacement on LLM-Generated Text
by: Mohammadi, Hadi, et al.
Published: (2025)
UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
by: Firestone, Preston, et al.
Published: (2025)
Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal
by: Lian, Haoran, et al.
Published: (2024)
Vaporetto: Efficient Japanese Tokenization Based on Improved Pointwise Linear Classification
by: Akabe, Koichi, et al.
Published: (2024)