:: Library Catalog

Bibliographic Details
Main Author:	Fountzoulas, George
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2604.07969

Similar Items

Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers
by: Jang, Eugene, et al.
Published: (2024)

Distilling Token-Trained Models into Byte-Level Models
by: Bao, Zishuo, et al.
Published: (2026)

Cross-Tokenizer LLM Distillation through a Byte-Level Interface
by: Singh, Avyav Kumar, et al.
Published: (2026)

GPUTOK: GPU Accelerated Byte Level BPE Tokenization
by: Kadamba, Venu Gopal, et al.
Published: (2026)

An Encoder-Integrated PhoBERT with Graph Attention for Vietnamese Token-Level Classification
by: Nguyen, Ba-Quang
Published: (2025)

An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification
by: Rusli, Andre, et al.
Published: (2024)

Token Masking Improves Transformer-Based Text Classification
by: Xu, Xianglong, et al.
Published: (2025)

Byte BPE Tokenization as an Inverse string Homomorphism
by: Geng, Saibo, et al.
Published: (2024)

Multi-Level Attention and Contrastive Learning for Enhanced Text Classification with an Optimized Transformer
by: Gao, Jia, et al.
Published: (2025)

Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles
by: Phan, Buu, et al.
Published: (2024)

Entropy-Driven Pre-Tokenization for Byte-Pair Encoding
by: Hu, Yifan, et al.
Published: (2025)

Back to Bytes: Revisiting Tokenization Through UTF-8
by: Moryossef, Amit, et al.
Published: (2025)

Text Classification Based on Knowledge Graphs and Improved Attention Mechanism
by: Li, Siyu, et al.
Published: (2024)

ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
by: Deng, Chunyuan, et al.
Published: (2026)

Byte Latent Transformer: Patches Scale Better Than Tokens
by: Pagnoni, Artidoro, et al.
Published: (2024)

Towards Token-Level Text Anomaly Detection
by: Cao, Yang, et al.
Published: (2026)

BanglaByT5: Byte-Level Modelling for Bangla
by: Bhattacharyya, Pramit, et al.
Published: (2025)

Learning to Explain: Supervised Token Attribution from Transformer Attention Patterns
by: Mihaila, George
Published: (2026)

Token Prediction as Implicit Classification to Identify LLM-Generated Text
by: Chen, Yutian, et al.
Published: (2023)

Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models
by: Shravan, Rohan
Published: (2026)

MambaByte: Token-free Selective State Space Model
by: Wang, Junxiong, et al.
Published: (2024)

Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data
by: Xia, Han, et al.
Published: (2024)

Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models
by: Deng, Difan, et al.
Published: (2026)

Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection
by: Wu, Chenwang, et al.
Published: (2026)

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
by: Gigant, Théo, et al.
Published: (2026)

Multi-Token Attention
by: Golovneva, Olga, et al.
Published: (2025)

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
by: Zhang, Wanpeng, et al.
Published: (2024)

Arabic Text Diacritization In The Age Of Transfer Learning: Token Classification Is All You Need
by: Skiredj, Abderrahman, et al.
Published: (2024)

Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood
by: Lin, Xingyu, et al.
Published: (2026)

Advancing Text Classification with Large Language Models and Neural Attention Mechanisms
by: Lyu, Ning, et al.
Published: (2025)

Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification
by: Yun, Jungmin, et al.
Published: (2024)

SpaceByte: Towards Deleting Tokenization from Large Language Modeling
by: Slagle, Kevin
Published: (2024)

Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood
by: Lin, Xingyu, et al.
Published: (2025)

Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods
by: Sapkota, Ganesh, et al.
Published: (2025)

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
by: Foroutan, Negar, et al.
Published: (2025)

MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
by: Kallini, Julie, et al.
Published: (2024)

Explainability-Based Token Replacement on LLM-Generated Text
by: Mohammadi, Hadi, et al.
Published: (2025)

UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
by: Firestone, Preston, et al.
Published: (2025)

Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal
by: Lian, Haoran, et al.
Published: (2024)

Vaporetto: Efficient Japanese Tokenization Based on Improved Pointwise Linear Classification
by: Akabe, Koichi, et al.
Published: (2024)