| Main Authors: | Bommarito, Michael J, Katz, Daniel Martin, Bommarito, Jillian |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.17247 |
Similar Items
The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models
by: Bommarito II, Michael J, et al.
Published: (2025)
Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary
by: Bommarito, Michael J, et al.
Published: (2025)
Natural Language Processing in the Legal Domain
by: Hartung, Dirk, et al.
Published: (2023)
Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis
by: Bommarito II, Michael J.
Published: (2025)
OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph
by: Bommarito II, Michael J.
Published: (2025)
Needles at Scale: LLM-Assisted Target Selection for Windows Vulnerability Research
by: Bommarito II, Michael J.
Published: (2026)
Token Alignment via Character Matching for Subword Completion
by: Athiwaratkun, Ben, et al.
Published: (2024)
Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection
by: Bommarito II, Michael J.
Published: (2025)
Empowering Character-level Text Infilling by Eliminating Sub-Tokens
by: Ren, Houxing, et al.
Published: (2024)
From Language Models over Tokens to Language Models over Characters
by: Vieira, Tim, et al.
Published: (2024)
Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token
by: Zychlinski, Shaked, et al.
Published: (2025)
Do BERT Embeddings Encode Narrative Dimensions? A Token-Level Probing Analysis of Time, Space, Causality, and Character in Fiction
by: Bei, Beicheng, et al.
Published: (2026)
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
by: Nguyen, Truong, et al.
Published: (2026)
Incorporating Domain Knowledge into Materials Tokenization
by: Oh, Yerim, et al.
Published: (2025)
Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding
by: Zhu, Yifan, et al.
Published: (2026)
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
by: Lin, Zicheng, et al.
Published: (2024)
The Token Tax: Systematic Bias in Multilingual Tokenization
by: Lundin, Jessica M., et al.
Published: (2025)
IGOT: Information Gain Optimized Tokenizer on Domain Adaptive Pretraining
by: Feng, Dawei, et al.
Published: (2024)
Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study
by: Ovcharov, Volodymyr
Published: (2026)
State over Tokens: Characterizing the Role of Reasoning Tokens
by: Levy, Mosh, et al.
Published: (2025)
findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding
by: Martínez, Héctor Javier Vázquez
Published: (2026)
Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR
by: Zhuang, Haomin, et al.
Published: (2025)
GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
by: Tan, Hongze, et al.
Published: (2025)
Token-Level Uncertainty-Aware Objective for Language Model Post-Training
by: Liu, Tingkai, et al.
Published: (2025)
Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference
by: Han, Chao, et al.
Published: (2025)
A Theory for Token-Level Harmonization in Retrieval-Augmented Generation
by: Xu, Shicheng, et al.
Published: (2024)
T-REG: Preference Optimization with Token-Level Reward Regularization
by: Zhou, Wenxuan, et al.
Published: (2024)
Token-Level LLM Collaboration via FusionRoute
by: Xiong, Nuoya, et al.
Published: (2026)
TokenButler: Token Importance is Predictable
by: Akhauri, Yash, et al.
Published: (2025)
CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning
by: Lu, Zhiyuan, et al.
Published: (2026)
Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization
by: Wang, Dixuan, et al.
Published: (2024)
TS-PEFT: Unveiling Token-Level Redundancy in Parameter-Efficient Fine-Tuning
by: Ma, Dabiao, et al.
Published: (2025)
Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models
by: Zhang, Xiang, et al.
Published: (2025)
When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling
by: Yun, Heecheol, et al.
Published: (2025)
SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling
by: Liu, Dong, et al.
Published: (2025)
One Token Is Enough: Improving Diffusion Language Models with a Sink Token
by: Zhang, Zihou, et al.
Published: (2026)
FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models
by: Bhatia, Gagan, et al.
Published: (2024)
Scalable Token-Level Hallucination Detection in Large Language Models
by: Min, Rui, et al.
Published: (2026)
RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution
by: Li, Jiahui, et al.
Published: (2024)
Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels
by: Ye, Junjie, et al.
Published: (2025)