Staff View: SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding :: Library Catalog

Bibliographic Details
Main Authors:	Hou, Shuyang; Hu, Yi; Zhang, Muhan
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.09089

_version_	1866911374904393728
author	Hou, Shuyang; Hu, Yi; Zhang, Muhan
author_facet	Hou, Shuyang; Hu, Yi; Zhang, Muhan
contents	Recent advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities. However, they continue to struggle with basic character-level tasks, such as counting letters in words, a problem rooted in their tokenization process. While existing benchmarks have highlighted this weakness through basic character operations, such failures are often dismissed as lacking practical relevance. Yet many real-world applications, such as navigating text-based maps or interpreting structured tables, rely heavily on precise sub-token understanding. To address this, we introduce SubTokenTest, a comprehensive benchmark that assesses sub-token understanding through practical, utility-driven tasks. Our benchmark includes ten tasks across four domains and isolates tokenization-related failures by decoupling performance from complex reasoning. We provide a comprehensive evaluation of nine advanced LLMs. Additionally, we investigate the impact of test-time scaling on sub-token reasoning and explore how character-level information is encoded within the hidden states.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_09089
institution	arXiv
publishDate	2026
record_format	arxiv
title	SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2601.09089