:: Library Catalog

Bibliographic Details
Main Authors:	Forrester, Chris; Sulea, Octavia
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2505.08058

Similar Items

Exploring Prompt-Based Methods for Zero-Shot Hypernym Prediction with Large Language Models
by: Tikhomirov, Mikhail, et al.
Published: (2024)

SHADE: Semantic Hypernym Annotator for Domain-specific Entities -- DnD Domain Use Case
by: Peiris, Akila, et al.
Published: (2024)

Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet
by: Augello, Lorenzo, et al.
Published: (2025)

HyperBox: A Supervised Approach for Hypernym Discovery using Box Embeddings
by: Parmar, Maulik, et al.
Published: (2022)

Hypernym Bias: Unraveling Deep Classifier Training Dynamics through the Lens of Class Hierarchy
by: Malashin, Roman, et al.
Published: (2025)

On the Semantic and Syntactic Information Encoded in Proto-Tokens for One-Step Text Reconstruction
by: Bondarenko, Ivan, et al.
Published: (2026)

Beyond Text Compression: Evaluating Tokenizers Across Scales
by: Lotz, Jonas F., et al.
Published: (2025)

Breaking Token Into Concepts: Exploring Extreme Compression in Token Representation Via Compositional Shared Semantics
by: R V, Kavin, et al.
Published: (2025)

Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction
by: Zou, Yuchun, et al.
Published: (2026)

KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction
by: Yuan, Aomufei, et al.
Published: (2025)

Tokenization Is More Than Compression
by: Schmidt, Craig W., et al.
Published: (2024)

See the Text: From Tokenization to Visual Reading
by: Xing, Ling, et al.
Published: (2025)

Greed is All You Need: An Evaluation of Tokenizer Inference Methods
by: Uzan, Omri, et al.
Published: (2024)

ACT-MNMT Auto-Constriction Turning for Multilingual Neural Machine Translation
by: Dai, Shaojie, et al.
Published: (2024)

Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP
by: Kim, Eunji, et al.
Published: (2024)

From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction
by: Zhu, Mingcheng, et al.
Published: (2026)

Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
by: Goldman, Omer, et al.
Published: (2024)

Frequency-Ordered Tokenization for Better Text Compression
by: Kalcher, Maximilian
Published: (2026)

CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation
by: Lin, Xiaolin, et al.
Published: (2025)

Lossless Compression of Large Language Model-Generated Text via Next-Token Prediction
by: Mao, Yu, et al.
Published: (2025)

zip2zip: Inference-Time Adaptive Tokenization via Online Compression
by: Geng, Saibo, et al.
Published: (2025)

Learning to Compress Prompts with Gist Tokens
by: Mu, Jesse, et al.
Published: (2023)

LLM-Augmented Semantic Steering of Text Embedding Projection Spaces
by: Liu, Wei, et al.
Published: (2026)

Brain-CLIPLM: Decoding Compressed Semantic Representations in EEG for Language Reconstruction
by: Yang, Xiaoli, et al.
Published: (2026)

SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors
by: Trukhina, Natalia, et al.
Published: (2026)

Faster Superword Tokenization
by: Schmidt, Craig W., et al.
Published: (2026)

Geometry of Semantics in Next-Token Prediction: How Optimization Implicitly Organizes Linguistic Representations
by: Zhao, Yize, et al.
Published: (2025)

From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
by: Shani, Chen, et al.
Published: (2025)

Multi-word Tokenization for Sequence Compression
by: Gee, Leonidas, et al.
Published: (2024)

Morphologically-Informed Tokenizers for Languages with Non-Concatenative Morphology: A case study of Yoloxóchtil Mixtec ASR
by: Crawford, Chris
Published: (2025)

Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation
by: Moroni, Luca, et al.
Published: (2025)

Text2Token: Unsupervised Text Representation Learning with Token Target Prediction
by: An, Ruize, et al.
Published: (2025)

Lossless Token Sequence Compression via Meta-Tokens
by: Harvill, John, et al.
Published: (2025)

From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution
by: Chizhov, Pavel, et al.
Published: (2026)

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
by: Song, Yuhan, et al.
Published: (2025)

Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
by: Manakul, Potsawee, et al.
Published: (2026)

A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens
by: Nie, Zhijie, et al.
Published: (2024)

More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression
by: Zhang, Jiebin, et al.
Published: (2024)

Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
by: Belikova, Julia, et al.
Published: (2026)

Text Compression for Efficient Language Generation
by: Gu, David, et al.
Published: (2025)