Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Zou, Yuchun, Tong, Junhong, Li, Jun
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2605.29000
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866917540959092736
author	Zou, Yuchun Tong, Junhong Li, Jun
author_facet	Zou, Yuchun Tong, Junhong Li, Jun
contents	Traditional lossless text compression preserves every byte, but its gains on natural language are often modest in realistic operating regimes. We study \emph{lossy semantic text compression}, where the encoder strategically deletes parts of the text and a large language model (LLM) reconstructs the original content from the retained skeleton. We benchmark a progression of deletion strategies, including uniform step deletion, word-length-guided deletion (WordLen), word-frequency-guided deletion (WordFreq), LP-optimized deletion (Opt), entropy-based deletion using GPT-2 surprisal, and hybrid methods that combine frequency and surprisal signals. Evaluation on the BBC News dataset across retention rates $\r_{keep} \in [0.1,0.9]$ shows three main findings. First, WordFreq is a strong low-cost baseline: despite using only a static frequency lookup, it remains competitive with much more expensive semantic methods while being far faster at the encoder. Second, semantic and hybrid methods provide their clearest gains at mild-to-moderate compression, whereas word-frequency deletion is often more robust at the lowest retention rates. Third, QLoRA fine-tuning yields a strong local decoder that is competitive with Gemini 2.0 Flash and is often strongest in decoder-only comparisons. Additional English and Chinese experiments show that the overall framework transfers across domains, while the best deletion rule remains dataset-dependent.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_29000
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction Zou, Yuchun Tong, Junhong Li, Jun Computation and Language Traditional lossless text compression preserves every byte, but its gains on natural language are often modest in realistic operating regimes. We study \emph{lossy semantic text compression}, where the encoder strategically deletes parts of the text and a large language model (LLM) reconstructs the original content from the retained skeleton. We benchmark a progression of deletion strategies, including uniform step deletion, word-length-guided deletion (WordLen), word-frequency-guided deletion (WordFreq), LP-optimized deletion (Opt), entropy-based deletion using GPT-2 surprisal, and hybrid methods that combine frequency and surprisal signals. Evaluation on the BBC News dataset across retention rates $\r_{keep} \in [0.1,0.9]$ shows three main findings. First, WordFreq is a strong low-cost baseline: despite using only a static frequency lookup, it remains competitive with much more expensive semantic methods while being far faster at the encoder. Second, semantic and hybrid methods provide their clearest gains at mild-to-moderate compression, whereas word-frequency deletion is often more robust at the lowest retention rates. Third, QLoRA fine-tuning yields a strong local decoder that is competitive with Gemini 2.0 Flash and is often strongest in decoder-only comparisons. Additional English and Chinese experiments show that the overall framework transfers across domains, while the best deletion rule remains dataset-dependent.
title	Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction
topic	Computation and Language
url	https://arxiv.org/abs/2605.29000

Similar Items