Staff View: More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression :: Library Catalog

Bibliographic Details
Main Authors:	Sood, Aryan; Sharma, Tanvi; Agrawal, Vansh
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2602.02199

_version_	1866918319494266880
author	Sood, Aryan; Sharma, Tanvi; Agrawal, Vansh
author_facet	Sood, Aryan; Sharma, Tanvi; Agrawal, Vansh
contents	While Large Language Models (LLMs) can theoretically support extensive context windows, their actual deployment is constrained by the linear growth of Key-Value (KV) cache memory. Prevailing compression strategies mitigate this through various pruning mechanisms, yet trade off semantic recall for memory efficiency. In this work, we present LASER-KV (Layer Accumulated Selection with Exact-LSH Recall), a framework designed to test the limits of KV compression under a strict accumulative budgeting policy. We deviate from the standard fixed summary size approach by implementing a block-wise accumulation strategy governed by a protection divisor (n). This allows us to isolate the effects of compression from sliding window artifacts. Our experiments on the Babilong benchmark reveal that previous compression methods degrade by 15-30% on various long-context tasks. LASER-KV maintains stable performance, achieving superior accuracies by a margin of up to 10% at 128k. These findings challenge the prevailing assumption that attention scores alone are a sufficient proxy for token utility.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_02199
institution	arXiv
publishDate	2026
record_format	arxiv
title	More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression
topic	Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2602.02199