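Staff View: Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression :: Library Catalog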

Bibliographic Details
Main Authors:	Wang, Haoyu; Teng, Tong; Guo, Tianyu; Xiao, An; Tang, Duyu; Chen, Hanting; Wang, Yunhe
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2502.14477

_version_	1866916622770372608
author	Wang, Haoyu; Teng, Tong; Guo, Tianyu; Xiao, An; Tang, Duyu; Chen, Hanting; Wang, Yunhe
author_facet	Wang, Haoyu; Teng, Tong; Guo, Tianyu; Xiao, An; Tang, Duyu; Chen, Hanting; Wang, Yunhe
contents	Handling long-context sequences efficiently remains a significant challenge in large language models (LLMs). Existing methods for token selection in sequence extrapolation either employ a permanent eviction strategy or select tokens by chunk, which may lead to the loss of critical information. We propose Efficient Selective Attention (ESA), a novel approach that extends context length by efficiently selecting the most critical tokens at the token level to compute attention. ESA reduces the computational complexity of token selection by compressing query and key vectors into lower-dimensional representations. We evaluate ESA on long sequence benchmarks with maximum lengths up to 256k using open-source LLMs with context lengths of 8k and 32k. ESA outperforms other selective attention methods, especially in tasks requiring the retrieval of multiple pieces of information, achieving comparable performance to full-attention extrapolation methods across various tasks, with superior results in certain tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_14477
institution	arXiv
publishDate	2025
record_format	arxiv
title	Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression
topic	Computation and Language
url	https://arxiv.org/abs/2502.14477
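
Illustration (not part of the catalog record): the abstract in the contents field describes scoring tokens with compressed query and key vectors and then attending only to the highest-scoring tokens. The sketch below is one possible reading of that idea, not the authors' implementation. It assumes a single shared random projection matrix (`proj`) as a stand-in for whatever compression the paper actually uses, a single query vector, and a single attention head; the function names `select_tokens` and `selective_attention` are hypothetical.

```python
import numpy as np


def select_tokens(query, keys, proj, k):
    """Return sorted indices of the k keys that score highest in the compressed space.

    query: (d,) vector, keys: (n, d) matrix, proj: (d, d_low) projection.
    The shared projection is an assumption made for this sketch.
    """
    q_low = query @ proj              # compressed query, shape (d_low,)
    k_low = keys @ proj               # compressed keys, shape (n, d_low)
    scores = k_low @ q_low            # cheap approximate relevance, shape (n,)
    k = min(k, scores.shape[0])
    top = np.argpartition(scores, -k)[-k:]   # indices of the k largest scores
    return np.sort(top)               # keep the original token order


def selective_attention(query, keys, values, proj, k):
    """Softmax attention restricted to the tokens chosen by select_tokens."""
    idx = select_tokens(query, keys, proj, k)
    logits = keys[idx] @ query / np.sqrt(query.shape[0])   # full-dimension scores
    weights = np.exp(logits - logits.max())                # numerically stable softmax
    weights /= weights.sum()
    return weights @ values[idx], idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, d_low = 1024, 64, 8          # illustrative sizes only
    keys = rng.normal(size=(n, d))
    values = rng.normal(size=(n, d))
    query = rng.normal(size=d)
    proj = rng.normal(size=(d, d_low)) / np.sqrt(d_low)    # hypothetical random projection
    out, idx = selective_attention(query, keys, values, proj, k=32)
    print(out.shape, idx[:5])
```

Selection in this sketch costs on the order of n × d_low multiplications per query instead of n × d, and full-dimension attention is then computed over only k tokens. Selection happens per token rather than per chunk, which matches the token-level selection the abstract mentions.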

Similar Items