Staff View: Radar: Fast Long-Context Decoding for Any Transformer :: Library Catalog

Bibliographic Details
Main Authors:	Hao, Yongchang; Zhai, Mengyao; Hajimirsadeghi, Hossein; Hosseini, Sepidehsadat; Tung, Frederick
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2503.10571

_version_	1866929758546165760
author	Hao, Yongchang; Zhai, Mengyao; Hajimirsadeghi, Hossein; Hosseini, Sepidehsadat; Tung, Frederick
author_facet	Hao, Yongchang; Zhai, Mengyao; Hajimirsadeghi, Hossein; Hosseini, Sepidehsadat; Tung, Frederick
contents	Transformer models have demonstrated exceptional performance across a wide range of applications. Although it forms the foundation of Transformer models, dot-product attention does not scale well to long-context data because its time requirement grows quadratically with context length. In this work, we propose Radar, a training-free approach that accelerates inference by dynamically searching for the most important context tokens. For any pre-trained Transformer, Radar can reduce the decoding time complexity without training or heuristically evicting tokens. Moreover, we provide theoretical justification for our approach, demonstrating that Radar can reliably identify the most important tokens with high probability. We conduct extensive comparisons with previous methods on a wide range of tasks. The results demonstrate that Radar achieves state-of-the-art performance across different architectures with reduced time complexity, offering a practical solution for efficient long-context processing with Transformers.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_10571
institution	arXiv
publishDate	2025
record_format	arxiv
title	Radar: Fast Long-Context Decoding for Any Transformer
topic	Machine Learning
url	https://arxiv.org/abs/2503.10571

Similar Items