Bibliographic Details
Main Authors: Hao, Yongchang, Zhai, Mengyao, Hajimirsadeghi, Hossein, Hosseini, Sepidehsadat, Tung, Frederick
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2503.10571
author Hao, Yongchang
Zhai, Mengyao
Hajimirsadeghi, Hossein
Hosseini, Sepidehsadat
Tung, Frederick
contents Transformer models have demonstrated exceptional performance across a wide range of applications. Although it forms the foundation of Transformer models, dot-product attention does not scale well to long-context data because its time requirement grows quadratically with context length. In this work, we propose Radar, a training-free approach that accelerates inference by dynamically searching for the most important context tokens. For any pre-trained Transformer, Radar can reduce the decoding time complexity without training or heuristically evicting tokens. Moreover, we provide theoretical justification for our approach, demonstrating that Radar can reliably identify the most important tokens with high probability. We conduct extensive comparisons with previous methods on a wide range of tasks. The results demonstrate that Radar achieves state-of-the-art performance across different architectures with reduced time complexity, offering a practical solution for efficient long-context processing of Transformers.
format Preprint
id arxiv_https___arxiv_org_abs_2503_10571
institution arXiv
publishDate 2025
record_format arxiv
title Radar: Fast Long-Context Decoding for Any Transformer
topic Machine Learning
url https://arxiv.org/abs/2503.10571