Guardado en:
Detalles Bibliográficos
Autores principales: Li, Zhexiang, Wang, Haoyu, Bao, Yutong, Woodruff, David
Formato: Preprint
Publicado: 2025
Materias:
Acceso en línea:https://arxiv.org/abs/2505.11040
Etiquetas: Agregar Etiqueta
Sin Etiquetas, Sea el primero en etiquetar este registro!
_version_ 1866908821250637824
author Li, Zhexiang
Wang, Haoyu
Bao, Yutong
Woodruff, David
author_facet Li, Zhexiang
Wang, Haoyu
Bao, Yutong
Woodruff, David
contents Efficient attention mechanisms enable long-context transformers but often miss globally important tokens, degrading modeling quality. We introduce a pre-scoring framework that assigns a query-independent global importance prior to keys before applying hierarchical approximate attention. Using clustering-based or leverage-style scoring, pre-scoring identifies structurally informative keys and restricts computation to this prioritized subset. Integrated with HyperAttention, pre-scoring substantially improves approximation quality on long-context language modeling: on ChatGLM with 131k-token contexts, perplexity decreases from 12.0 to 9.5 under a fixed interaction budget while retaining subquadratic efficiency. Clustering-based scoring consistently outperforms leverage-based selection under identical key budgets. Beyond language, replacing self-attention in Vision Transformers preserves most of the baseline accuracy, showing that the approach generalizes across modalities. We provide structural guarantees under a planted-subspace model, showing that clustering recovers the same heavy-key sets as leverage-based methods. Overall, pre-scoring improves the efficiency-accuracy trade-off of approximate attention by better prioritizing informative keys without sacrificing scalability.
format Preprint
id arxiv_https___arxiv_org_abs_2505_11040
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers
Li, Zhexiang
Wang, Haoyu
Bao, Yutong
Woodruff, David
Machine Learning
Efficient attention mechanisms enable long-context transformers but often miss globally important tokens, degrading modeling quality. We introduce a pre-scoring framework that assigns a query-independent global importance prior to keys before applying hierarchical approximate attention. Using clustering-based or leverage-style scoring, pre-scoring identifies structurally informative keys and restricts computation to this prioritized subset. Integrated with HyperAttention, pre-scoring substantially improves approximation quality on long-context language modeling: on ChatGLM with 131k-token contexts, perplexity decreases from 12.0 to 9.5 under a fixed interaction budget while retaining subquadratic efficiency. Clustering-based scoring consistently outperforms leverage-based selection under identical key budgets. Beyond language, replacing self-attention in Vision Transformers preserves most of the baseline accuracy, showing that the approach generalizes across modalities. We provide structural guarantees under a planted-subspace model, showing that clustering recovers the same heavy-key sets as leverage-based methods. Overall, pre-scoring improves the efficiency-accuracy trade-off of approximate attention by better prioritizing informative keys without sacrificing scalability.
title Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers
topic Machine Learning
url https://arxiv.org/abs/2505.11040