Guardado en:
| Autores principales: | , , , |
|---|---|
| Formato: | Preprint |
| Publicado: |
2025
|
| Materias: | |
| Acceso en línea: | https://arxiv.org/abs/2505.11040 |
| Etiquetas: |
Agregar Etiqueta
Sin Etiquetas, Sea el primero en etiquetar este registro!
|
| _version_ | 1866908821250637824 |
|---|---|
| author | Li, Zhexiang Wang, Haoyu Bao, Yutong Woodruff, David |
| author_facet | Li, Zhexiang Wang, Haoyu Bao, Yutong Woodruff, David |
| contents | Efficient attention mechanisms enable long-context transformers but often miss globally important tokens, degrading modeling quality. We introduce a pre-scoring framework that assigns a query-independent global importance prior to keys before applying hierarchical approximate attention. Using clustering-based or leverage-style scoring, pre-scoring identifies structurally informative keys and restricts computation to this prioritized subset. Integrated with HyperAttention, pre-scoring substantially improves approximation quality on long-context language modeling: on ChatGLM with 131k-token contexts, perplexity decreases from 12.0 to 9.5 under a fixed interaction budget while retaining subquadratic efficiency. Clustering-based scoring consistently outperforms leverage-based selection under identical key budgets. Beyond language, replacing self-attention in Vision Transformers preserves most of the baseline accuracy, showing that the approach generalizes across modalities. We provide structural guarantees under a planted-subspace model, showing that clustering recovers the same heavy-key sets as leverage-based methods. Overall, pre-scoring improves the efficiency-accuracy trade-off of approximate attention by better prioritizing informative keys without sacrificing scalability. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2505_11040 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| spellingShingle | Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers Li, Zhexiang Wang, Haoyu Bao, Yutong Woodruff, David Machine Learning Efficient attention mechanisms enable long-context transformers but often miss globally important tokens, degrading modeling quality. We introduce a pre-scoring framework that assigns a query-independent global importance prior to keys before applying hierarchical approximate attention. Using clustering-based or leverage-style scoring, pre-scoring identifies structurally informative keys and restricts computation to this prioritized subset. Integrated with HyperAttention, pre-scoring substantially improves approximation quality on long-context language modeling: on ChatGLM with 131k-token contexts, perplexity decreases from 12.0 to 9.5 under a fixed interaction budget while retaining subquadratic efficiency. Clustering-based scoring consistently outperforms leverage-based selection under identical key budgets. Beyond language, replacing self-attention in Vision Transformers preserves most of the baseline accuracy, showing that the approach generalizes across modalities. We provide structural guarantees under a planted-subspace model, showing that clustering recovers the same heavy-key sets as leverage-based methods. Overall, pre-scoring improves the efficiency-accuracy trade-off of approximate attention by better prioritizing informative keys without sacrificing scalability. |
| title | Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2505.11040 |