Bibliographic Details
Main authors: Opper, Mattia; Fernandez, Roland; Smolensky, Paul; Gao, Jianfeng
Format: Preprint
Published: 2025
Subjects: Machine Learning; Computation and Language
Online access: https://arxiv.org/abs/2503.23174
author Opper, Mattia
Fernandez, Roland
Smolensky, Paul
Gao, Jianfeng
contents Transformers struggle with length generalisation, performing poorly even on basic tasks. We test whether these limitations can be explained by two key failures of the self-attention mechanism. The first is the inability to fully remove irrelevant information. The second is tied to position: even if the dot product between a key and a query is highly negative (i.e. the key is irrelevant), learned positional biases may unintentionally up-weight such information, which is dangerous when distances fall out of distribution. Put together, these two failure cases lead to compounding generalisation difficulties. We test whether they can be mitigated through the combination of (a) selective sparsity, which completely removes irrelevant keys from the attention softmax, and (b) contextualised relative distance, where distance is only considered between the query and the keys that matter. We show that refactoring the attention mechanism with these two mitigations in place can substantially improve the generalisation capabilities of decoder-only transformers.
format Preprint
id arxiv_https___arxiv_org_abs_2503_23174
institution arXiv
publishDate 2025
record_format arxiv
title TRA: Better Length Generalisation with Threshold Relative Attention
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2503.23174
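
The two mitigations named in the abstract (selective sparsity and contextualised relative distance) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, the `threshold` and `slope` parameters, the linear distance penalty, and the choice to always keep each query's own position are assumptions made only for illustration.

```python
import numpy as np

def threshold_relative_attention(q, k, v, threshold=0.0, slope=0.1):
    """Illustrative single-head causal attention (hypothetical sketch).

    q, k: arrays of shape (T, d); v: array of shape (T, d_v).
    Returns (output, weights, dist).
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (T, T) scaled dot products

    # (a) Selective sparsity: keys whose score does not exceed the threshold
    # are removed from the softmax entirely, not merely down-weighted.
    causal = np.tril(np.ones((T, T), dtype=bool))
    keep = causal & (scores > threshold)
    keep |= np.eye(T, dtype=bool)                      # assumption: a query always keeps itself

    # (b) Contextualised relative distance: for query i and key j, count only
    # the kept keys that lie after j, so removed keys add no distance.
    kept = keep.astype(float)
    suffix = np.cumsum(kept[:, ::-1], axis=1)[:, ::-1]  # kept keys at positions >= j
    dist = suffix - kept                                # kept keys at positions > j

    # Assumption: a simple linear penalty on the contextualised distance.
    bias = -slope * dist

    logits = np.where(keep, scores + bias, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)                              # exp(-inf) = 0 for removed keys
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights, dist
```

In this sketch, a key that falls below the threshold receives exactly zero weight, and the distance between a query and an earlier key depends only on how many surviving keys sit between them, not on their raw positions.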