Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Opper, Mattia; Fernandez, Roland; Smolensky, Paul; Gao, Jianfeng
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2503.23174

_version_	1866918154932846592
author	Opper, Mattia; Fernandez, Roland; Smolensky, Paul; Gao, Jianfeng
author_facet	Opper, Mattia; Fernandez, Roland; Smolensky, Paul; Gao, Jianfeng
contents	Transformers struggle with length generalisation, displaying poor performance even on basic tasks. We test whether these limitations can be explained through two key failures of the self-attention mechanism. The first is the inability to fully remove irrelevant information. The second is tied to position: even if the dot product between a key and query is highly negative (i.e. the key is irrelevant), learned positional biases may unintentionally up-weight such information, which is dangerous when distances fall out of distribution. Put together, these two failure cases lead to compounding generalisation difficulties. We test whether they can be mitigated through the combination of (a) selective sparsity, which completely removes irrelevant keys from the attention softmax, and (b) contextualised relative distance, in which distance is only considered between the query and the keys that matter. We show that refactoring the attention mechanism with these two mitigations in place can substantially improve the generalisation capabilities of decoder-only transformers.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_23174
institution	arXiv
publishDate	2025
record_format	arxiv
title	TRA: Better Length Generalisation with Threshold Relative Attention
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2503.23174
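
Note on the abstract: the two mitigations named in the contents field (selective sparsity and contextualised relative distance) can be illustrated with a minimal sketch for a single query. This is not the authors' implementation. The function name, the fixed score threshold, the linear distance penalty, and the choice to return all-zero weights when no key survives are all assumptions made for illustration; the record does not specify how the paper parameterises any of them.

```python
import math


def threshold_relative_attention(query, keys, threshold=0.0, slope=0.5):
    """Attention weights for one query over the keys it is allowed to see.

    Illustrative sketch only. `threshold` and `slope` are hypothetical
    parameters, not values taken from the paper.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]

    # (a) Selective sparsity: keys scoring below the threshold are dropped
    # from the softmax entirely, so they receive exactly zero weight.
    kept = [j for j, s in enumerate(scores) if s >= threshold]
    weights = [0.0] * len(keys)
    if not kept:
        return weights  # assumption: no surviving key means no attention

    # (b) Contextualised relative distance: distance is counted only over
    # surviving keys, so the most recent surviving key has distance 0.
    logits = {}
    for distance, j in enumerate(reversed(kept)):
        logits[j] = scores[j] - slope * distance  # assumed linear penalty

    # Numerically stable softmax over the surviving keys only.
    top = max(logits.values())
    exps = {j: math.exp(v - top) for j, v in logits.items()}
    total = sum(exps.values())
    for j, e in exps.items():
        weights[j] = e / total
    return weights


relevant, distractor = [2.0, 0.0], [-3.0, 0.0]
short = threshold_relative_attention([1.0, 0.0], [relevant] + [distractor] * 2 + [relevant])
long = threshold_relative_attention([1.0, 0.0], [relevant] + [distractor] * 20 + [relevant])
print(short[0], long[0])
```

Under these assumptions the weight on the first relevant key is the same whether 2 or 20 irrelevant keys separate it from the query, because dropped keys no longer contribute to either the softmax or the distance count.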
