Staff View: From Attention to Activation: Unravelling the Enigmas of Large Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Kaul, Prannay; Ma, Chengcheng; Elezi, Ismail; Deng, Jiankang
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2410.17174

_version_	1866912082162614272
author	Kaul, Prannay; Ma, Chengcheng; Elezi, Ismail; Deng, Jiankang
author_facet	Kaul, Prannay; Ma, Chengcheng; Elezi, Ismail; Deng, Jiankang
contents	We study two strange phenomena in auto-regressive Transformers: (1) the dominance of the first token in attention heads; (2) the occurrence of large outlier activations in the hidden states. We find that popular large language models, such as Llama, attend maximally to the first token in 98% of attention heads, a behaviour we attribute to the softmax function. To mitigate this issue, we propose a reformulation of softmax to softmax-1. Furthermore, we identify adaptive optimisers, e.g. Adam, as the primary contributor to the large outlier activations and introduce OrthoAdam, a novel optimiser that utilises orthogonal matrices to transform gradients, to address this issue. Finally, not only do our methods prevent these phenomena from occurring, but they also enable Transformers to sustain their performance when quantised using basic algorithms, something that standard methods are unable to do. In summary, our methods reduce the attention proportion on the first token from 65% to 3.3%, the activation kurtosis in the hidden states from 1657 to 3.1, and the perplexity penalty under 4-bit weight quantisation from 3565 to 0.3.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_17174
institution	arXiv
publishDate	2024
record_format	arxiv
title	From Attention to Activation: Unravelling the Enigmas of Large Language Models
topic	Computation and Language
url	https://arxiv.org/abs/2410.17174

Similar Items