Bibliographic Details
Main Authors: Zhang, Yu; Yang, Songlin; Zhu, Ruijie; Zhang, Yue; Cui, Leyang; Wang, Yiqiao; Wang, Bolun; Shi, Freda; Wang, Bailin; Bi, Wei; Zhou, Peng; Fu, Guohong
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2409.07146
author Zhang, Yu
Yang, Songlin
Zhu, Ruijie
Zhang, Yue
Cui, Leyang
Wang, Yiqiao
Wang, Bolun
Shi, Freda
Wang, Bailin
Bi, Wei
Zhou, Peng
Fu, Guohong
contents Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via $\operatorname{softmax}$, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the $\operatorname{softmax}$ operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
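The recurrence the abstract describes (a bounded set of memory slots, written with adaptive forgetting and read through softmax) can be illustrated with a minimal sketch. This is not the authors' implementation, which relies on GLA's hardware-efficient training algorithm. It assumes a slot update of the form K_t = Diag(alpha_t) K_{t-1} + (1 - alpha_t) ⊗ k_t, the same update for V_t, and a read o_t = V_t^T softmax(K_t q_t). The names `gsa_step` and `gsa_run`, the zero initial state, and the way the gates `alpha` are supplied are all hypothetical choices made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gsa_step(K, V, q, k, v, alpha):
    """One recurrent step over m memory slots (illustrative, assumed form).

    K, V  : m x d slot memories (lists of lists)
    q, k, v: length-d query, key, and value for the current token
    alpha : length-m forget gates, each expected to lie in (0, 1)
    """
    m = len(K)
    # Gated write: slot i keeps a fraction alpha[i] of its old content
    # and absorbs a fraction (1 - alpha[i]) of the new key / value.
    K = [[alpha[i] * K[i][j] + (1 - alpha[i]) * k[j] for j in range(len(k))]
         for i in range(m)]
    V = [[alpha[i] * V[i][j] + (1 - alpha[i]) * v[j] for j in range(len(v))]
         for i in range(m)]
    # Context-aware read: score each slot against the query, then
    # normalize the scores with softmax and mix the value slots.
    scores = [sum(K[i][j] * q[j] for j in range(len(q))) for i in range(m)]
    weights = softmax(scores)
    o = [sum(weights[i] * V[i][j] for i in range(m)) for j in range(len(v))]
    return K, V, o


def gsa_run(qs, ks, vs, alphas, m):
    """Run the step over a sequence; the state stays m x d throughout."""
    K = [[0.0] * len(ks[0]) for _ in range(m)]
    V = [[0.0] * len(vs[0]) for _ in range(m)]
    outputs = []
    for q, k, v, alpha in zip(qs, ks, vs, alphas):
        K, V, o = gsa_step(K, V, q, k, v, alpha)
        outputs.append(o)
    return outputs, K, V
```

In this sketch the recurrent state is two m x d matrices regardless of sequence length, which is the "compact recurrent state size" the abstract refers to. In a trained model the gates would be computed from the input rather than passed in by hand.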
format Preprint
id arxiv_https___arxiv_org_abs_2409_07146
institution arXiv
publishDate 2024
record_format arxiv
title Gated Slot Attention for Efficient Linear-Time Sequence Modeling
topic Computation and Language
url https://arxiv.org/abs/2409.07146