Bibliographic Details
Main Authors: Chen, Qian; Wang, Wen; Zhang, Qinglin; Zheng, Siqi; Zhang, Shiliang; Deng, Chong; Yu, Hai; Liu, Jiaqing; Ma, Yukun; Zhang, Chong
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2406.11274
_version_ 1866929388028690432
author Chen, Qian
Wang, Wen
Zhang, Qinglin
Zheng, Siqi
Zhang, Shiliang
Deng, Chong
Yu, Hai
Liu, Jiaqing
Ma, Yukun
Zhang, Chong
contents The Transformer architecture has significantly advanced deep learning, particularly in natural language processing, by effectively managing long-range dependencies. However, as the demand for understanding complex relationships grows, refining the Transformer's architecture becomes critical. This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models by enabling direct attention between non-adjacent layers. This method improves the model's ability to capture dependencies between high-level abstract features and low-level details. By facilitating direct attention between these diverse feature levels, our approach overcomes the limitations of current Transformers, which often rely on suboptimal intra-layer attention. Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer, thus enhancing the diversity of multi-head attention without additional computational burden. Extensive experiments demonstrate that our enhanced Transformer model achieves superior performance in language modeling tasks, highlighting the effectiveness of our skip-layer attention mechanism.
format Preprint
id arxiv_https___arxiv_org_abs_2406_11274
institution arXiv
publishDate 2024
record_format arxiv
title Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers
topic Computation and Language
url https://arxiv.org/abs/2406.11274
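Illustrative sketch (not part of the record): the abstract describes queries in a given layer interacting with keys and values from both the current layer and one preceding layer. The NumPy sketch below shows one possible reading of that idea, under assumptions that the record does not confirm: some attention heads take their keys and values from the current layer's hidden states and the remaining heads take them from an earlier layer's hidden states, with a causal mask as is usual for language modeling. The function name `skip_layer_attention`, the parameter `n_skip_heads`, the shared projection matrices, and the head split itself are hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked positions hold -inf and become 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def skip_layer_attention(h_cur, h_prev, w_q, w_k, w_v, n_heads, n_skip_heads):
    """Hypothetical skip-layer attention over one sequence.

    h_cur, h_prev : (seq, d_model) hidden states of the current layer and of
                    one earlier layer (assumption: same width).
    w_q, w_k, w_v : (d_model, d_model) projection matrices (assumption: the
                    same K/V projections are applied to both layers).
    n_skip_heads  : number of heads whose keys/values come from h_prev.
    """
    seq, d_model = h_cur.shape
    d_head = d_model // n_heads

    def split(x):
        # (seq, d_model) -> (heads, seq, d_head)
        return x.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q = split(h_cur @ w_q)  # queries always come from the current layer
    k_cur, v_cur = split(h_cur @ w_k), split(h_cur @ w_v)
    k_prev, v_prev = split(h_prev @ w_k), split(h_prev @ w_v)

    # The first n_skip_heads heads read the earlier layer, the rest the
    # current layer. Both are projected in full here only for clarity.
    use_prev = (np.arange(n_heads) < n_skip_heads)[:, None, None]
    k = np.where(use_prev, k_prev, k_cur)
    v = np.where(use_prev, v_prev, v_cur)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # causal mask (assumption)

    out = softmax(scores) @ v  # (heads, seq, d_head)
    return out.transpose(1, 0, 2).reshape(seq, d_model)
```

With `n_skip_heads=0` the function reduces to ordinary causal multi-head self-attention on `h_cur`. Each head still attends over a single set of `seq` keys, so the attention step is the same size as in the standard case; whether this matches how the paper avoids extra computation is not stated in the record.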