Staff View: Rewiring the Transformer with Depth-Wise LSTMs :: Library Catalog

Bibliographic Details
Main Authors:	Xu, Hongfei; Song, Yang; Liu, Qiuhui; van Genabith, Josef; Xiong, Deyi
Format:	Preprint
Published:	2020
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2007.06257

_version_	1866909160005697536
author	Xu, Hongfei; Song, Yang; Liu, Qiuhui; van Genabith, Josef; Xiong, Deyi
author_facet	Xu, Hongfei; Song, Yang; Liu, Qiuhui; van Genabith, Josef; Xiong, Deyi
contents	Stacking non-linear layers allows deep neural networks to model complicated functions, and including residual connections in Transformer layers is beneficial for convergence and performance. However, residual connections may make the model "forget" distant layers and fail to fuse information from previous layers effectively. Selectively managing the representation aggregation of Transformer layers may lead to better performance. In this paper, we present a Transformer with depth-wise LSTMs connecting cascading Transformer layers and sub-layers. We show that layer normalization and feed-forward computation within a Transformer layer can be absorbed into depth-wise LSTMs connecting pure Transformer attention layers. Our experiments with the 6-layer Transformer show significant BLEU improvements in both WMT 14 English-German / French tasks and the OPUS-100 many-to-many multilingual NMT task, and our deep Transformer experiments demonstrate the effectiveness of depth-wise LSTM on the convergence and performance of deep Transformers.
format	Preprint
id	arxiv_https___arxiv_org_abs_2007_06257
institution	arXiv
publishDate	2020
record_format	arxiv
title	Rewiring the Transformer with Depth-Wise LSTMs
topic	Computation and Language
url	https://arxiv.org/abs/2007.06257

Similar Items