Staff View: Faster Transformer Decoding: N-gram Masked Self-Attention :: Library Catalog

Bibliographic Details
Main Authors:	Chelba, Ciprian; Chen, Mia; Bapna, Ankur; Shazeer, Noam
Format:	Preprint
Published:	2020
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2001.04589

_version_	1866915069484335104
author	Chelba, Ciprian; Chen, Mia; Bapna, Ankur; Shazeer, Noam
author_facet	Chelba, Ciprian; Chen, Mia; Bapna, Ankur; Shazeer, Noam
contents	Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S=s_1, \ldots, s_S$, we propose truncating the target-side window used for computing self-attention by making an $N$-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the $N$-gram masked self-attention model loses very little in BLEU score for $N$ values in the range $4, \ldots, 8$, depending on the task.
format	Preprint
id	arxiv_https___arxiv_org_abs_2001_04589
institution	arXiv
publishDate	2020
record_format	arxiv
spellingShingle	Faster Transformer Decoding: N-gram Masked Self-Attention Chelba, Ciprian Chen, Mia Bapna, Ankur Shazeer, Noam Machine Learning Computation and Language Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S=s_1, \ldots, s_S$, we propose truncating the target-side window used for computing self-attention by making an $N$-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the $N$-gram masked self-attention model loses very little in BLEU score for $N$ values in the range $4, \ldots, 8$, depending on the task.
title	Faster Transformer Decoding: N-gram Masked Self-Attention
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2001.04589
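
Illustrative note (not part of the catalog record): the abstract above describes truncating the target-side self-attention window under an $N$-gram assumption. A minimal sketch of what such a mask might look like follows. It assumes that an $N$-gram window lets each target position attend to itself and the $N-1$ positions before it; the paper's exact window convention may differ, and the function name `ngram_causal_mask` is hypothetical.

```python
def ngram_causal_mask(length, n):
    """Build a length x length boolean mask for target-side self-attention.

    mask[i][j] is True when query position i may attend to key position j.
    Assumption (not confirmed by the record): the window covers position i
    itself plus the n - 1 positions immediately before it.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Causal (j <= i) and truncated to the last n positions (j > i - n).
    return [[(i - n) < j <= i for j in range(length)] for i in range(length)]


if __name__ == "__main__":
    # Print the mask for a 5-token target with a window of 3 positions.
    for row in ngram_causal_mask(5, 3):
        print("".join("x" if allowed else "." for allowed in row))
```

When `n` is at least the target length, the mask reduces to the ordinary causal mask, so the window size acts as the single knob that controls how much target history each position can see.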

Similar Items