Bibliographic Details
Main Authors: Chelba, Ciprian, Chen, Mia, Bapna, Ankur, Shazeer, Noam
Format: Preprint
Published: 2020
Subjects:
Online Access: https://arxiv.org/abs/2001.04589
contents Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S=s_1, \ldots, s_S$, we propose truncating the target-side window used for computing self-attention by making an $N$-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the $N$-gram masked self-attention model loses very little in BLEU score for $N$ values in the range $4, \ldots, 8$, depending on the task.
id arxiv_https___arxiv_org_abs_2001_04589
institution arXiv
record_format arxiv
title Faster Transformer Decoding: N-gram Masked Self-Attention
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2001.04589
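
The abstract describes restricting target-side self-attention to an N-gram window. A minimal sketch of such a mask follows, in plain Python. It is an illustration only, not the authors' implementation: the function names `ngram_causal_mask` and `render` are hypothetical, and the window convention (each position sees itself plus the N - 1 positions before it) is an assumption, since the abstract does not state the exact boundaries.

```python
def ngram_causal_mask(length, n):
    """Return a length x length boolean mask.

    mask[i][j] is True when target position i may attend to target
    position j. Assumed convention (not taken from the paper): position i
    sees itself and the n - 1 positions before it, so each row has at
    most n True entries.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Causal (j <= i) and inside the window (j > i - n).
    return [[i - n < j <= i for j in range(length)] for i in range(length)]


def render(mask):
    """Draw the mask as rows of 'x' (attend) and '.' (masked out)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    windowed = ngram_causal_mask(6, 3)
    full = ngram_causal_mask(6, 6)  # n >= length reduces to the usual causal mask
    print(render(windowed))
    print("attended pairs, windowed:", sum(sum(row) for row in windowed))
    print("attended pairs, full causal:", sum(sum(row) for row in full))
```

Under this convention the number of attended target positions per step is capped at N instead of growing with the length of the decoded prefix, which is the property the title's "faster decoding" refers to. Attention over the source sentence is not affected by this mask.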