| Main Authors: | Chelba, Ciprian; Chen, Mia; Bapna, Ankur; Shazeer, Noam |
|---|---|
| Format: | Preprint |
| Published: | 2020 |
| Subjects: | Machine Learning; Computation and Language |
| Online Access: | https://arxiv.org/abs/2001.04589 |
| _version_ | 1866915069484335104 |
|---|---|
| author | Chelba, Ciprian; Chen, Mia; Bapna, Ankur; Shazeer, Noam |
| author_facet | Chelba, Ciprian; Chen, Mia; Bapna, Ankur; Shazeer, Noam |
| contents | Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S=s_1, \ldots, s_S$, we propose truncating the target-side window used for computing self-attention by making an $N$-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the $N$-gram masked self-attention model loses very little in BLEU score for $N$ values in the range $4, \ldots, 8$, depending on the task. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2001_04589 |
| institution | arXiv |
| publishDate | 2020 |
| record_format | arxiv |
| title | Faster Transformer Decoding: N-gram Masked Self-Attention |
| topic | Machine Learning; Computation and Language |
| url | https://arxiv.org/abs/2001.04589 |
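
The abstract describes truncating the target-side self-attention window under an $N$-gram assumption. A minimal sketch of what such a mask could look like follows, in plain Python. The function name `ngram_causal_mask` is hypothetical, and the window convention (the current position plus the `n - 1` positions before it) is an assumption made for illustration; the record does not state the exact convention used in the paper.

```python
def ngram_causal_mask(seq_len, n):
    """Build a seq_len x seq_len boolean attention mask (illustrative only).

    mask[i][j] is True when target position i may attend to target
    position j. Assumed convention: position i sees itself and the
    n - 1 positions immediately before it, and never sees the future.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [
        [i - n < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Print a small mask: 'x' marks an allowed attention link.
    for row in ngram_causal_mask(seq_len=6, n=3):
        print("".join("x" if allowed else "." for allowed in row))
```

With `n` at least as large as `seq_len`, this reduces to the ordinary causal (lower-triangular) mask, so the window size is the only thing the sketch changes relative to standard decoder self-attention.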