Bibliographic Details
Main Authors: Zhou, Yanqi, Du, Nan, Huang, Yanping, Peng, Daiyi, Lan, Chang, Huang, Da, Shakeri, Siamak, So, David, Dai, Andrew, Lu, Yifeng, Chen, Zhifeng, Le, Quoc, Cui, Claire, Laudon, James, Dean, Jeff
Format: Preprint
Published: 2023
Subjects: Machine Learning, Computation and Language
Online Access: https://arxiv.org/abs/2306.00008
_version_ 1866911851946704896
author Zhou, Yanqi
Du, Nan
Huang, Yanping
Peng, Daiyi
Lan, Chang
Huang, Da
Shakeri, Siamak
So, David
Dai, Andrew
Lu, Yifeng
Chen, Zhifeng
Le, Quoc
Cui, Claire
Laudon, James
Dean, Jeff
contents Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers, in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer largely outperforms a Primer dense model derived with NAS with similar computation per token on few-shot evaluations.
format Preprint
id arxiv_https___arxiv_org_abs_2306_00008
institution arXiv
publishDate 2023
record_format arxiv
title Brainformers: Trading Simplicity for Efficiency
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2306.00008