Staff View: Mixtral of Experts :: Library Catalog

Bibliographic Details
Main Authors:	Jiang, Albert Q., Sablayrolles, Alexandre, Roux, Antoine, Mensch, Arthur, Savary, Blanche, Bamford, Chris, Chaplot, Devendra Singh, Casas, Diego de las, Hanna, Emma Bou, Bressand, Florian, Lengyel, Gianna, Bour, Guillaume, Lample, Guillaume, Lavaud, Lélio Renard, Saulnier, Lucile, Lachaux, Marie-Anne, Stock, Pierre, Subramanian, Sandeep, Yang, Sophia, Antoniak, Szymon, Scao, Teven Le, Gervet, Théophile, Lavril, Thibaut, Wang, Thomas, Lacroix, Timothée, Sayed, William El
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2401.04088

_version_	1866914634446929920
author	Jiang, Albert Q.; Sablayrolles, Alexandre; Roux, Antoine; Mensch, Arthur; Savary, Blanche; Bamford, Chris; Chaplot, Devendra Singh; Casas, Diego de las; Hanna, Emma Bou; Bressand, Florian; Lengyel, Gianna; Bour, Guillaume; Lample, Guillaume; Lavaud, Lélio Renard; Saulnier, Lucile; Lachaux, Marie-Anne; Stock, Pierre; Subramanian, Sandeep; Yang, Sophia; Antoniak, Szymon; Scao, Teven Le; Gervet, Théophile; Lavril, Thibaut; Wang, Thomas; Lacroix, Timothée; Sayed, William El
author_facet	Jiang, Albert Q.; Sablayrolles, Alexandre; Roux, Antoine; Mensch, Arthur; Savary, Blanche; Bamford, Chris; Chaplot, Devendra Singh; Casas, Diego de las; Hanna, Emma Bou; Bressand, Florian; Lengyel, Gianna; Bour, Guillaume; Lample, Guillaume; Lavaud, Lélio Renard; Saulnier, Lucile; Lachaux, Marie-Anne; Stock, Pierre; Subramanian, Sandeep; Yang, Sophia; Antoniak, Szymon; Scao, Teven Le; Gervet, Théophile; Lavril, Thibaut; Wang, Thomas; Lacroix, Timothée; Sayed, William El
contents	We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
format	Preprint
id	arxiv_https___arxiv_org_abs_2401_04088
institution	arXiv
publishDate	2024
record_format	arxiv
title	Mixtral of Experts
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2401.04088
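
Note on the routing described in the contents field: the abstract says that, for every token at each layer, a router selects two of eight experts and combines their outputs. The sketch below is one way to picture that step in plain Python; it is not the authors' implementation. The names top2_route, moe_layer, and make_toy_expert are hypothetical, the experts are toy scaling functions rather than feedforward blocks, and weighting the two outputs by a softmax over the two selected router scores is an assumption made here, since the abstract does not say how the outputs are combined.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a short list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Pick the two highest-scoring experts and return (indices, gate weights).

    Assumption: the gate weights are a softmax over the two selected
    logits only, so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:2]
    gates = softmax([router_logits[i] for i in chosen])
    return chosen, gates


def make_toy_expert(scale):
    """Hypothetical stand-in for a feedforward block: scales its input."""
    return lambda x: [scale * xi for xi in x]


def moe_layer(x, router, experts):
    """Route one token vector x through its top-2 experts.

    router is a list of weight rows, one per expert; the router score of
    an expert is the dot product of its row with x.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    chosen, gates = top2_route(logits)
    out = [0.0] * len(x)
    for idx, gate in zip(chosen, gates):
        y = experts[idx](x)  # only the two selected experts are evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen


if __name__ == "__main__":
    experts = [make_toy_expert(i + 1) for i in range(8)]
    router = [[float(i), 0.0] for i in range(8)]
    output, used = moe_layer([1.0, 0.0], router, experts)
    print("experts used:", used)
    print("output:", output)
```

Only the two selected experts run for a given token, which is why the abstract distinguishes the parameters a token has access to (47B) from the parameters active during inference (13B).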
