_version_ 1866914634446929920
author Jiang, Albert Q.
Sablayrolles, Alexandre
Roux, Antoine
Mensch, Arthur
Savary, Blanche
Bamford, Chris
Chaplot, Devendra Singh
Casas, Diego de las
Hanna, Emma Bou
Bressand, Florian
Lengyel, Gianna
Bour, Guillaume
Lample, Guillaume
Lavaud, Lélio Renard
Saulnier, Lucile
Lachaux, Marie-Anne
Stock, Pierre
Subramanian, Sandeep
Yang, Sophia
Antoniak, Szymon
Scao, Teven Le
Gervet, Théophile
Lavril, Thibaut
Wang, Thomas
Lacroix, Timothée
Sayed, William El
author_facet Jiang, Albert Q.
Sablayrolles, Alexandre
Roux, Antoine
Mensch, Arthur
Savary, Blanche
Bamford, Chris
Chaplot, Devendra Singh
Casas, Diego de las
Hanna, Emma Bou
Bressand, Florian
Lengyel, Gianna
Bour, Guillaume
Lample, Guillaume
Lavaud, Lélio Renard
Saulnier, Lucile
Lachaux, Marie-Anne
Stock, Pierre
Subramanian, Sandeep
Yang, Sophia
Antoniak, Szymon
Scao, Teven Le
Gervet, Théophile
Lavril, Thibaut
Wang, Thomas
Lacroix, Timothée
Sayed, William El
contents We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
format Preprint
id arxiv_https___arxiv_org_abs_2401_04088
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Mixtral of Experts
Jiang, Albert Q.
Sablayrolles, Alexandre
Roux, Antoine
Mensch, Arthur
Savary, Blanche
Bamford, Chris
Chaplot, Devendra Singh
Casas, Diego de las
Hanna, Emma Bou
Bressand, Florian
Lengyel, Gianna
Bour, Guillaume
Lample, Guillaume
Lavaud, Lélio Renard
Saulnier, Lucile
Lachaux, Marie-Anne
Stock, Pierre
Subramanian, Sandeep
Yang, Sophia
Antoniak, Szymon
Scao, Teven Le
Gervet, Théophile
Lavril, Thibaut
Wang, Thomas
Lacroix, Timothée
Sayed, William El
Machine Learning
Computation and Language
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
title Mixtral of Experts
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2401.04088