Mixtral 8x7B explained

Kevin François
Published in neoxia
Mar 21, 2024 · 7 min read

Introduction

The emergence of Mixtral 8x7B has brought a distinct category of transformer models back into the spotlight within the AI community: the Mixture of Experts, often abbreviated as MoE. The concept itself isn’t new: its roots go back to Jacobs et al. in 1991, sparse MoE layers were introduced by Shazeer et al. in 2017, and their application to Transformer language models began to gain traction in 2020 with Lepikhin et al. However, it is the recent introduction of Mixtral of Experts (Mixtral 8x7B) that has sparked renewed interest in this domain. MoE architectures allow a model to be pretrained with significantly less compute, making it possible to scale up the model or the dataset within the same computational budget. This article explains the MoE framework and explores what Mixtral 8x7B adds to it.

Main idea

The core concept of the Mixtral model is to integrate a Sparse Mixture of Experts (MoE) layer into the Transformer architecture. A standard Transformer is a stack of N identical blocks, each consisting of a multi-head self-attention layer, which computes attention weights between all pairs of input positions, followed by a feedforward layer; a final softmax over the vocabulary then produces the output distribution.

Transformer structure and multi-head attention cell [8]
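To make this structure concrete, here is a minimal sketch of a single Transformer block in PyTorch (hyperparameters and layer choices are illustrative, not Mixtral's actual configuration): a multi-head self-attention sub-layer followed by the feedforward sub-layer that the MoE variant later replaces.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal pre-norm Transformer block: self-attention then feedforward.
    Hyperparameters are illustrative, not Mixtral's actual configuration."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(          # the sub-layer that MoE will replace
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        # self-attention + residual (causal masking omitted for brevity)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.norm2(x))    # feedforward + residual
        return x

The final softmax over the vocabulary is applied once, at the model's output, not inside each block.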

In Mixtral 8x7B, this traditional feedforward layer is substituted with a Mixture of Experts layer. Specifically, the single feedforward layer is replaced by N individual feedforward layers (the experts) with a router positioned in front of them. The number of experts is predetermined and configurable before training. In Mixtral 8x7B, each attention layer is followed by 8 candidate feedforward layers, and for every token the router selects 2 of them, based on the token's representation, to compute the output.

Illustration of a sparse MoE block [5]
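A minimal sketch of such a sparse MoE layer in PyTorch (again with illustrative hyperparameters and a deliberately naive routing loop): a linear router scores all experts, the top 2 are kept per token, and their outputs are mixed with softmax-normalized weights.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Sketch of a sparse MoE layer: a router picks top_k of n_experts
    feedforward networks for each token and mixes their outputs."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)  # the learned routing matrix W
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model); each token is routed independently
        scores = self.router(x)                                # (n_tokens, n_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)  # keep the 2 best experts per token
        weights = F.softmax(weights, dim=-1)                   # normalize their mixing weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):                            # naive loops, for clarity only
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                          # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

A real implementation performs this routing with batched gather/scatter operations rather than Python loops, but the computation per token is the same: only 2 of the 8 experts are ever evaluated.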

Original mechanism

The concept of Mixture of Experts (MoE) was initially introduced in 1991 by Jacobs et al. [6], who proposed the first dense MoE layer. The fundamental principle behind MoE is to model an output y as a weighted combination of multiple “experts” E, with the weight of each expert controlled by a “gating network” G.

An expert within this framework can take various forms, but it’s commonly implemented as a multi-layered neural network. As for the gating function, the traditional approach typically employs a softmax function.
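Written out (in the standard dense form, consistent with the summary at the end of this article), the layer computes:

y = \sum_{i=1}^{n} G(x)_i \, E_i(x), \qquad G(x) = \mathrm{softmax}(Wx)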

where W is a learnable matrix that assigns training examples to experts. When training MoE models, the learning objective is therefore two-fold:

  1. The experts learn to turn the input they are given into the best possible output (i.e., a prediction).
  2. The gating network will learn to “route” the right training examples to the right experts, by jointly learning the routing matrix W.
Original architecture of Adaptive Mixtures of Local Experts [6]

These foundational works paved the way for the exploration of mixture of experts in Natural Language Processing (NLP). Notably, Shazeer et al. (2017) extended the concept with a sparse top-k approach. Top-k gating computes the expert output from only the top k experts (as ranked by the gate) for each training example, ignoring all other experts. The primary motivation is computational efficiency: for instance, with 20 experts and top-k gating with k = 5, the compute spent in the expert layers is cut by a factor of 4 compared with evaluating every expert.

Schematic of top-k routing [1]

More precisely, the top experts are chosen via a routing step that multiplies each input by the learned routing matrix W, producing router scores h. These scores are then normalized, and the k highest-scoring experts are selected.
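As a rough sketch (PyTorch, function name hypothetical), this routing step amounts to a few tensor operations; note that whether normalization happens before or after the top-k selection (as in the Mixtral-style sketch earlier) varies between implementations.

import torch
import torch.nn.functional as F

def top_k_route(x: torch.Tensor, W: torch.Tensor, k: int = 2):
    """Top-k routing sketch.
    x: (n_tokens, d_model) token representations.
    W: (d_model, n_experts) learned routing matrix.
    Returns the indices of the k selected experts per token and their weights."""
    h = x @ W                                    # router scores, one per expert
    p = F.softmax(h, dim=-1)                     # normalize the scores
    weights, experts = torch.topk(p, k, dim=-1)  # keep the k highest-scoring experts
    weights = weights / weights.sum(dim=-1, keepdim=True)  # re-normalize over the chosen k
    return experts, weights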

Mixtral application

In the Mixtral model, the Mixture of Experts (MoE) layer is seamlessly integrated into the Transformer architecture on a per-token basis, replacing the conventional feed-forward (FFN) sub-block within the transformer block. Google showcased the initial application of MoE within transformer models through their pioneering GShard framework.

GShard is a set of lightweight annotation APIs and an extension to the XLA compiler designed to efficiently train very large deep learning models, tackling the challenge of scaling training across many accelerators while maintaining efficiency and minimizing communication overhead. By automatically sharding model parameters across accelerators, GShard supports large-scale workloads such as multilingual machine translation. This approach reduces communication overhead and maximizes parallelism, leading to notable gains in training efficiency for very large models. Built on the Sparsely-Gated Mixture-of-Experts method, GShard replaces every other FFN layer with an MoE layer employing top-2 gating, in both the encoder and the decoder.

Illustration of scaling of Transformer Encoder with MoE Layers [4]

Similarly, Mixtral 8x7B employs a comparable approach, featuring 8 fully connected feedforward networks per layer with a SwiGLU activation function (a gated feedforward variant whose gate uses the Swish/SiLU function) alongside a top-2 gating mechanism.
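As a sketch, each expert can be thought of as a SwiGLU feedforward network of the following shape (PyTorch; the default sizes below follow the published Mixtral 8x7B configuration, but the real implementation differs in other details).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """A SwiGLU feedforward expert: a gated FFN whose gate uses SiLU (Swish)."""

    def __init__(self, d_model: int = 4096, d_ff: int = 14336):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))  # silu(x·W1) gated by x·W3, projected back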

Benchmark comparison

Mixtral was compared to Llama 2 on the following benchmark categories:

Commonsense Reasoning (0-shot): Commonsense reasoning benchmarks, such as the CommonsenseQA dataset, evaluate a model’s ability to answer questions that require common sense knowledge. In a 0-shot setting, the model is not provided with any training examples and is expected to reason using only its pre-existing knowledge. This tests the model’s ability to apply general knowledge to novel situations.

World Knowledge (5-shot): World knowledge benchmarks, such as NaturalQuestions and TriviaQA, assess a model’s ability to answer questions that require factual knowledge beyond the provided context. In a 5-shot setting, the model is shown 5 worked examples of the task before being tested on new examples, allowing it to pick up the task format from a small amount of data (see the prompt sketch after this list).

Reading Comprehension (0-shot): Reading comprehension benchmarks, such as BoolQ and QuAC, evaluate a model’s ability to comprehend and answer questions about a given passage. In a 0-shot setting, the model is not provided with any examples from the specific benchmark and is tested on its ability to generalize to new passages and questions.

Math: Math benchmarks test a model’s ability to solve mathematical problems, ranging from simple arithmetic to complex reasoning. These benchmarks evaluate the model’s understanding of mathematical concepts and its problem-solving capabilities.

Code: Code benchmarks assess a model’s ability to understand and generate code. This can include tasks such as code completion, translation between programming languages, or code summarization. The evaluation measures how well the model can work with programming languages and understand the logic behind different code snippets.

Popular Aggregated Results: Popular aggregated results refer to comprehensive evaluations that combine performance on multiple benchmarks or tasks. These results provide an overall measure of a model’s capabilities across different areas of AI, such as language understanding, reasoning, and problem-solving.
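To make the 0-shot / 5-shot distinction concrete, here is a hypothetical sketch of how such prompts are typically assembled (the questions and answers are made up for illustration).

# 0-shot: only the test question. 5-shot: five solved examples are prepended.
exemplars = [
    ("Which planet is known as the Red Planet?", "Mars"),
    ("What is 7 * 8?", "56"),
    # ... three more solved examples would complete a 5-shot prompt
]
test_question = "What is the capital of France?"

zero_shot_prompt = f"Question: {test_question}\nAnswer:"
five_shot_prompt = "\n\n".join(
    f"Question: {q}\nAnswer: {a}" for q, a in exemplars
) + f"\n\nQuestion: {test_question}\nAnswer:"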

Performance of Mixtral and different Llama models on a wide range of benchmarks [2]

Mixtral surpasses Llama 2 70B across most metrics. In particular, Mixtral displays superior performance on code and mathematics benchmarks.

In summary

  • MoE models fundamentally represent the output y as a weighted combination of experts, where each expert functions as a small neural network, and the weighting is determined by G(x) = softmax(Wx), with W being a trainable matrix.
  • The adoption of top-k gating, where only the top k experts contribute to the output, revolutionized expert modeling by significantly reducing computational demands.
  • Expert Choice Routing introduced a paradigm shift by enabling experts to select their training examples instead of vice versa, leading to enhanced training stability without the need for additional auxiliary losses.
  • Mixtral 8x7B uses 8 experts per layer with top-2 routing, an architectural design similar to Google’s GShard.

References

  1. FEDUS, William, DEAN, Jeff, and ZOPH, Barret. A review of sparse expert models in deep learning. arXiv preprint arXiv:2209.01667, 2022.
  2. JIANG, Albert Q., SABLAYROLLES, Alexandre, ROUX, Antoine, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  3. SHAZEER, Noam, MIRHOSEINI, Azalia, MAZIARZ, Krzysztof, et al. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  4. LEPIKHIN, Dmitry, LEE, HyoukJoong, XU, Yuanzhong, et al. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
  5. FEDUS, William, ZOPH, Barret, and SHAZEER, Noam. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 2022, vol. 23, no. 1, p. 5232–5270.
  6. JACOBS, Robert A., JORDAN, Michael I., NOWLAN, Steven J., et al. Adaptive mixtures of local experts. Neural Computation, 1991, vol. 3, no. 1, p. 79–87.
  7. ZHOU, Yanqi, LEI, Tao, LIU, Hanxiao, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 2022, vol. 35, p. 7103–7114.
  8. WEI, Xiaokai, GONUGONDLA, Sujan, AHMAD, Wasi, et al. Greener yet powerful: Taming large code generation models with quantization. arXiv preprint arXiv:2303.05378, 2023.
