Mixtral 8x7B: A Breakthrough in Sparse Mixture of Experts Models

AI Club, IITM
Feb 1, 2024 · 3 min read


Introduction

In recent years, the field of natural language processing has seen remarkable advancements, with models like GPT-3.5 and Llama 2 setting new benchmarks. However, the limitations of these models, such as computational cost and latency, have led to the development of more efficient alternatives.

Mixtral 8x7B is an open-source answer to these problems. It beats Llama 2 70B (and even GPT-3.5) on several benchmarks while having fewer total weights, and it is far more computationally efficient because only about 13B of those parameters are active for any given token. And, if that wasn’t enough, it has a context length of 32k tokens!

Understanding Mixtral Architecture

Mixtral is a sparse Mixture-of-Experts (MoE) network: a decoder-only model whose feed-forward block selects from 8 distinct groups of parameters. For every token, a router network decides which two of these groups (the “experts”) will process it and combines their outputs with a weighted sum. This approach significantly increases the model’s parameter count while keeping cost and latency under control, since only a subset of the parameters is used for each token.

The model is trained on multilingual data with a context size of 32k tokens. Mixtral uses a model dimension of 4096 and 32 transformer layers.
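For reference, here are the key figures mentioned in this post collected into a small Python dictionary (the key names are illustrative, not an official configuration format):

```python
# Mixtral 8x7B hyperparameters as described in this post
# (key names are illustrative, not an official config format).
mixtral_8x7b = {
    "dim": 4096,               # model / embedding dimension
    "n_layers": 32,            # transformer blocks
    "n_experts": 8,            # experts per MoE layer
    "experts_per_token": 2,    # top-K experts picked by the router
    "context_length": 32_768,  # 32k-token context window
}
```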

Mixtral Architecture

The Router Network

For each layer and each token, Mixtral uses a “router” or “gating network” to pick the 2 experts.

Mixtral’s router consists of a linear layer that produces one logit per expert, followed by a softmax. The top K logits are kept (the paper sets K = 2) and the rest are effectively set to −∞, so after the softmax the unselected experts receive a weight of exactly 0.

Because those weights are 0, Mixtral skips the computation for the unselected experts entirely.
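To make this concrete, here is a minimal PyTorch-style sketch of a top-K router (the names `route` and `w_gate` are illustrative; this is not Mixtral’s actual code):

```python
import torch
import torch.nn.functional as F

def route(x: torch.Tensor, w_gate: torch.Tensor, k: int = 2):
    """Pick the top-k experts for each token and return their weights.

    x:      (n_tokens, dim)      token representations
    w_gate: (dim, n_experts)     the router's linear layer
    """
    logits = x @ w_gate                            # (n_tokens, n_experts)
    topk_logits, topk_idx = logits.topk(k, dim=-1)
    # Softmax over the selected logits only: the remaining experts
    # implicitly get a weight of 0 and are never evaluated.
    weights = F.softmax(topk_logits, dim=-1)       # (n_tokens, k)
    return weights, topk_idx
```

Taking the softmax over just the top-k logits is equivalent to setting the other logits to −∞ and taking a softmax over all 8.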

The Mixture of Experts Layer

Each of the 2 selected experts receives one token at a time. Their outputs are combined using the router’s weights, and the result is passed on to the next layer, where routing happens again (and maybe a different pair of experts is chosen!)

This lets the model make use of many different experts over the course of a single query (so those weights aren’t wasted): every token at every layer may be handled by a different set of K experts.

The MoE layer replaces the feed-forward (FFN) sub-block of the transformer block. Each expert Eᵢ(x) is a SwiGLU feed-forward block, and Mixtral sets K = 2, so each token is routed to two SwiGLU sub-blocks with different sets of weights.
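Written out (this follows the formulation in the Mixtral paper), the output of the MoE layer for a token x is a gate-weighted sum over the n = 8 experts:

```latex
y = \sum_{i=0}^{n-1} \operatorname{Softmax}\bigl(\operatorname{TopK}(x \cdot W_g)\bigr)_i \cdot \operatorname{SwiGLU}_i(x)
```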

Router and MoE

Of the n terms in this sum (indexed 0 through n−1), only 2 are non-zero, and the SwiGLU outputs for the remaining experts need not be calculated.

Sparse Mixture of Experts

The heart of Mixtral lies in its Sparse Mixture of Experts (MoE) layer. This layer’s output is determined by the weighted sum of outputs from expert networks, where the weights are provided by the gating network’s output. The sparsity of the gating vector allows efficient computation by avoiding the evaluation of experts with zero gates. Mixtral employs a sophisticated gating strategy, Softmax(TopK(x · Wg)), to choose the top-K experts per token.
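As a rough illustration of how such a layer can be put together, here is a simplified PyTorch-style sketch (all class, parameter, and dimension names, including `hidden_dim`, are assumptions; a real implementation batches tokens per expert rather than looping like this):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: a SwiGLU feed-forward block."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class SparseMoE(nn.Module):
    """Replaces the FFN sub-block: routes each token to k experts
    and returns the gate-weighted sum of their outputs."""
    def __init__(self, dim: int, hidden_dim: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [SwiGLUExpert(dim, hidden_dim) for _ in range(n_experts)]
        )
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.k = k

    def forward(self, x):                                 # x: (n_tokens, dim)
        logits = self.gate(x)                             # (n_tokens, n_experts)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_logits, dim=-1)          # (n_tokens, k)
        out = torch.zeros_like(x)
        for slot in range(self.k):                        # each of the k chosen experts
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e             # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the experts that the router actually selects are ever run, which is what keeps the per-token compute close to that of a much smaller dense model.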

Results and Comparisons

Mixtral has undergone evaluation against Llama 2 and GPT-3.5 on a diverse set of benchmarks, including commonsense reasoning, world knowledge, reading comprehension, math, and code generation.

Notably, Mixtral achieves these results with significantly fewer active parameters during inference, giving it a better cost-performance profile. In particular, the comparison with Llama 2 70B shows that Mixtral matches or outperforms it while utilizing only a fraction of the parameters.

Written by,
Sumedh Chatterjee, Ryan Jacob George, B Shabarish
Coordinators, AI Club, CFI
IIT Madras

