Explaining the Mixture-of-Experts (MoE) Architecture in Simple Terms

Gregory Z
7 min read · Jan 9, 2024

You may have heard about the Mixture of Experts (MoE) model architecture, particularly in reference to Mixtral 8x7B. A common misconception about MoE is that it involves several “experts” (a few of which are used simultaneously), each with dedicated competencies or trained in a specific knowledge domain. For example, one might think that for code generation, the router sends the request to a single expert that independently handles all code generation tasks, or that another expert, proficient in math, manages all math-related inferences.
However, the reality of how MoE works is quite different. Let’s delve into this and I’ll explain it in simpler terms.

Mixture-of-Experts (MoE) 101

Mixture of Experts (MoE) models are a class of transformer models. Unlike traditional dense models, MoEs use a “sparse” approach in which only a subset of the model’s components (the “experts”) is activated for each input. This setup allows for more efficient pretraining and faster inference while supporting a much larger total parameter count.

In MoEs, each expert is a neural network, typically a feed-forward network (FFN), and a gate network or router determines which tokens are sent to which expert. The experts specialize in different aspects of the input data, enabling the model to handle a wider range of tasks more efficiently.
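To make this concrete, here is a minimal sketch of a sparse MoE layer in PyTorch-style Python. It is not the implementation of any particular model; the sizes, the top-k value of 2, and the naive per-expert loop are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """One 'expert': an ordinary feed-forward network (FFN)."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)


class MoELayer(nn.Module):
    """Simplified sparse MoE layer: each token is routed to its top-k experts."""
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)   # the "gate" network
        self.top_k = top_k

    def forward(self, x):                         # x: (n_tokens, d_model)
        logits = self.router(x)                   # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)      # normalize over the selected experts only
        out = torch.zeros_like(x)
        for k in range(self.top_k):               # for each routing slot...
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e             # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

The point to notice is that every token runs through only its top-k experts plus the tiny router, which is where the savings over a single large dense FFN come from.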

An analogy for understanding MoEs is to consider a hospital with various specialized departments (experts). Each patient (input token) is directed to the appropriate department by the reception (router) based on their symptoms (data characteristics). Just as not all departments are involved in treating every patient, not all experts in an MoE are used for every input.

Comparatively, in a standard language model architecture (dense model), every part of the model is used for every input, similar to a general practitioner attempting to treat all aspects of every patient’s needs.

In summary, MoEs offer a more efficient and potentially faster approach to model training and inference by using a specialized subset of the model for each input, but they come with their own set of challenges, especially in terms of memory requirements and fine-tuning. Let’s set those challenges aside for now and take a closer look at the term “expert”.

What does the term "Expert" mean?

In the context of Mixture of Experts (MoE) models, the term “expert” generally refers to a component of the model that specializes in a specific type of task or pattern within the data, rather than a particular topic like finance, business, or IT. These experts are more about handling different kinds of computational patterns or tasks (like code generation, reasoning, summarization) rather than focusing on domain-specific knowledge.

Each expert in an MoE model is essentially a smaller neural network trained to be particularly effective at certain kinds of operations or patterns in the data. The model learns to route different parts of the input data to the most relevant expert. For instance, one expert might be more effective at dealing with numerical data, while another might specialize in natural language processing tasks.

The specialization of experts is largely determined by the data they are trained on and the structure of the model itself. It’s more about the nature of the computational task (e.g., recognizing certain patterns, dealing with certain types of input) than about domain-specific knowledge.

It is theoretically possible to design MoE models in which different experts are trained on different knowledge domains (like specific topics), but that is a design choice and training approach rather than an inherent feature of the MoE architecture. In practice, MoE models tend to be used for their computational efficiency and flexibility in handling a variety of tasks within a large-scale model.

How are Mixture-Of-Expert models trained?

The training of a Mixture of Experts (MoE) model, where each expert becomes better at a specific type of inference, is a nuanced process. It’s not as straightforward as directly training each expert on a specific task or domain. Instead, the specialization of experts in an MoE model typically emerges naturally over the course of training due to a combination of the model’s architecture and the data it’s exposed to. Here’s an overview of how this happens:

  1. Diverse Data: The model is trained on a diverse dataset that encompasses a wide range of tasks or domains. This diversity is crucial as it exposes the experts to different types of data and problems.
  2. Routing Mechanism: MoE models have a routing mechanism (often a trainable gate) that decides which expert handles which part of the input data. During training, this gate learns to route different types of data to different experts based on their emerging specialties.
  3. Expert Specialization: As training progresses, each expert gradually becomes more adept at handling certain types of data or tasks. This specialization occurs because experts receive and learn from the types of data they are most effective at processing, as directed by the routing mechanism.
  4. Feedback Loop: There’s a feedback loop in play — as an expert gets better at a certain type of data, the router becomes more likely to send similar data to that expert in the future. This reinforces the specialization of each expert.
  5. Regularization and Loss Function: The training process typically adds an auxiliary load-balancing loss (and other regularization) that penalizes the router when it sends most tokens to the same few experts, keeping specialization distributed across all of them (see the sketch after this list).
  6. Capacity Constraints: By imposing capacity constraints on experts, the model ensures that no single expert becomes overloaded with tasks, promoting a balanced distribution of learning across all experts.
  7. Fine-tuning and Adjustments: The model might go through fine-tuning phases where certain types of tasks are emphasized, further refining the expertise of each component.
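As a rough illustration of points 5 and 6, here is a sketch of the kind of auxiliary load-balancing loss popularized by the Switch Transformer; the exact formulation varies between models, and the function below is an illustrative assumption rather than the loss used by any specific MoE.

```python
import torch
import torch.nn.functional as F


def load_balancing_loss(router_logits, n_experts):
    """Auxiliary loss in the spirit of the Switch Transformer:
    it grows when a few experts receive most of the tokens.

    router_logits: (n_tokens, n_experts) raw gate scores from one MoE layer.
    """
    probs = F.softmax(router_logits, dim=-1)        # router probabilities per token
    top1 = probs.argmax(dim=-1)                     # hard assignment per token
    # f_i: fraction of tokens actually routed to expert i
    f = torch.bincount(top1, minlength=n_experts).float() / router_logits.shape[0]
    # P_i: average router probability given to expert i
    P = probs.mean(dim=0)
    # minimized (value 1.0) when both f and P are uniform across experts
    return n_experts * torch.sum(f * P)
```

This term is added to the main language-modeling loss with a small coefficient. A perfectly balanced router yields a value of 1.0, while a router that collapses and sends every token to the same expert yields a value close to n_experts.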

Please bear in mind that while experts specialize, they typically don’t become exclusively dedicated to one narrow type of task. Instead, their “expertise” should be understood as a relative increase in efficiency or effectiveness for certain types of data or tasks, rather than a strict domain limitation.

Okay, how can all this knowledge be applied to the recently released Mixtral 8x7B?

Mixtral 8x7B

Mixtral 8x7B, introduced in the paper “Mixtral of Experts”, is a Sparse Mixture of Experts (SMoE) language model with distinctive features:

  • It shares the same architecture as Mistral 7B, except that the feed-forward block in each layer is replaced by 8 feed-forward blocks (experts); the attention sub-layers remain as in Mistral 7B.
  • For every token, at every layer, a router network selects two of the eight experts to process the current hidden state and combines their outputs as a weighted sum (see the formula below). The selected experts can differ at each layer and timestep, so each token has access to about 47B parameters but only about 13B of them are actively used during inference.
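In the paper’s notation, the output y of an MoE layer for an input token x is, roughly, a sum over all experts weighted by the router’s gating values, where the softmax is taken over only the two largest router logits, so every other expert receives zero weight and is never evaluated:

y = \sum_{i=0}^{n-1} \mathrm{Softmax}\big(\mathrm{Top2}(x \cdot W_g)\big)_i \cdot \mathrm{SwiGLU}_i(x)

Here W_g is the router’s weight matrix, n = 8 is the number of experts, and SwiGLU_i is the i-th expert’s feed-forward block.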

Visualizing the process of inference and token processing in Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) model, involves understanding how tokens are routed through various experts. Here’s a step-by-step illustration:

  1. Input Token Reception: Imagine a stream of input tokens entering the model. Each token can represent a piece of text, code, or other data types.
  2. Routing Decision: Each token is evaluated by a router network. The router decides which two of the eight available experts (feedforward blocks) should process this particular token. This decision is based on the characteristics of the token and the specialized functions of the experts.
  3. Expert Processing: The chosen experts independently process the token. Each expert applies its own neural network layers, which are specialized for certain types of data or tasks. For example, one expert might be better at processing natural language, while another might be more effective with numerical data.
  4. Combining Outputs: After processing, the outputs from the two selected experts are combined into a single vector as a weighted sum, with the weights supplied by the router’s gating values (a small numeric sketch follows this list).
  5. Continuing Through Layers: This process repeats for each layer of the model. At every layer, the router network can choose a different pair of experts based on the current state of the token.
  6. Final Output Generation: After the token has passed through all the layers, the final output is generated. This could be a prediction, a piece of generated text, or some other form of processed data.
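To make steps 2 through 4 concrete, here is a tiny numeric sketch in plain NumPy with made-up router logits for a single token; the numbers and dimensions are purely illustrative.

```python
import numpy as np

# Hypothetical router logits for ONE token over 8 experts (step 2).
logits = np.array([0.3, 2.1, -0.5, 0.9, 1.8, -1.2, 0.1, 0.4])

# Pick the two highest-scoring experts.
top2 = np.argsort(logits)[-2:][::-1]          # -> experts 1 and 4
top2_logits = logits[top2]

# Softmax over just the two selected logits gives the mixing weights.
w = np.exp(top2_logits) / np.exp(top2_logits).sum()   # ~ [0.57, 0.43]

# Step 3: each chosen expert (an FFN) processes the token independently.
# Here their outputs are faked with random vectors of the model dimension.
d_model = 4
expert_outputs = {e: np.random.randn(d_model) for e in top2}

# Step 4: the layer output is the weighted sum of the two expert outputs.
y = sum(w_i * expert_outputs[e] for w_i, e in zip(w, top2))
print("experts:", top2, "weights:", np.round(w, 2), "output:", y)
```

At the next layer the router produces a fresh set of logits, so the same token may well be blended by a different pair of experts (step 5).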

Sparse Activation: It’s important to note that at any given time, only a fraction of the total parameters (experts) are actively being used. This is what makes the model ‘sparse’ and efficient, especially during inference.
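As a rough sanity check on the ~47B total / ~13B active figures, here is a back-of-the-envelope parameter count. The dimensions are taken from the publicly released Mistral-7B configuration (hidden size 4096, FFN size 14336, 32 layers, 32k vocabulary, grouped-query attention with 8 KV heads); treat the result as an approximation, not an official accounting.

```python
# Approximate parameter count for Mixtral 8x7B (ignoring norms and the tiny router).
d_model, d_ff, n_layers, vocab = 4096, 14336, 32, 32000
n_experts, active_experts = 8, 2

# One SwiGLU expert FFN has three weight matrices per layer.
ffn_per_expert = 3 * d_model * d_ff * n_layers                    # ~5.6B

# Attention (q/o are 4096x4096, k/v are 4096x1024 with 8 KV heads of dim 128)
# and the embeddings are shared by all experts, so they count only once.
attn = (2 * d_model * d_model + 2 * d_model * 1024) * n_layers    # ~1.3B
embeddings = 2 * vocab * d_model                                  # ~0.3B
shared = attn + embeddings

total  = shared + n_experts * ffn_per_expert         # ~46.7B parameters in memory
active = shared + active_experts * ffn_per_expert    # ~12.9B used per token

print(f"total ~ {total/1e9:.1f}B, active per token ~ {active/1e9:.1f}B")
```

The total is roughly 47B rather than 8 × 7B = 56B because only the FFN experts are replicated eight times; the attention sub-layers and embeddings are shared, and since only two experts’ FFNs are touched per token, roughly 13B parameters are active at a time. Note that all ~47B parameters still have to fit in memory, which is the memory cost mentioned earlier.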

In a real-world scenario, this process is parallelized across many tokens and multiple GPUs, especially in a cloud computing environment.

“MoE” to come…

Are there any other large language models leveraging the Mixture of Experts (MoE) approach available today? Yes, there are quite a few. You can find them on the LLM Explorer, under the curated list of “Mixture-Of-Experts”.

Here, you can review the model parameters, evaluate benchmarks, and compare their performance against reference models such as GPT-4.
