Mixture of Experts

Yacine Bouaouni
5 min read · Dec 14, 2023


Mixture of Experts (MoE) is a machine learning technique that combines multiple expert models to improve performance. It has gained attention in recent years, and even more so with the release of Mistral AI's 8x7B MoE model. However, the technique is older than you might think, and it is certainly not limited to large language models or even to natural language processing.

The concept of MoE was introduced by Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton in the paper “Adaptive Mixtures of Local Experts,” published in 1991. The authors proposed the MoE model as a way to address complex, non-linear relationships in data by combining the strengths of multiple specialized models.

To summarize, MoE is a technique for combining a group of “expert” models, each specialized in a task, in order to handle complex data distributions. An expert can be as small as an MLP or as large as a full language model.

Mixture of Experts Architecture

The foundation of MoE lies in the concept of expert models and a gating mechanism.

Figure: a ‘very general’ Mixture of Experts architecture

  • Experts: Each expert is a distinct neural network trained to excel at a particular task or subset of the data. This specialization enables the experts to handle complex data distributions more effectively.

  • Gating mechanism (router): The second major component of an MoE is the gating mechanism, or router. It selects the appropriate expert(s) for each input, ensuring that only the experts best suited to the current input contribute to the final prediction. This dynamic selection is crucial for maximizing the overall performance of the MoE model. A minimal sketch of both components follows this list.
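
Below is a minimal, illustrative sketch of these two components in PyTorch. The class name SimpleMoE and all dimensions are invented for this example; it simply mixes the outputs of a few small MLP experts using softmax gate weights.

```python
import torch
import torch.nn as nn


class SimpleMoE(nn.Module):
    def __init__(self, dim_in, dim_out, num_experts=4, hidden=64):
        super().__init__()
        # Experts: independent small MLPs, each free to specialize.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim_in, hidden), nn.ReLU(), nn.Linear(hidden, dim_out))
            for _ in range(num_experts)
        ])
        # Gating mechanism (router): maps each input to a distribution over experts.
        self.gate = nn.Linear(dim_in, num_experts)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)               # (batch, num_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, num_experts, dim_out)
        # Final prediction: gate-weighted combination of the expert outputs.
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)


moe = SimpleMoE(dim_in=16, dim_out=2)
y = moe(torch.randn(8, 16))  # -> shape (8, 2)
```

Here the gate produces a dense softmax over all experts; production systems typically sparsify this, as discussed in the variants below.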

Training an MoE

Training an MoE consists of simultaneously optimizing the parameters (weights) of the expert models and of the router (gating mechanism). The objective is to obtain highly specialized experts and a router that consistently selects the most suitable expert for each input.

MoE models were traditionally trained using variants of the expectation-maximization (EM) algorithm.

  • E-step: Given the current parameters, the router assigns each training example a responsibility (probability) for every expert, reflecting how well that expert explains the example.
  • M-step: The expert and router weights are updated to maximize the expected likelihood under those responsibilities.

Other, more efficient training methods for MoE models have been proposed recently, such as online learning algorithms and reinforcement learning techniques. These approaches aim to streamline the training process and reduce the computational complexity of traditional EM-based methods.
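
As a rough illustration of the gradient-based alternative, the sketch below trains the hypothetical SimpleMoE module from the earlier example end to end with backpropagation on synthetic data; the data, optimizer, and hyperparameters are arbitrary.

```python
import torch
import torch.nn as nn

# SimpleMoE is the hypothetical module from the earlier sketch.
model = SimpleMoE(dim_in=16, dim_out=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(256, 16)               # synthetic inputs
targets = torch.randint(0, 2, (256,))  # synthetic class labels

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), targets)
    # A single backward pass updates both the experts and the gate,
    # because the softmax gating keeps the whole mixture differentiable.
    loss.backward()
    optimizer.step()
```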

Variants of MoE

Mixture of Experts (MoE) models can be categorized into different types based on variations in their architectures and training methods. Here are a few variations and extensions of the basic MoE model:

  1. Hard MoE: The original MoE formulation used a gating mechanism that makes a “hard” assignment of each input to a specific expert, so a single selected expert produces the prediction for the given input.
  2. Soft MoE: The gating mechanism assigns probabilities to multiple experts, and the final prediction is computed as the probability-weighted sum of the expert predictions.
  3. Sparse MoE: Only a small subset of experts is used for each input. This is achieved by designing the gating network to output sparse weights, e.g. by keeping only the top-k experts (see the sketch after this list).
  4. Adaptive MoE: The number of active experts adapts dynamically to the input.
  5. Hierarchical MoE: Experts are organized in multiple layers. For example, a first layer of low-level experts focuses on local features, while a second layer focuses on more global features.
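
The sketch below illustrates the sparse (top-k) routing idea from the list above: only the k highest-scoring experts are evaluated for each input, and their outputs are combined with renormalized gate weights. The SparseMoE class and its dimensions are invented for illustration.

```python
import torch
import torch.nn as nn


class SparseMoE(nn.Module):
    def __init__(self, dim_in, dim_out, num_experts=8, hidden=64, k=2):
        super().__init__()
        self.k = k
        # Pool of small MLP experts.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim_in, hidden), nn.ReLU(), nn.Linear(hidden, dim_out))
            for _ in range(num_experts)
        ])
        # Router scores every expert for every input.
        self.gate = nn.Linear(dim_in, num_experts)

    def forward(self, x):
        scores = self.gate(x)                                 # (batch, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)   # keep only k experts per input
        topk_weights = torch.softmax(topk_scores, dim=-1)     # renormalize over the chosen k
        dim_out = self.experts[0][-1].out_features
        out = x.new_zeros(x.size(0), dim_out)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                           # expert chosen in this slot
            w = topk_weights[:, slot].unsqueeze(-1)           # its gate weight
            for e in idx.unique().tolist():
                mask = idx == e
                # Only the selected experts are evaluated, and only on their inputs.
                out[mask] = out[mask] + w[mask] * self.experts[e](x[mask])
        return out


sparse_moe = SparseMoE(dim_in=16, dim_out=2, num_experts=8, k=2)
y = sparse_moe(torch.randn(8, 16))  # -> shape (8, 2)
```

With k much smaller than the number of experts, the compute per input stays roughly constant even as the total parameter count grows, which is what makes sparse MoEs attractive for very large models.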

Advantages of MoE

Mixture of experts is a powerful technique that can help improve overall performance. It offers several advantages, including:

  • Specialization and Robustness: MoE leverages the specialization of each expert to handle complex data distributions more effectively and adapt to diverse patterns. This specialization gives MoE models improved robustness and generalization, so they perform well on a wider range of input data.
  • Reduced Model Complexity: MoE divides the learning task among multiple experts, reducing the complexity of each expert model. This modularization allows for more efficient training and reduces the risk of overfitting. Additionally, the gating mechanism ensures that only relevant experts contribute to the final prediction, further simplifying the model structure.
  • Adaptive Decision-Making: The gating mechanism enables the model to handle diverse data points effectively, adapting to specific patterns and behaviors. Therefore, MoE models can provide more accurate and personalized predictions.
  • Scalability: MoE can be efficiently scaled to handle large datasets and complex tasks by adding more experts. The gating mechanism ensures that only a subset of experts is activated for each input, minimizing computational overhead.
  • Flexibility: MoE can incorporate different types of expert models and gating mechanisms, making it highly versatile and adaptable to various applications.

Challenges of MoE

Despite their promising advantages, MoE models also present several challenges that need to be addressed to fully realize their potential:

  • Data Segmentation and Expert Allocation: Effectively segmenting the training data into distinct expert domains can be challenging, especially for tasks whose patterns overlap across domains.
  • Expert Selection and Routing: The gating network is a critical component; it must dynamically adapt to the input and select the most relevant experts, while also maintaining load balance across the experts to prevent bottlenecks (a sketch of a common load-balancing loss follows this list).
  • Training Complexity: Training MoE models can be more complex than training traditional dense models due to the additional complexity of the gating network and the interactions between multiple experts. This complexity can make it challenging to optimize the parameters of the model and ensure that each expert is learning effectively. This complexity is intrinsic to the design of the model.
  • Parameter Efficiency: While MoE models can achieve improved efficiency by using fewer parameters, the additional complexity of the gating network and the interactions between experts can require a significant number of parameters.
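
To make the load-balancing point concrete, here is a minimal sketch of an auxiliary loss in the spirit of the one described in the Switch Transformer paper: it is small when both the fraction of tokens routed to each expert and the mean router probability per expert are close to uniform. The function name and the alpha coefficient are illustrative; such a term would be added to the main task loss.

```python
import torch


def load_balancing_loss(router_logits, expert_index, num_experts, alpha=0.01):
    """router_logits: (tokens, num_experts); expert_index: (tokens,) chosen expert per token."""
    probs = torch.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens actually routed to expert i.
    f = torch.bincount(expert_index, minlength=num_experts).float() / expert_index.numel()
    # P_i: mean router probability assigned to expert i.
    p = probs.mean(dim=0)
    # Minimized when both distributions are uniform across experts.
    return alpha * num_experts * torch.sum(f * p)


logits = torch.randn(32, 4)     # router scores for 32 tokens and 4 experts
chosen = logits.argmax(dim=-1)  # top-1 routing decision per token
aux = load_balancing_loss(logits, chosen, num_experts=4)
```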

In conclusion, Mixture of Experts (MoE) is a promising architectural technique that can be effectively applied to both natural language processing (NLP) and computer vision (CV) tasks. By decomposing large models into smaller, specialized sub-models, MoE enables efficient training and deployment while maintaining or even surpassing the performance of traditional dense architectures. It can be a good choice for large language models and large vision models, but it comes with its own challenges and drawbacks!

Notable Papers & Links

  • Application of MoE in NLP: Switch Transformer [Paper]
  • Application of MoE in Vision: [Paper]
  • Sparse MoE: [Paper]
