Snowflake Arctic Cookbook Series: Exploring Mixture of Experts (MoE)

On April 24, Snowflake Arctic was released to the world with a key goal in mind: to be truly open. As part of that initiative, the Snowflake AI Research team will deliver a series of cookbooks describing how to pretrain, fine-tune, evaluate, and serve large-scale MoE models such as Arctic. We will share our journey of training the Arctic model, along with our findings related to sourcing and composing pre-training data, designing MoE architectures, co-designing models with training and inference systems in mind, and methods for fine-tuning and evaluating the models.

You can always find the full series in our Snowflake Arctic cookbook catalog.

To kick off this series, this blog dives into LLM architecture, one of the earliest design decisions when building a model.

In the realm of large language models (LLMs), dense transformer architectures (shown on the left of Figure 1) have traditionally been the go-to choice for researchers and practitioners. An important reason for this choice is the ability to scale the model size to improve model quality. However, increasing the model size beyond a certain point becomes prohibitively costly because every parameter participates in computing every token. In other words, the total amount of compute required to train the model increases linearly with the size of the model, making it challenging to scale the model size without significant investment in compute and training time.

Figure 1: The dense transformer architecture (left) and an illustration of the MoE architecture (right).

The MoE architecture promises improved model quality without increasing the compute cost of inference and training. The original MoE idea was proposed here, and there have been subsequent rounds of improvements to the architecture and techniques, e.g., DeepSpeed-MoE, Switch Transformer, and GLaM. The MoE architecture (shown on the right of Figure 1) consists of multiple parallel feed-forward networks (FFNs) known as experts. The choice of experts is coordinated by a gating function, which determines the routing behavior of each token. Each token chooses the top-k (typically k=1 or k=2) experts for its computation. Therefore, only part of the entire network is activated for each token, which essentially decouples the total model size from the total computation.
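To make the routing behavior concrete, here is a minimal top-k MoE layer sketched in PyTorch. It is illustrative only: the hidden sizes, the simple softmax weighting over the selected experts, and the per-expert loop are assumptions made for readability, not the exact formulation used by Arctic or the papers above (production implementations add load-balancing losses, capacity limits, and fused kernels).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative mixture-of-experts FFN with top-k token routing."""

    def __init__(self, d_model=512, d_ffn=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Gating function: one logit per expert for each token.
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        # Parallel feed-forward networks ("experts").
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: [num_tokens, d_model]
        logits = self.gate(x)                      # [num_tokens, num_experts]
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)     # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, so per-token compute
        # scales with top_k rather than with the total number of experts.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)                      # 16 tokens with hidden size 512
print(TopKMoELayer()(tokens).shape)                # torch.Size([16, 512])
```

Even in this toy version, the per-token compute depends only on top_k, while the parameter count grows with the number of experts.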

In particular, MoE offers the following improvements over dense architectures:

  • Improved model quality at fixed training cost: By incorporating multiple experts, MoE models can effectively increase the capacity of the model. This allows for more specialized and diverse representations of the input data, ultimately resulting in higher-quality models. At the same time, if we fix the number of experts selected for computation (i.e., fix top-k), the computational cost remains fixed regardless of the number of experts, since only a subset of experts is activated for each input token. As a result, MoE models can scale to larger sizes to improve model quality without a significant increase in the computational requirements for training.
  • Fast and economical MoE inference: During inference, when the batch size is sufficiently large, the inference cost is driven by the number of active parameters, which is significantly smaller than the total number of parameters in the model. Therefore, an MoE model can be more economical during inference than a quality-equivalent dense model, whose inference cost correlates with the total parameter count and typically results in higher computational demands. The implications of this are dramatic. For example, consider the Arctic model: it houses a staggering 480 billion parameters, yet for each token processed, only 17 billion parameters are activated (a toy parameter count below illustrates this decoupling).
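As a back-of-the-envelope illustration of this decoupling, the snippet below counts total versus active FFN parameters for a toy MoE configuration. Every size in it is a made-up example value, not Arctic's actual configuration.

```python
# Toy illustration of how MoE decouples total parameters from per-token compute.
# All sizes below are made-up example values, not Arctic's real configuration.
d_model, d_ffn = 4096, 14336             # hidden size and per-expert FFN width
num_layers, num_experts, top_k = 32, 64, 2

ffn_params_per_expert = 2 * d_model * d_ffn               # up- and down-projection weights
total_ffn = num_layers * num_experts * ffn_params_per_expert
active_ffn = num_layers * top_k * ffn_params_per_expert

print(f"total FFN parameters       : {total_ffn / 1e9:.1f}B")   # grows with num_experts
print(f"active FFN params per token: {active_ffn / 1e9:.1f}B")  # depends only on top_k
```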

Quantifying Model Quality Improvement

To better understand how MoE models improve upon traditional dense models, we carried out a study to compare their performance directly. We looked at how an MoE model with 1.6 billion active parameters compares to a 6.5-billion-parameter dense model, both trained on the same number of tokens: one trillion. Our findings show that the MoE-1.6B model not only performs better, achieving lower loss, but also requires 4x less compute to train. The loss curve is shown in Figure 2.

For those interested in the technical details, the specific model architectures and core hyper-parameters are detailed in Table 1. We used the AdamW optimizer with a cosine learning-rate decay schedule, a batch size of 4 million tokens, and a warmup of 2,000 iterations.
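For readers who want to reproduce a similar schedule, here is one way to implement linear warmup followed by cosine decay in plain Python. The peak and minimum learning rates are placeholders (the actual per-model values are in Table 1); the total step count of 250,000 simply follows from one trillion tokens at 4 million tokens per batch.

```python
import math

def lr_at_step(step, peak_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=250_000):
    """Linear warmup followed by cosine decay to min_lr.

    peak_lr and min_lr are illustrative placeholders; see Table 1 for the
    values used per model. total_steps = 1T tokens / 4M tokens per batch.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * min(progress, 1.0)))

for s in (0, 1_999, 2_000, 125_000, 250_000):
    print(f"step {s:>7}: lr = {lr_at_step(s):.2e}")
```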

Table 1: The architecture specifications and learning rates used to train the models.
Figure 2: The loss curve of Dense-6.5B vs MoE-1.6B.

How to choose the best MoE Architecture?

While it is clear that MoE can significantly improve model quality for a given compute budget, it also introduces a set of unknowns. Unlike well-researched dense transformer architectures, MoE architectures are still in their infancy and not as extensively understood. Very limited research has been done to identify the best configurations — such as the optimal number of experts, the size of each expert, how many experts to activate at once (top-k gating), and the interval between layers at which experts are engaged. Each of these choices can dramatically affect the model’s effectiveness and efficiency given a fixed compute and parameter budget.

Here we try to shed some light on how to make these choices, focusing particularly on two critical design aspects:

  • Top-k selection: We explore how the choice of top-k, i.e., the number of experts activated for each input token, impacts the model’s quality under a fixed compute and parameter budget.
  • Frequency of MoE layers: Here we study how the frequency at which MoE layers are used within the model affects model quality. In other words, we provide insight into whether we should replace a standard feed-forward network (FFN) with an MoE variant in every transformer layer, every alternate layer, or less frequently.

Top-k selection: Top-1 vs. Top-2

Top-1 and Top-2 gating are the two most commonly used gating functions for MoE training. To compare these approaches fairly, we made the following adjustments:

  1. Halved the size of the FFN layer for Top-2 gating: This adjustment ensures that the total number of parameters activated remains consistent between Top-1 and Top-2 setups.
  2. Doubled the number of experts for Top-2: To compensate for the reduction in parameters per expert, we increased the number of experts, maintaining overall model size and compute resources.

For this experiment, designed to keep both the compute and the model size constant, the loss comparison is shown in Figure 3. Interestingly, Top-2 gating was found to be more effective than Top-1 gating under these matched conditions. We also explored Top-3 and Top-4 gating functions, which showed potential for further improvements in model quality under similar constraints on active and total parameter counts. However, it is important to keep in mind that increasing the number of selected experts, while beneficial for model quality, can increase the MoE all-to-all communication costs, making it difficult to achieve high training efficiency.

Figure 3: Loss comparison between Top-1 and Top-2 gating with the same number of active and total parameters.
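The controlled setup above can be sanity-checked with a few lines of arithmetic. The snippet below uses arbitrary example sizes, not the configurations from Table 1, to show that halving the FFN width while doubling the expert count keeps both the total and the active FFN parameters identical between the Top-1 and Top-2 variants.

```python
# Arbitrary example sizes -- not the actual configurations from Table 1.
d_model = 2048

def moe_budget(d_ffn, num_experts, top_k):
    per_expert = 2 * d_model * d_ffn                       # up- and down-projection weights
    return num_experts * per_expert, top_k * per_expert    # (total, active) per MoE layer

top1_total, top1_active = moe_budget(d_ffn=8192, num_experts=32, top_k=1)
top2_total, top2_active = moe_budget(d_ffn=4096, num_experts=64, top_k=2)  # half width, double experts

assert top1_total == top2_total and top1_active == top2_active
print(f"total per MoE layer : {top1_total / 1e6:.0f}M parameters")
print(f"active per MoE layer: {top1_active / 1e6:.0f}M parameters")
```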

MoE Layer Frequency Selection

Another common question for MoE models is the frequency of MoE layers, for example:

  1. Every layer MoE (replace all FFN layers with MoE layers) or
  2. Every other layer MoE (interleave dense FFN with MoE layers) or
  3. Less frequent placements, and so on.

Figure 4: MoE layer frequency study with a similar amount of total parameters.
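In code, this choice often reduces to a single frequency knob that decides which transformer blocks swap their dense FFN for an MoE layer. The helper below is a hypothetical illustration of that knob, not the actual Arctic model code.

```python
def layer_plan(num_layers, moe_frequency):
    """Return which blocks use an MoE FFN for a given placement frequency.

    moe_frequency=1 means every layer is MoE, 2 means every other layer, etc.
    Hypothetical helper for illustration only.
    """
    return ["MoE-FFN" if i % moe_frequency == 0 else "dense-FFN"
            for i in range(num_layers)]

print(layer_plan(8, 1))   # every layer MoE
print(layer_plan(8, 2))   # MoE interleaved with dense FFN layers
```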

Key Intuition

A key takeaway from the above two studies is that the number of ways experts can be combined across consecutive layers is crucial to the final model quality. For instance, in the base setup for a two-layer model with E experts and Top-1 gating, there are E² possible combinations of experts. If we increase the number of experts to 2E and apply Top-2 gating, while keeping the total compute and parameter count unchanged, the number of possible expert combinations jumps to approximately 4E⁴, offering more choices and potentially enhancing model quality, as shown in Figure 3. On the other hand, if Top-1 gating is used with ‘every other layer’ MoE, the number of possible combinations drops to 2E, resulting in a weaker model, as shown in Figure 4.
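To put concrete numbers on this intuition, the short calculation below counts the routing combinations for an illustrative E = 8, using exact binomial coefficients rather than the approximate closed forms above.

```python
from math import comb

E = 8  # illustrative number of experts in the base setup

base        = comb(E, 1) ** 2      # E experts, Top-1, two MoE layers      -> E^2
more_combos = comb(2 * E, 2) ** 2  # 2E experts, Top-2, two MoE layers     -> ~4E^4
every_other = comb(2 * E, 1)       # 2E experts, Top-1, every other layer  -> 2E

print(base, more_combos, every_other)   # 64 14400 16
```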

Preview of Arctic Dense-MoE Hybrid Architecture

We now have a deeper understanding of the MoE architecture design space and the associated quality trade-offs.

However, selecting an optimal architecture requires looking beyond these quality trade-offs. Training and deploying an MoE model introduces a plethora of system challenges. It is critical to co-design the MoE architecture with the system in mind, such that these challenges can be addressed in a holistic way.

Figure 5: Comparing dense, traditional MoE, and Dense-MoE Hybrid transformer architectures.

Arctic uses a Dense-MoE Hybrid Architecture (shown in Figure 5) with Top-2 gating over 128 experts to strike a balance between quality improvement and system efficiency. In our upcoming blogs, we will delve into the training and inference challenges faced by large MoE models like Arctic. We will also discuss how we co-designed the Arctic hybrid architecture to achieve high quality while effectively addressing these challenges. Please stay tuned for more insights.
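As a preview only, one way to read the hybrid design in Figure 5 is a block whose dense FFN path is augmented by a parallel MoE branch whose output is added residually. The sketch below reflects that reading with placeholder sizes; it is an assumption for illustration, not the Arctic implementation, whose exact wiring and system co-design will be covered in the upcoming posts.

```python
import torch
import torch.nn as nn

class HybridFFNBlock(nn.Module):
    """One possible reading of the Dense-MoE hybrid FFN path in Figure 5:
    a dense FFN plus a parallel MoE branch, both added residually.
    Placeholder sizes; not the actual Arctic implementation."""

    def __init__(self, d_model=512, moe_branch=None):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dense_ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        # Plug in any token-routed MoE layer here, e.g. the TopKMoELayer
        # sketch from earlier in this post.
        self.moe_branch = moe_branch if moe_branch is not None else nn.Identity()

    def forward(self, x):                  # x: [num_tokens, d_model]
        h = self.norm(x)
        return x + self.dense_ffn(h) + self.moe_branch(h)

x = torch.randn(16, 512)
print(HybridFFNBlock()(x).shape)           # torch.Size([16, 512])
```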

Learn more in our Snowflake Arctic series

Check out our other blog posts that dive into Snowflake Arctic training, including data cleaning, training system design, and model and system co-design for optimal throughput. Stay tuned as more updates continue to drop in the Snowflake Arctic cookbook catalog.
