Deep Dive: Mixture of Experts Explained

Omer Khalid, PhD
3 min read · Apr 4, 2024

How do MoE LLMs differ from dense LLMs?

Earlier this week, I talked about Databricks' release of the DBRX LLM and its implications for the enterprise landscape. DBRX also happens to be a Mixture of Experts (MoE) LLM built on the MegaBlocks research, and this type of LLM architecture has been gaining traction lately, so I thought I'd publish a deeper dive into MoEs.

Overview

Mixture of Experts (MoE) is an architecture for large language models (LLMs) that aims to improve their efficiency, scalability, and performance. The approach decomposes the model into multiple smaller expert sub-networks, each specializing in a specific task or domain.

The MoE architecture consists of two main components:

1. Experts: These are the individual models or sub-networks that are trained to become specialists in specific areas. For example, there could be experts for natural language understanding, question answering, mathematical reasoning, and so on.

2. Router (or Gating Network): The router is the component responsible for selectively activating the relevant experts based on the input. It learns to route each input to the most appropriate experts, so that only a subset of experts is used for a given input rather than the entire model (a minimal code sketch follows below).

MoE layer from the "Outrageously Large Neural Networks" paper (Shazeer et al., 2017)
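
To make the experts-plus-router idea concrete, here is a minimal sketch of a sparsely gated MoE layer in PyTorch. It is illustrative only, not DBRX's or MegaBlocks' actual implementation; the class name, layer sizes, and the choice of top-2 routing are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative sparsely gated MoE layer: a router picks top-k experts per token."""

    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small feed-forward sub-network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The router (gating network) scores every expert for each token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                      # x: (num_tokens, d_model)
        logits = self.router(x)                # (num_tokens, num_experts)
        weights, chosen = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the selected experts only
        out = torch.zeros_like(x)
        # Only the chosen experts run for each token; the rest stay idle.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route a batch of 4 token embeddings through the layer.
layer = MoELayer()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```

With 8 experts and top-2 routing, each token activates only a quarter of the experts' parameters per forward pass, which is the source of the efficiency gain over a dense layer of the same total size.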
