OLMoE: An open, small, and state-of-the-art mixture-of-experts model

Ai2 · Published in Ai2 Blog · Sep 4, 2024

We’re introducing OLMoE, jointly developed with Contextual AI, the first mixture-of-experts model to join the OLMo family. OLMoE brings two important qualities to the space of truly open models: it sits on the Pareto frontier of performance and size, and it is released with open data, code, evaluations, logs, and intermediate training checkpoints, making it the first model to combine both. Over the last few years, mixture-of-experts (MoE) architectures have become a core technology used by closed AI labs to train and serve leading language models (LMs) more efficiently. We’ve seen similar gains in our own training stack, where this MoE model trained 2x faster than an equivalent dense model.

OLMoE is a sparse MoE model with 1 billion active and 7 billion total parameters. It was trained on 5 trillion tokens from a new data mix, which we call OLMoE-Mix, that incorporates lessons from Ai2’s Dolma and builds heavily on DataComp-Baseline. We performed extensive experimentation on many crucial MoE details, including the routing algorithm, auxiliary loss functions, and sparse upcycling. A comparison of the model to other Ai2 OLMo models and to models in similar size categories is below.

OLMoE-1B-7B is the state of the art among models with 1B active parameters and outperforms various larger models.
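
To make the architecture concrete, here is a minimal PyTorch-style sketch of a sparse MoE feed-forward block with top-k routing and a load-balancing auxiliary loss. The dimensions, expert count, and top-k value are illustrative placeholders, not OLMoE’s exact configuration, and the loop over experts is written for clarity rather than speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Sketch of a sparse MoE feed-forward block: each token is routed to its top-k experts."""

    def __init__(self, d_model=1024, d_ff=1024, num_experts=64, top_k=8):  # illustrative sizes
        super().__init__()
        self.num_experts, self.top_k = num_experts, top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (num_tokens, d_model); probs: per-token routing distribution over experts.
        probs = F.softmax(self.router(x), dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            # Weight each selected expert's output by its routing probability.
            out[token_idx] += topk_probs[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        # Switch-style load-balancing auxiliary loss: encourages an even spread of
        # tokens across experts by matching assignment fractions to routing mass.
        assign_frac = F.one_hot(topk_idx, self.num_experts).float().mean(dim=(0, 1))
        prob_frac = probs.mean(dim=0)
        aux_loss = self.num_experts * (assign_frac * prob_frac).sum()
        return out, aux_loss

# Example: route 16 tokens through the block.
moe = SparseMoE()
y, aux = moe(torch.randn(16, 1024))
```

Only the top-k experts run for each token, which is why a model with 7B total parameters can have the compute cost of a roughly 1B-parameter dense model.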

The release of this model is also accompanied by a preview version of our new Tulu 3 post-training pipeline. This version includes additional instruction data from HuggingFace’s No Robots human dataset, math and code data, and a subset of Nvidia’s Daring Anteater synthetic data. This mix gives noticeable improvements across math, code, and instruction-following evaluations; the resulting model is then put through standard UltraFeedback preference tuning with Direct Preference Optimization (DPO). A comparison of the gains from supervised fine-tuning (SFT) and DPO is shown below.

After instruction and preference tuning, OLMoE outperforms various larger prior MoEs, including DeepSeek, Qwen, and JetMoE.
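
For reference, the DPO objective used in the preference-tuning stage can be sketched in a few lines. The function below is a generic, hedged illustration rather than our exact training code; the argument names and the beta value are placeholders. It takes summed log-probabilities of the chosen and rejected responses under the policy being tuned and under the frozen SFT reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much more (log-)likely the policy makes each
    # response compared with the frozen SFT reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the log-sigmoid of the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Minimizing this loss widens the policy’s implicit reward margin for the preferred responses in UltraFeedback relative to the reference model, without training a separate reward model.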

We are releasing many variants and checkpoints of this model to enable multiple directions of LM research (see the loading example after this list).

  • 244 checkpoints for the pretrained model, one every 5000 steps.
  • The annealed and unannealed checkpoints.
  • Fine-tuned versions on both the annealed and unannealed base models.
  • Fine-tuned versions with and without load balancing through the experts.
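
As an illustration of how these artifacts can be used, the snippet below loads a released checkpoint with Hugging Face transformers. The repository id and the revision (branch) name are assumptions for illustration; consult the released model cards for the exact identifiers, and note that a recent transformers version with OLMoE support is needed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "allenai/OLMoE-1B-7B-0924"  # assumed repo id for the pretrained base model
tokenizer = AutoTokenizer.from_pretrained(repo)
# Intermediate pretraining checkpoints are typically exposed as revisions;
# "step200000" is a hypothetical revision name for one of the 244 checkpoints.
model = AutoModelForCausalLM.from_pretrained(repo, revision="step200000")

prompt = "Mixture-of-experts language models are"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```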

For more details, check out the released data, code, evaluations, logs, and checkpoints.

Follow @allen_ai on Twitter/X, and subscribe to the Ai2 Newsletter to stay current on news and research coming out of Ai2.
