Introduction to Mixtral 8x7B!

Rishi
4 min read · Jan 9, 2024


Introduction

Have you heard about Mixtral 8x7B?
It’s the latest Sparse Mixture of Experts (SMoE) language model, and it’s generating a lot of buzz! The model has the same architecture as Mistral 7B, with one key difference: each layer contains 8 feedforward blocks that act as experts.

What’s really impressive about Mixtral is its router network, which for every token at every layer selects two of those experts to process the current hidden state. This gives the model access to 47B total parameters while using only about 13B active parameters per token during inference. And the best part? Mixtral outperforms or matches much larger models across a range of benchmarks, and it is especially strong in mathematics, code generation, and multilingual tasks. So let's dive in!

Code and Availability

You can find the code for Mixtral 8x7B at https://github.com/mistralai/mistral-src.

Both Mixtral 8x7B and Mixtral 8x7B Instruct are released under the Apache 2.0 license, so they can be used for both academic and commercial purposes.

Architectural Details

Mixtral is built on a transformer architecture modified to support a fully dense context length of 32k tokens. Its hyperparameters (embedding dimension, number of layers, attention heads, and so on) largely mirror Mistral 7B and are listed in the paper. The key difference is the Sparse Mixture of Experts (SMoE) layer, which keeps cost and latency under control by selecting only two experts per token.
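
To make the numbers concrete, here is a minimal Python sketch of a configuration object holding the key hyperparameters reported in the Mixtral paper. The class itself and its field names are illustrative, not Mistral’s actual code.

```python
from dataclasses import dataclass

@dataclass
class MixtralConfig:
    # Core transformer dimensions (values as reported in the Mixtral paper)
    dim: int = 4096            # model (embedding) dimension
    n_layers: int = 32         # number of transformer layers
    head_dim: int = 128        # dimension per attention head
    hidden_dim: int = 14336    # feedforward hidden dimension of each expert
    n_heads: int = 32          # attention heads
    n_kv_heads: int = 8        # key/value heads (grouped-query attention)
    context_len: int = 32768   # fully dense context length
    vocab_size: int = 32000
    # Sparse Mixture of Experts settings
    num_experts: int = 8       # feedforward experts per layer
    top_k_experts: int = 2     # experts selected per token by the router
```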

Sparse Mixture of Experts

In simple terms, a Mixture of Experts layer computes its output as a weighted sum of the outputs of several expert networks, with the weights produced by a gating network. Mixtral uses a sparse version of this layer: at each layer, a router network selects just two of the eight experts to process each token. This is a neat trick, because it increases the model’s total parameter count while keeping the per-token computation, and therefore cost and latency, close to that of a much smaller dense model. You get more capacity without paying much more at inference time.
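
To make the routing concrete, here is a minimal PyTorch sketch of a top-2 sparse MoE feedforward layer. It is a simplified illustration under assumptions: the expert is a plain SiLU MLP rather than Mistral’s SwiGLU block, and the loop over experts is written for clarity, not speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Sketch of a top-2 sparse MoE feedforward layer (illustrative, not Mistral's code)."""

    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts, bias=False)   # the router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim, bias=False),
                          nn.SiLU(),
                          nn.Linear(hidden_dim, dim, bias=False))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        logits = self.gate(x)                                   # (num_tokens, num_experts)
        topk_logits, topk_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_logits, dim=-1)                # softmax over the two selected experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out
```

Because only two of the eight experts run for each token, the per-token compute stays close to that of a roughly 13B dense model even though the total parameter count is about 47B.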

Results

Mixtral’s performance is thoroughly compared with Llama models across a wide range of benchmarks, and the results are impressive! Even though it uses roughly 5x fewer active parameters during inference, Mixtral outperforms or matches Llama 2 70B, particularly in mathematics and code generation.

Multilingual Benchmarks

Mixtral is strong at multilingual tasks too. Its pretraining upsamples multilingual data, and it performs notably well in French, German, Spanish, and Italian.

Long Range Performance

According to the research paper, Mixtral is also impressive on long-context tasks. It achieves 100% accuracy on the passkey retrieval task, and on the proof-pile dataset its perplexity decreases as the context length increases.
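
For context, the passkey retrieval task hides a random key inside a long stretch of filler text and asks the model to recall it. A toy version of such a prompt might be built like this; this is a generic illustration of the task, not the exact prompt used in the paper.

```python
import random

def make_passkey_prompt(filler_repeats: int = 200) -> tuple[str, str]:
    """Build a toy passkey-retrieval prompt: a random key buried in filler text."""
    passkey = str(random.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. The sun is yellow. " * filler_repeats
    prompt = (
        f"{filler}\n"
        f"The pass key is {passkey}. Remember it.\n"
        f"{filler}\n"
        "What is the pass key? The pass key is"
    )
    return prompt, passkey

prompt, expected = make_passkey_prompt()
# Feed `prompt` to the model and check whether its completion contains `expected`.
```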

Bias Benchmarks

And here’s another thing: Mixtral shows less bias than Llama 2 70B. Evaluations on bias benchmarks such as BBQ and BOLD show that Mixtral achieves higher accuracy on BBQ and displays more positive sentiment on BOLD.

Instruction Fine-tuning

Mixtral 8x7B Instruct is trained with supervised fine-tuning on an instruction dataset, followed by Direct Preference Optimization (DPO) on a paired feedback dataset. The result follows instructions remarkably well: it outperforms comparable models on the MT-Bench benchmark and, at release, had the highest Arena Elo rating among open-weights models on the LMSys Leaderboard.
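
As a reference point, the standard DPO objective compares the policy’s log-probabilities on preferred and rejected completions against a frozen reference model. Below is a minimal sketch of that generic loss; Mistral’s actual fine-tuning code is not public, and the function names and beta value are assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Generic DPO objective on sequence log-probabilities (not Mistral's training code)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and rejected completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```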

Routing Analysis

An analysis of the router’s expert selections during training shows that experts don’t specialize in specific domains; instead, selection correlates more with syntax than with topic, especially at the first and last layers. There is also a tendency to assign the same experts to consecutive tokens, particularly at the higher layers.
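
As a rough illustration of that last observation, one could measure how often the top-1 expert repeats across consecutive tokens from logged routing decisions. The helper below is hypothetical, since the paper’s analysis tooling is not public.

```python
import torch

def consecutive_repeat_rate(expert_ids: torch.Tensor) -> float:
    """Fraction of consecutive token positions that keep the same top-1 expert.

    `expert_ids` is a 1-D tensor of the expert chosen for each token in one layer
    (hypothetical logging output).
    """
    same = (expert_ids[1:] == expert_ids[:-1]).float()
    return same.mean().item()

# Repeat rates noticeably above 1/8 (chance level for 8 experts) would indicate
# the temporal locality in routing that the paper reports at higher layers.
print(consecutive_repeat_rate(torch.tensor([3, 3, 5, 5, 5, 1, 3, 3])))
```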

Conclusion

To wrap up, Mixtral 8x7B is a seriously impressive language model, with a well-thought-out design that balances capability and efficiency. I’m excited to see how Mixtral will be embraced by the community.

If you’re interested, you should definitely check out the code, play around with it, and contribute to the ongoing discussion around language models.

And hey, don’t forget that the journey doesn’t end here! Language models like Mixtral are always evolving and changing the way we process natural language, so there’s always more to discover and explore.
