Running Mixtral 8x7b on M1 16GB

Nikita Kiselov
Jan 13, 2024

Mistral AI has revolutionized the landscape of artificial intelligence with its Mixtral 8x7b model. Comparable to GPT-3.5 in answer quality, the model also offers strong support for languages such as French and German. What is especially impressive is that we can now run a substantial 47B-parameter model on a 16GB M1 Pro laptop.

Essentially, Mixtral 8x7B is a Mixture of Experts (MoE) model. Instead of one monolithic network, it uses eight smaller, fast 7B-scale experts, which keeps inference speedy and efficient. At each layer, Mixtral's router network selects two experts per token, so while the model has access to roughly 47B parameters in total, only about 13B are active for any given token during inference. Combined with a 32k-token context window, this design balances quality with performance and efficiency.

What about running it on only 16GB? Obviously we will use llama.cpp, but with a caveat :) We can't run the model as-is; we need to quantise it, i.e. compress the precision of its weights. This is where QuIP shines: a new state-of-the-art method for 2-bit(!) quantisation that achieves such extreme compression with a relatively small loss in quality.
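To make the quantisation step concrete, here is a rough sketch of how a 2-bit GGUF can be produced with llama.cpp's quantize tool once it is built (see the install step below). The file names are placeholders and the exact 2-bit quant type depends on your llama.cpp version; in practice you can also simply download an already-quantised GGUF instead of converting the full-precision weights yourself.

# Placeholder file names; the quant type (e.g. Q2_K) depends on your llama.cpp build
./quantize ./models/mixtral-8x7b-f16.gguf ./models/mixtral-8x7b-q2.gguf Q2_K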

So, let’s start!

Install llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
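Once the build finishes (recent versions typically enable Metal by default on Apple Silicon; if not, rebuild with LLAMA_METAL=1 make), you can point the main binary at your quantised model. A minimal sketch, where the model path is a placeholder for whichever 2-bit Mixtral GGUF you downloaded or produced:

# Placeholder model path: substitute your 2-bit Mixtral GGUF
./main -m ./models/mixtral-8x7b-q2.gguf -p "Explain mixture of experts in one paragraph." -n 256 -c 4096 -t 8 -ngl 99

Here -ngl offloads layers to the GPU via Metal; on 16GB of unified memory you may need to lower it or shrink the context size (-c) if you run into memory pressure.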
