RAG with Mixtral, the Best Open-Source Model, Offline on a Windows CPU (56B Parameters)

Mollel Michael (PhD)
7 min read · Dec 18, 2023


Notebook link: https://github.com/msamwelmollel/Windows-CPU-on-Mixtral-56B-Parameters

In this article, I bring you what is currently the best in the open-source community. I explore one of the most powerful models presently available, Mixtral 8x7B, which is roughly equivalent to a 56B-parameter model. Instead of using all 56B parameters during inference, the model relies on a technique called Mixture of Experts (MoE) and activates only the equivalent of about 12B parameters per token, while still drawing on what was learned across all 56B parameters. This article does not cover the details of MoE; instead, we look at how to use the model offline on your CPU (Windows).

Why Mixtral

Even though it activates only about 12 billion parameters per query, Mixtral taps into expansive knowledge by selectively routing each query to different subsets of its parameters. This enables wide-ranging reasoning without the compute cost of evaluating all 56B parameters. By running Mixtral locally on a CPU, we can get powerful insights without relying on external servers for inference.

Why this tutorial?

I aim to provide a friendly introduction for anyone looking to get the most out of modern machine-learning techniques. Instead of academic jargon and math, I focus on simple, reproducible examples of high-performance question answering with Mixtral. The tutorials I develop will equip readers to build intelligent assistants tailored to their needs, all running fully offline on consumer hardware. No specialized skills are needed; just bring your curiosity!

My PC Specification

  1. Memory and hardware: I am using a Dell XPS 15 7590 with the following specifications:
[Figure: Dell XPS 15 7590 specification]

2. Python version and library: This tutorial uses Python 3.10.12

3. Mixtral Model: The Mixtral 8x7B model used in this article is graciously provided by TheBloke on Hugging Face, available at this link: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF. I have opted to use the smallest quantized version of the model to avoid overtaxing my PC's RAM. However, feel free to experiment with the larger variants if you have sufficient memory, as they may enable more accurate inference. TheBloke's contributions make state-of-the-art ML accessible to all; huge thanks for publicly sharing Mixtral! My experiments build on their work quantizing such a powerful model into an efficient format.

https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/blob/main/mixtral-8x7b-instruct-v0.1.Q2_K.gguf

Use Cases and Applications: As a foundation model, Mixtral can enable a wide range of downstream applications beyond its pre-training objectives. In this article, I focus on one popular use case: retrieving answers from custom documents using the RAG (Retrieval Augmented Generation) framework. RAG leverages efficient vector search to identify relevant passages for answer extraction.

Specifically, I showcase integrating Mixtral with the llama-index library to index personal datasets for fast nearest-neighbor lookups. This powers a chat interface tapping into niche documents unavailable in Mixtral’s standard knowledge. While question answering is just one demonstration, Mixtral’s versatility can extend to text translation, content generation, summarization, and other NLP tasks. I aim to provide concrete examples of setting up performant RAG applications with Mixtral and llama-index as the building blocks. I hope to equip readers to apply foundation models like Mixtral to customize intelligent agents for their unique needs, accessing specialized corpora.

Step-by-Step Guide: In this part, let's dive into the installation and implementation of the RAG pipeline, step by step.

  1. Install the necessary libraries (use pip install in a terminal, or !pip install inside a notebook cell): a) sentence-transformers b) llama-index c) langchain

Note: you need llama-cpp-python version 0.2.23 or higher.

Run: pip install --upgrade llama-cpp-python
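Putting this together, the installation might look like the following in a Windows terminal (package names as listed above; I also include pypdf as an assumed extra, since llama-index's PDF reader typically relies on it when we load documents later):

pip install sentence-transformers llama-index langchain
pip install --upgrade llama-cpp-python
pip install pypdf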

2. Import the libraries: these imports bring in the tools needed for RAG. The critical library enabling RAG functionality is llama-index, which provides an easy-to-use wrapper for efficient vector indexing and nearest-neighbor search. By handling low-level details like data structures and similarity calculations under the hood, llama-index simplifies the process of powering chatbots with custom document retrieval.

I recommend further exploring the llama-index documentation to apply the techniques covered here to even more advanced use cases. It outlines additional options for tailoring index construction, storage formats, retrievers, and inference settings. Adapt these higher-level controls to best serve your specific retrieval and routing needs.

import logging
import sys

from sentence_transformers import SentenceTransformer  # not used directly below, but confirms the install

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import ServiceContext, set_global_service_context
from llama_index.embeddings.langchain import LangchainEmbedding

# Log to stdout so you can follow indexing and inference progress
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

3. Load data from your directory: point SimpleDirectoryReader at the folder where you store your PDFs and other documents.

documents = SimpleDirectoryReader("data/").load_data()

My document is a PDF containing some basic information about me; it is not a comprehensive description.

4. Load the Mixtral 8x7B model named “mixtral-8x7b-instruct-v0.1.Q2_K”

llm = LlamaCPP(
    # Optionally, pass a URL to a GGUF model and it will be downloaded automatically
    # model_url='mistral-7b-instruct-v0.1.Q4_K_M.gguf',
    # Set the path to a pre-downloaded model instead of model_url
    model_path='mixtral-8x7b-instruct-v0.1.Q2_K.gguf',
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # n_gpu_layers has no effect on a CPU-only build of llama-cpp-python;
    # -1 offloads all layers when a GPU-enabled build is available
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into the Llama 2 / Mixtral instruct prompt format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
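Before building the index, it can be worth a quick smoke test that the model loads and generates text on its own (a minimal check; the prompt here is arbitrary):

# Sanity check: generate a short completion directly from the model
print(llm.complete("Introduce yourself in one sentence."))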

5. Load the embeddings. Depending on your use case, you can choose whichever embedding model performs best on your data. When loading embeddings for the document index, you can experiment with different vectorizers tailored to your data domain. For datasets with more technical language or sparse vocabulary, consider SciBERT or BioBERT embeddings pre-trained on scientific papers and medical corpora. For casual dialogue with more colloquial queries, T5-base or even GPT-3 embeddings may capture the semantics better.

The llama-index interface makes it easy to hot-swap different embeddings until you achieve optimum retrieval performance. Don't be afraid to evaluate multiple embedding sources specific to your application, whether optimized for speed, accuracy, or memory. Every dataset poses unique challenges. Leverage existing embeddings whenever possible rather than training custom vectors from scratch.

embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="thenlper/gte-large")
    # You can also use a sentence-transformers embedding model instead:
    # HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)

6. Set up the ServiceContext and VectorStoreIndex components, which together form the retrieval-augmented generation (RAG) pipeline, combining external knowledge with the LLM's language understanding to respond to user input.

service_context = ServiceContext.from_defaults(
    chunk_size=256,
    llm=llm,
    embed_model=embed_model,
)
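If you prefer not to pass the service context around explicitly, you can also register it globally, which is what the set_global_service_context import above is for:

set_global_service_context(service_context)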


index = VectorStoreIndex.from_documents(documents, service_context=service_context)
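Re-embedding documents on every run can be slow on a CPU, so one optional refinement is to persist the index to disk and reload it later. This is a small sketch using llama-index's storage API; the ./storage directory name is just an example:

# Save the freshly built index to disk
index.storage_context.persist(persist_dir="./storage")

# Later, in a fresh session, rebuild the index from disk instead of re-embedding
from llama_index import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context, service_context=service_context)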

7. Finally, you can pose questions and converse with the LLM, drawing on its knowledge and understanding of language to provide helpful information and dialogue.

query_engine = index.as_query_engine()
response = query_engine.query("Where did Michael attend his primary school?")

print(response)

Response:
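Since a chat interface was mentioned earlier: if you want multi-turn conversation over the same index rather than one-off queries, llama-index also exposes a chat engine. This is a minimal sketch; "condense_question" is one of several available chat modes, and the follow-up question is purely illustrative:

chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)
print(chat_engine.chat("Where did Michael attend his primary school?"))
print(chat_engine.chat("And what did he do after that?"))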

Important Tips for Memory Usage:

  1. Before I loaded the quantized model, I freed up memory and ended up with 23GB of available memory.

2. I used a maximum of 19GB of memory for the notebook during inference. Therefore, I suggest ensuring at least 20GB of available memory to run this 2-bit quantized model successfully (a quick way to check is shown below).
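As a small sketch, assuming you also install the psutil package (not part of this tutorial's requirements), you can check available memory from Python before loading the model:

import psutil

# Report available system memory in gigabytes
available_gb = psutil.virtual_memory().available / 1024**3
print(f"Available memory: {available_gb:.1f} GB")
if available_gb < 20:
    print("Warning: less than 20 GB free; the Q2_K Mixtral model may not fit.")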

Conclusion
In this article, I have provided a beginner-friendly introduction to leveraging the powerful Mixtral model offline for customized question answering. By walking through integrating Mixtral with llama-index for efficient retrieval over personal datasets, I aimed to make state-of-the-art foundation models approachable even for non-experts.

The step-by-step tutorial covered:

  1. Setting up the environment with relevant libraries like llama-index, sentence-transformers, and langchain
  2. Loading custom documents and indexing for fast nearest neighbor search
  3. Initializing Mixtral 8x7B and embedding models
  4. Creating a RAG pipeline with the VectorStoreIndex and ServiceContext
  5. Issuing sample queries to answer questions via retrieved passages

I sought to equip readers to apply similar techniques to create intelligent assistants tailored to their niche needs. The RAG framework offers incredible versatility beyond QA as well—you can customize chatbots to translate text, generate content, summarize documents, and more in your specialized domain.

I highly recommend exploring the llama-index documentation further to use retrieval augmentation with models like Mixtral effectively. The simple examples here aim to ignite your imagination for building helpful, ethical AI solutions enhanced by foundation models. Let me know in the comments if you have any other use cases you'd like me to cover in future tutorials.
Notebook link: https://github.com/msamwelmollel/Windows-CPU-on-Mixtral-56B-Parameters
