Mixtral: Generative Sparse Mixture of Experts in DataFlows

Tim Spann
Cloudera
Mar 8, 2024

Mixtral-8x7B-Instruct-v0.1

“The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.”

So when I saw this come out, it seemed pretty interesting and accessible, so I gave it a try. With the proper prompting it seems good. I am not sure if it is better than Google Gemma, Meta Llama 2, or Mistral running on Ollama for my use cases.

URL

https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1

This model can be run through the lightweight serverless REST API or with the transformers library. You can also serve it with vLLM (https://github.com/vllm-project/vllm). The context window can hold up to 32k tokens, and you can enter prompts in English, French, Italian, German, and Spanish.
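If you want to test the serverless endpoint outside of NiFi first, a minimal Python sketch looks like this. The HF_TOKEN environment variable, the sample question, and max_new_tokens are my placeholders; in the dataflow the same POST is made by an InvokeHTTP processor.

# Minimal sketch of the call the flow makes with InvokeHTTP.
# HF_TOKEN is assumed to hold a Hugging Face API token.
import os
import requests

API_URL = "https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

payload = {
    "inputs": "<s>[INST]Write a detailed complete response that appropriately answers the request.[/INST] User: What is Apache NiFi?</s>",
    "parameters": {"max_new_tokens": 512},
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json()[0]["generated_text"])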

There Are Some Guides to Build Your Prompts Optimally

Constructing the prompt is critical to making this work well, so we are building it with Apache NiFi.

Prompt Template

{ 
"inputs":
"<s>[INST]Write a detailed complete response that appropriately
answers the request.[/INST]
[INST]Use this information to enhance your answer:
${context:trim():replaceAll('"',''):replaceAll('\n', '')}[/INST]
User: ${inputs:trim():replaceAll('"',''):replaceAll('\n', '')}</s>"
}
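
For reference, here is a rough Python equivalent of what the Expression Language above does: trim the attributes, strip double quotes and newlines, and assemble the Mixtral instruct prompt. The context and question arguments stand in for the ${context} and ${inputs} flowfile attributes, so the names are illustrative only.

# Rough equivalent of the NiFi Expression Language cleanup and prompt assembly.
def clean(text: str) -> str:
    return text.strip().replace('"', '').replace('\n', '')

def build_payload(context: str, question: str) -> dict:
    prompt = (
        "<s>[INST]Write a detailed complete response that appropriately "
        "answers the request.[/INST]"
        f"[INST]Use this information to enhance your answer: {clean(context)}[/INST] "
        f"User: {clean(question)}</s>"
    )
    return {"inputs": prompt}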

Added a Filter for NSFW

So, as part of our prompt engineering, I added a call to a separate model to filter out NSFW text coming from Slack.
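
As a sketch only, that check might look roughly like the snippet below; the model name, the "NSFW" label, and the threshold are placeholders rather than the flow's actual configuration.

# Sketch of an NSFW check via the Hugging Face Inference API.
# SOME_NSFW_CLASSIFIER, the "NSFW" label, and the 0.5 threshold are placeholders.
import os
import requests

CLASSIFIER_URL = "https://api-inference.huggingface.co/models/SOME_NSFW_CLASSIFIER"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

def is_safe(text: str, threshold: float = 0.5) -> bool:
    scores = requests.post(CLASSIFIER_URL, headers=headers, json={"inputs": text}).json()[0]
    nsfw_score = next((s["score"] for s in scores if s["label"].upper() == "NSFW"), 0.0)
    return nsfw_score < threshold

Messages that fail the check can then be filtered out before the prompt-building step.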

Slack Response Template

===============================================================================================================
HuggingFace ${modelinformation} Results on ${date}:

Question: ${inputs}

Answer:
${generated_text}

=========================================== Data for nerds ====

HF URL: ${invokehttp.request.url}
TXID: ${invokehttp.tx.id}

== Slack Message Meta Data ==

ID: ${messageid} Name: ${messagerealname} [${messageusername}]
Time Zone: ${messageusertz}

== HF ${modelinformation} Meta Data ==

Compute Characters/Time/Type: ${x-compute-characters} / ${x-compute-time}/${x-compute-type}

Generated/Prompt Tokens/Time per Token: ${x-generated-tokens} / ${x-prompt-tokens} : ${x-time-per-token}

Inference Time: ${x-inference-time} // Queue Time: ${x-queue-time}

Request ID/SHA: ${x-request-id} / ${x-sha}

Validation/Total Time: ${x-validation-time} / ${x-total-time}
===============================================================================================================
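
All of the x-* fields above come from the Hugging Face response headers, which InvokeHTTP exposes as flowfile attributes. A quick way to see them outside of NiFi (same placeholder token as before):

# Print the response headers that the Slack template reports.
import os
import requests

API_URL = "https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

response = requests.post(API_URL, headers=headers,
                         json={"inputs": "<s>[INST]Say hello.[/INST]</s>"})
for name in ("x-compute-time", "x-compute-type", "x-request-id",
             "x-inference-time", "x-queue-time", "x-total-time"):
    print(name, "=", response.headers.get(name))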

We use Pinecone for RAG.
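
The retrieval step is what fills the ${context} attribute used in the prompt template. A rough sketch with the Pinecone Python client is below; the index name, embedding model, and metadata field are my assumptions, not the exact configuration of the flow.

# Sketch of the retrieval that fills ${context}.
# "slack-rag", the embedding model, and the "text" metadata field are placeholders.
import os
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("slack-rag")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_context(question: str, top_k: int = 3) -> str:
    vector = embedder.encode(question).tolist()
    results = index.query(vector=vector, top_k=top_k, include_metadata=True)
    return " ".join((match.metadata or {}).get("text", "") for match in results.matches)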

An update from previous articles

Sometimes image processing fails, so let's pass through the original image.

RESOURCES



Principal Developer Advocate, Zilliz. Milvus, Attu, Towhee, GenAI, Big Data, IoT, Deep Learning, Streaming, Machine Learning. https://www.datainmotion.dev/