Mixtral: Generative Sparse Mixture of Experts in DataFlows
“The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.”
When I saw this model released, it seemed interesting and accessible, so I gave it a try. With proper prompting it performs well, though I am not sure it is better than Google Gemma, Meta Llama 2, or Mistral (via Ollama) for my use cases.
URL
https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1
This model can be run through the lightweight serverless REST API or the transformers library. You can also use https://github.com/vllm-project/vllm. The context window holds up to 32k tokens, and you can write prompts in English, Italian, German, Spanish, and French.
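A minimal sketch of calling the serverless Inference API from Python. The token value, timeout, and the `max_new_tokens` parameter are assumptions you should adjust for your own account; the actual POST is shown commented out since it requires a valid token.

```python
# Sketch: build a request for the Hugging Face serverless Inference API.
# HF_TOKEN and max_new_tokens are assumptions -- tune them for your setup.
API_URL = "https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1"

def build_request(prompt: str, token: str):
    """Return the headers and JSON body for a text-generation call."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 512},  # assumed value
    }
    return headers, body

headers, body = build_request("<s>[INST]Hello[/INST]</s>", "hf_xxx")

# The actual call (needs a real token and network access):
# import requests
# resp = requests.post(API_URL, headers=headers, json=body, timeout=120)
# print(resp.json()[0]["generated_text"])
```

The same request shape is what NiFi's InvokeHTTP processor sends; the body mirrors the `"inputs"` field in the prompt template below.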
There Are Some Guides to Help You Build Your Prompts Optimally
Constructing the prompt is critical to making this work well, so we are building it with Apache NiFi.
Prompt Template
{
"inputs":
"<s>[INST]Write a detailed complete response that appropriately
answers the request.[/INST]
[INST]Use this information to enhance your answer:
${context:trim():replaceAll('"',''):replaceAll('\n', '')}[/INST]
User: ${inputs:trim():replaceAll('"',''):replaceAll('\n', '')}</s>"
}
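For readers who are not running NiFi, the template above can be reproduced in plain Python. The `clean` helper mirrors the NiFi expression-language chain `trim():replaceAll('"',''):replaceAll('\n','')`; the function and variable names here are my own illustration, not part of the NiFi flow.

```python
def clean(text: str) -> str:
    """Mirror NiFi's trim():replaceAll('"',''):replaceAll('\n',''):
    strip whitespace, then drop double quotes and newlines."""
    return text.strip().replace('"', '').replace('\n', '')

def build_prompt(context: str, inputs: str) -> str:
    """Assemble the Mixtral [INST] prompt used in the template above."""
    return (
        "<s>[INST]Write a detailed complete response that appropriately "
        "answers the request.[/INST]"
        f"[INST]Use this information to enhance your answer: {clean(context)}[/INST]"
        f"User: {clean(inputs)}</s>"
    )

payload = {"inputs": build_prompt('Line one\n"quoted"', "  What is NiFi? ")}
```

Stripping quotes and newlines keeps the JSON body valid when NiFi substitutes the attributes directly into the request text.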
Added a Filter for NSFW
As part of our prompt engineering, I added a call to an NSFW filter to screen out NSFW text coming from Slack.
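The gating logic can be sketched as below. `scorer` is a stand-in for whatever moderation model you call, and the 0.5 threshold is an assumption rather than a tuned value.

```python
# Hedged sketch: gate Slack messages on an NSFW score before they reach the LLM.
# The "nsfw" label name and 0.5 threshold are assumptions.
def is_safe(scores: dict, threshold: float = 0.5) -> bool:
    """Return True when the NSFW score stays under the threshold."""
    return scores.get("nsfw", 0.0) < threshold

def filter_messages(messages, scorer):
    """Keep only messages whose moderation scores pass the gate."""
    return [m for m in messages if is_safe(scorer(m))]

# Example with a dummy scorer that marks everything safe:
clean_msgs = filter_messages(["hello team"], lambda m: {"nsfw": 0.01})
```

In NiFi terms, this is a RouteOnAttribute decision: messages that fail the gate never get forwarded to the model.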
Slack Response Template
===============================================================================================================
HuggingFace ${modelinformation} Results on ${date}:
Question: ${inputs}
Answer:
${generated_text}
=========================================== Data for nerds ====
HF URL: ${invokehttp.request.url}
TXID: ${invokehttp.tx.id}
== Slack Message Meta Data ==
ID: ${messageid} Name: ${messagerealname} [${messageusername}]
Time Zone: ${messageusertz}
== HF ${modelinformation} Meta Data ==
Compute Characters/Time/Type: ${x-compute-characters} / ${x-compute-time}/${x-compute-type}
Generated/Prompt Tokens/Time per Token: ${x-generated-tokens} / ${x-prompt-tokens} : ${x-time-per-token}
Inference Time: ${x-inference-time} // Queue Time: ${x-queue-time}
Request ID/SHA: ${x-request-id} / ${x-sha}
Validation/Total Time: ${x-validation-time} / ${x-total-time}
===============================================================================================================
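Outside of NiFi, the same template can be rendered by substituting the flowfile attributes into a Python string. The attribute names follow the template above; the sample values here are illustrative only.

```python
def format_slack_message(attrs: dict) -> str:
    """Render a shortened version of the Slack response template
    from flowfile-style attributes (names mirror the template above)."""
    return (
        f"HuggingFace {attrs['modelinformation']} Results on {attrs['date']}:\n"
        f"Question: {attrs['inputs']}\n"
        f"Answer:\n{attrs['generated_text']}\n"
        f"HF URL: {attrs['invokehttp.request.url']}"
    )

msg = format_slack_message({
    "modelinformation": "Mixtral-8x7B-Instruct-v0.1",   # example value
    "date": "2024-03-01",                                # example value
    "inputs": "What is Apache NiFi?",
    "generated_text": "Apache NiFi is a dataflow system.",
    "invokehttp.request.url": "https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1",
})
```

The `x-compute-*` and `x-*-time` fields in the full template come straight from the HuggingFace response headers, which InvokeHTTP exposes as flowfile attributes.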
We use Pinecone for RAG.
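A sketch of how Pinecone query matches could become the `${context}` string fed into the prompt template. The `metadata["text"]` structure is an assumption about how documents were upserted, and the character cap is illustrative, not a tuned value.

```python
# Hedged sketch: turn Pinecone-style query matches into a single context string,
# cleaned the same way the NiFi template cleans ${context}.
def matches_to_context(matches: list, max_chars: int = 4000) -> str:
    """Join match metadata text, stripping quotes/newlines, capped at max_chars."""
    parts = []
    for m in matches:
        text = m.get("metadata", {}).get("text", "")
        text = text.strip().replace('"', '').replace('\n', ' ')
        if text:
            parts.append(text)
    return " ".join(parts)[:max_chars]

# Example matches in the shape Pinecone returns (assumed upsert schema):
context = matches_to_context([
    {"metadata": {"text": 'NiFi routes data.\n'}},
    {"metadata": {"text": '"Mixtral" is an MoE model.'}},
])
```

In the flow, the retrieved context lands in the `${context}` attribute that the prompt template above interpolates.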
An update from previous articles
Sometimes image processing fails, so we pass through the original image instead.