Putting Open Source AI in Your Pocket
Ollama + Quantisation = AI That Can Chat on Your Laptop
In my previous post, I walked through quantising the 7 billion parameter Llama-2 model down to 4-bit precision to reduce its size for local deployment on a consumer laptop. This compressed the model from 30GB down to 3.8GB while retaining high accuracy.
Shrinking Giants: Adapting Open Source AI for Everyday Devices
Now, I will demonstrate deploying this 4-bit quantised Llama model locally for chat applications using Ollama. By importing my optimised model into Ollama and running inferences, we can see first-hand the performance gains of quantisation.
As this post demonstrates, Ollama gives you the flexibility to deploy optimised models from external sources by importing a custom quantised model. It's also worth noting that Ollama ships with several pre-optimised models out of the box for even easier on-device deployment. By quantising and importing an external model myself, however, I could tune the precision and performance trade-offs for my specific hardware constraints.
How do you run a quantised model locally?
The answer is using ollama.ai. Ollama is a tool to get up and running locally with large language models, and it’s simple to use.
For macOS, download Ollama from https://ollama.ai
Create a local directory to store your quantised models.
mkdir models
In this example, we will use the quantised model from the last blog post.
Llama-2-7b-chat-hf-Q4_K_M.gguf
Create a file called Modelfile with a FROM instruction pointing at the quantised model.
FROM Llama-2-7b-chat-hf-Q4_K_M.gguf
There are other instructions and parameters you can use. For more details, check https://github.com/jmorganca/ollama
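For example, a Modelfile can also set sampling parameters and a system prompt alongside the FROM line. The sketch below uses real Modelfile instructions, but the values are illustrative rather than tuned recommendations:

```
FROM Llama-2-7b-chat-hf-Q4_K_M.gguf

# Sampling parameters (illustrative values -- tune for your own use case)
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

# System prompt applied to every conversation
SYSTEM You are a concise, helpful assistant.
```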
Create the model in Ollama.
ollama create Llama27bchat -f Modelfile
Run the model.
ollama run Llama27bchat
Here, you have a simple chat interface for the prompt and response.
You can also call Ollama's REST API from your own code. For example, in Python:
import requests

model = "Llama27bchat"
messages = [{"role": "user", "content": "Hello!"}]

r = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": model, "messages": messages, "stream": True},
)
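With stream set to true, the /api/chat endpoint streams the reply back as newline-delimited JSON chunks, each carrying a fragment of the answer in message.content, with done set to true on the final chunk. A minimal sketch of assembling those fragments (the sample chunks below are illustrative, not real model output):

```python
import json

def assemble_reply(ndjson_lines):
    """Join the content fragments from Ollama's streaming chat chunks."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        # Each chunk carries a fragment of the assistant's message
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Illustrative chunks in the shape Ollama streams back
sample = [
    '{"message": {"role": "assistant", "content": "Hello"}, "done": false}',
    '{"message": {"role": "assistant", "content": " there!"}, "done": true}',
]
print(assemble_reply(sample))  # Hello there!
```

In a live call you would feed `r.iter_lines()` from the requests response into the same function instead of the sample list.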
Conclusion
By combining model quantisation techniques with user-friendly tools like Ollama, the latest AI conversational capabilities can now be run on standard consumer hardware like laptops.
This allows personal use cases to benefit from innovations previously only available through cloud APIs. Performance is more than adequate for applications like chatbots, question-answering, semantic search, etc.
Further improvements in model optimisation, and tools that simplify the experience, would increase the adoption of on-device AI even more. But this already shows the potential for democratised access, as barriers of computing constraints and complexity are lowered by quantisation and wrappers like Ollama.
Interested in the intersection of Generative AI and retail? Share your thoughts in the comments below, or reach out for a deeper discussion.
david@davidkolbconsultancy.com