Putting Open Source AI in Your Pocket

David Kolb
3 min read · Jan 12, 2024


Ollama + Quantisation = AI That Can Chat on Your Laptop

A group of llama data scientists teaming up to symbolize open source collaboration.
Image Midjourney V6 | David Kolb

In my previous post, I walked through quantising the 7 billion parameter Llama-2 model down to 4-bit precision to reduce its size for local deployment on a consumer laptop. This compressed the model from 30GB down to 3.8GB while retaining high accuracy.
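The exact steps are covered in that post, but as a rough sketch, and assuming the GGUF file was produced with llama.cpp's conversion and quantisation tools (an assumption based on the Q4_K_M GGUF naming), the process looks something like this:

# Convert the Hugging Face weights to a 16-bit GGUF file (llama.cpp)
python convert.py ./Llama-2-7b-chat-hf --outtype f16 --outfile Llama-2-7b-chat-hf-f16.gguf
# Quantise the 16-bit GGUF down to 4-bit (Q4_K_M)
./quantize Llama-2-7b-chat-hf-f16.gguf Llama-2-7b-chat-hf-Q4_K_M.gguf Q4_K_M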

Shrinking Giants: Adapting Open Source AI for Everyday Devices

Now, I will demonstrate how to deploy this 4-bit quantised Llama model locally for chat applications using Ollama. By importing my optimised model into Ollama and running inference, we can see the performance gains of quantisation first-hand.

As this post demonstrates, Ollama can deploy optimised models from external sources: you simply import a custom quantised model. Ollama also ships with several pre-optimised models out of the box for even easier on-device deployment. By quantising and importing a model myself, however, I could tune the precision and performance trade-offs to my specific hardware constraints.
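For example, if a stock model is all you need, one of Ollama's pre-optimised models can be pulled and run with a single command (the llama2 model shown here is just an illustration):

ollama run llama2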

How do you run a quantised model locally?

The answer is ollama.ai. Ollama is a tool for getting up and running with large language models locally, and it's simple to use.

For macOS, download Ollama from:

https://ollama.ai

Create a local directory to store your quantised models.

mkdir models

In this example, we will use the quantised model from the last blog post:
Llama-2-7b-chat-hf-Q4_K_M.gguf

Create a file called Modelfile with a FROM instruction.

FROM Llama-2-7b-chat-hf-Q4_K_M.gguf

There are other instructions and parameters you can use. For more details, see https://github.com/jmorganca/ollama
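As a sketch of what a fuller Modelfile can look like, the following combines the FROM instruction with a few optional settings; the PARAMETER and SYSTEM instructions are part of Ollama's Modelfile format, but the values here are purely illustrative:

FROM Llama-2-7b-chat-hf-Q4_K_M.gguf

# Optional generation settings (illustrative values)
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

# Optional system prompt applied to every conversation
SYSTEM You are a concise, helpful assistant running locally on a laptop.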

Create the model in Ollama.

ollama create Llama27bchat -f Modelfile

Run the model

ollama run Llama27bchat

Here, you have a simple chat interface for entering prompts and viewing responses.
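A session looks something like this (the answer below is illustrative, not actual model output):

>>> Why is the sky blue?
The sky looks blue because molecules in the atmosphere scatter shorter blue wavelengths of sunlight more strongly than longer red wavelengths.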

You can also access Ollama programmatically through its local REST API.

import requests, json

# Stream a chat reply from the local Ollama server (default port 11434)
r = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "Llama27bchat", "messages": [{"role": "user", "content": "Why is the sky blue?"}], "stream": True},
    stream=True,
)
for line in r.iter_lines():
    print(json.loads(line).get("message", {}).get("content", ""), end="", flush=True)
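The same endpoint can also be exercised from the command line with curl, which is a quick way to confirm the server is running (setting "stream": false returns a single JSON response; the model name matches the one created above):

curl http://localhost:11434/api/chat -d '{
  "model": "Llama27bchat",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "stream": false
}'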

Conclusion

By combining model quantisation techniques with user-friendly tools like Ollama, the latest AI conversational capabilities can now be run on standard consumer hardware like laptops.

This allows personal use cases to benefit from innovations previously only available through cloud APIs. Performance is more than adequate for applications like chatbots, question-answering, semantic search, etc.

Further improvements in model optimisation, and in tools that simplify the experience, would increase the adoption of on-device AI even more. But this already shows the potential for democratised access, as the barriers of computing constraints and complexity are lowered by quantisation and wrappers like Ollama.



David Kolb

Innovation Strategist & Coach | Cyclist 🚴‍♀️ | Photographer 📸 | IDEO U Alumni Coach