Running models with Ollama step-by-step

Gabriel Rodewald
7 min read · Mar 7, 2024


Looking for a way to quickly test an LLM without setting up the full infrastructure? Great, because that's exactly what we're about to do in this short article.

Llama2:70B-chat from Meta visualization. Image created using https://www.bing.com/images/create ☘️

You can skip to a specific paragraph if you’ve had previous experience with Ollama. What you can find in this article:

  1. What is Ollama?
  2. Installing Ollama on Windows.
  3. Running Ollama [cmd].
  4. Downloading models locally.
  5. Different models for different purposes.
  6. Running models [cmd].
  7. CPU-friendly quantized models.
  8. Integrating models from other sources.
  9. Ollama-powered (Python) apps to make devs' lives easier.
  10. Summary.

1. What is Ollama?

Ollama is an open-source, ready-to-use tool enabling seamless integration with a language model running locally or on your own server. This allows you to avoid paid commercial APIs, especially now that Meta has made Llama2 models available for commercial use, which makes them perfect for further training on your own datasets.

➡️ GitHub repository: https://github.com/ollama/ollama

➡️ Ollama official webpage: https://ollama.com

Image source: https://ollama.com

2. Installing Ollama on Windows

Ollama works seamlessly on Windows, Mac, and Linux. This quick tutorial walks you through the installation steps specifically for Windows 10. After installation, the program occupies around 384 MB. However, keep in mind that the models you download will be far larger.

If you prefer to run Ollama in a Docker container, skip the description below and go to

➡️ https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image

➡️ Go to Ollama and download .exe file: https://ollama.com

Download Ollama and install it on Windows. You have the option to use the default model save path, typically located at:

C:\Users\your_user\.ollama

However, if space is limited on the C: partition, it’s recommended to switch to an alternative directory. If you have another partition like D:\, simply:

  1. Right-click on the computer icon on your desktop.
  2. Choose Properties, then navigate to “Advanced system settings”.
  3. Click Environment variables.
  4. In User variables for … insert the absolute path to the directory where you plan to store all models. For example:
Variable: OLLAMA_MODELS
Value: D:\your_directory\models

Do not rename OLLAMA_MODELS; Ollama looks for this variable under exactly that name.
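Alternatively, the same variable can be set straight from cmd with the built-in setx command (the path below is only a placeholder; restart the terminal and Ollama afterwards so the change is picked up):

setx OLLAMA_MODELS "D:\your_directory\models"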

An Ollama icon will appear on the bottom bar in Windows. If the program doesn’t initiate, search for it in Windows programs and launch it from there.

Ollama running in background on Windows 10

Now you are ready to run Ollama and download some models :)

3. Running Ollama [cmd]

Ollama communicates via pop-up messages.

Once Ollama is set up, you can open your cmd (command line) on Windows and pull some models locally.

Ollama local dashboard (type the URL in your web browser):

http://localhost:11434/api/

Running Ollama is as simple as that. Later you will see how to use it both via cmd and from Python code.
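A quick way to confirm the server is up without opening a browser is curl, which ships with Windows 10. The base URL simply replies that Ollama is running, and the /api/tags endpoint lists the models you have pulled so far:

curl http://localhost:11434/
curl http://localhost:11434/api/tags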

A few key commands:

To check which models are locally available, type in cmd:

ollama list

To inspect a model's Modelfile, including which SHA-named weights file it points to, type in cmd (for instance, for the llama2:7b model):

ollama show --modelfile llama2:7b

To remove a model:

ollama rm llama2:7b

To serve models:

ollama serve

4. Downloading models locally

On the website ➡️ https://ollama.com/library, you’ll find numerous models ready for download, available in various parameter sizes.

Before downloading a model locally, check whether your hardware has enough memory to load it. As a rough rule of thumb from the Ollama README, you need about 8 GB of RAM for 7B models, 16 GB for 13B models, and 32 GB for 33B models. For testing, it's advisable to use the small models labeled 7B; they are adequate for integrating a model into applications.

⚠️ It is strongly recommended to have at least one GPU for smooth model operation.

Below, you’ll find several models I’ve tested and recommend. Copy and paste the commands into your command prompt to pull the specified model locally.

👉Llama2 models from Meta

A set of generative text models optimized for dialogue scenarios. Like many models on Ollama, Llama2 is offered in various configurations:

Image source: https://ollama.com/library/llama2:chat

Below are just a few examples of how to pull such models:

Standard model:

ollama pull llama2

Uncensored version:

ollama pull llama2-uncensored:7b

Chat 7B model:

ollama pull llama2:7b-chat

➡️ Read more: https://llama.meta.com/llama2

👉Gemma from Google

A family of lightweight open models delivering robust performance comparable to other leading 7B models.

ollama pull gemma:7b

➡️ Read more: https://blog.google/technology/developers/gemma-open-models/

👉LLaVa from Haotian Liu et al.

A multimodal model that excels in handling image-to-text descriptions while providing robust support for both vision and language models.

ollama pull llava

➡️ Read more: https://llava-vl.github.io/

5. Different models for different purposes

Llama2:70B-chat from Meta visualization. Image created using https://www.bing.com/images/create ☘️

Some models were trained on specific datasets, making them better suited for particular tasks such as code completion, conversation, or image-to-text processing. On Ollama, you’ll find models designed for various purposes.

The first group is focused on conversation, text completion, and summarization, and includes models like Gemma, Llama2, Falcon, or OpenChat.

Some examples:

➡️ https://ollama.com/library/falcon

➡️ https://ollama.com/library/gemma

➡️ https://ollama.com/library/openchat

The next group comprises multimodal models capable of engaging in conversations, acting as chatbots, describing images (vision models), summarizing texts, and powering Question-Answer (Q/A) applications.

Some examples:

➡️ https://ollama.com/library/llava

➡️ https://ollama.com/library/bakllava

The last, highly specialized group supports developers' work, featuring models available on Ollama such as codellama, dolphin-mistral, and dolphin-mixtral ("a fine-tuned model based on the Mixtral mixture of experts model that excels at coding tasks"), with many more continuously added by creators.

Some examples:

➡️ https://ollama.com/library/codellama

➡️ https://ollama.com/library/dolphin-mistral

➡️ https://ollama.com/library/dolphin-mixtral

6. Running models [cmd]

To run a downloaded model, simply type ollama run model_name:tag "your prompt", for instance:

ollama run llama2:7b "your prompt"

Multimodal models allow you to include files, paths to local images, and more, extending beyond a basic prompt.
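For example, with the llava model pulled, you can reference a local image directly in the prompt (the path below is only a placeholder):

ollama run llava "Describe this image: D:\pictures\example.jpg"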

7. CPU-friendly quantized models

Quantization is all about reducing the precision of the model's weights at the cost of losing some of the model's accuracy. A detailed explanation can be found in this great article, which walks readers through the process step by step and builds intuition for what is happening behind the scenes:

📃 What are Quantized LLMs? (by Miguel Carreira Neves) :

➡️ https://www.tensorops.ai/post/what-are-quantized-llms

Additional reading:

📃 Extreme Compression of Large Language Models via Additive Quantization:

➡️ https://arxiv.org/html/2401.06118v2

📃SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models:

➡️ https://arxiv.org/pdf/2211.10438.pdf

📃 BiLLM: Pushing the Limit of Post-Training Quantization for LLMs:

➡️ https://arxiv.org/pdf/2402.04291.pdf

In simple terms, quantization adjusts weight precision, decreases model size, and allows running on less powerful hardware without significant accuracy loss.
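A rough back-of-the-envelope calculation shows why this matters: a 7B-parameter model stored as 16-bit floats needs about 7 × 10⁹ × 2 bytes ≈ 14 GB, while a 4-bit quantized version needs about 7 × 10⁹ × 0.5 bytes ≈ 3.5 GB, plus some overhead in both cases.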

In the image accompanying this article, you can observe that post-quantization, the models occupy considerably less space than in their original version:

Image source: https://www.tensorops.ai/post/what-are-quantized-llms

Ollama supports quantized models out of the box, relieving you of the burden of quantizing them yourself.
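Many library models expose specific quantization levels as tags. For instance, the command below pulls a 4-bit chat variant of Llama2 (check the tags list on the model's library page for the exact names available):

ollama pull llama2:7b-chat-q4_0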

8. Integrating models from other sources

Llama2:70B-chat from Meta visualization. Image created using https://www.bing.com/images/create ☘️

Although the models on Ollama offer versatility, not all of them are currently accessible. However, integrating your own model locally is a straightforward process. Let’s explore how to incorporate a new model into our local Ollama.

Numerous quantized models are available on The Bloke's HuggingFace account. As a medical-domain example, we can conveniently opt for the medicine-chat GGUF model at:

➡️https://huggingface.co/TheBloke/medicine-chat-GGUF

Open the aforementioned link and click Files and versions.

Image source: https://huggingface.co/TheBloke/medicine-chat-GGUF/tree/main

Download the model file that you would like to add to your Ollama models:

Image source: https://huggingface.co/TheBloke/medicine-chat-GGUF/tree/main

Create an empty file named Modelfile and insert the data shown below, making sure to substitute the path with the absolute path where the downloaded model is stored. The example is minimal and can be expanded with multiple options such as the model's temperature, system message, and many others; if needed, remove the '#' from the file to activate those options.

FROM D:\...\medicine-chat.Q4_0.gguf
# PARAMETER temperature 0.6
# SYSTEM """You are a helpful medicine assistant."""

Save the Modelfile. Then type in cmd:

ollama create model_name -f Modelfile
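Once created, the model appears in ollama list and can be run like any other local model (model_name is whatever name you passed to ollama create, and the prompt is just an example):

ollama run model_name "What are the common symptoms of the flu?"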

9. Ollama-powered (Python) apps to make devs' lives easier

Ollama running in the background is accessible like any regular REST API. Therefore it is easy to integrate it within an application using libraries like requests, or more fully featured frameworks like FastAPI, Flask, or Django.

Easy pip install for the Ollama Python package from

➡️ https://pypi.org/project/ollama/0.1.3:

pip install ollama

Generating an embedding directly from Python code:

import ollama

embedding = ollama.embeddings(model="llama2:7b", prompt="Hello Ollama!")

Or simply by using curl:

curl http://localhost:11434/api/embeddings -d '{
"model": "llama2:7b",
"prompt": "Here is an article about llamas..."
}'

To read more about Ollama endpoints, please visit:

➡️ https://github.com/ollama/ollama/blob/main/docs/api.md
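As a minimal sketch of text generation over the REST API, assuming the llama2:7b model has already been pulled and the server is running on the default port, you can call the /api/generate endpoint with streaming disabled:

import requests

# Ask the local Ollama server for a single, non-streamed completion
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2:7b",
        "prompt": "Why is the sky blue?",
        "stream": False,  # return one JSON object instead of a stream of chunks
    },
)
print(response.json()["response"])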

Ollama has also been seamlessly integrated into the LangChain framework, streamlining our coding efforts and making the technical side of our work even more straightforward:

➡️ https://python.langchain.com/docs/integrations/llms/ollama

Let’s appreciate the simplicity of creating an embedding:

# pip install langchain_community
from langchain_community.embeddings import OllamaEmbeddings


embed = OllamaEmbeddings(model="llama2:7b")
embedding = embed.embed_query("Hello Ollama!")
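Generating text through LangChain is just as concise. Below is a minimal sketch using the community Ollama LLM wrapper, assuming the llama2:7b model is available locally:

# pip install langchain_community
from langchain_community.llms import Ollama

# Wrap the locally served model and send it a single prompt
llm = Ollama(model="llama2:7b")
answer = llm.invoke("Hello Ollama!")
print(answer)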

10. Summary

This article guides you through running models with Ollama step by step, offering a seamless way to test LLMs without a full infrastructure setup.

Ollama, an open-source tool, facilitates local or server-based language model integration, allowing free usage of Meta’s Llama2 models. The installation process on Windows is explained, and details on running Ollama via the command line are provided.

The article explores downloading models, diverse model options for specific tasks, running models with various commands, CPU-friendly quantized models, and integrating external models. Additionally, Ollama-powered Python applications are highlighted for developers’ convenience.
