How to Run LLMs Locally with Ollama

Roy Ben Yosef
CyberArk Engineering
8 min read · Mar 27, 2024
Llama manipulating models as light orbs
Generated with ChatGPT

LLMs (large language models) are everywhere nowadays, and even if you are living under a rock, you’ve definitely heard about ChatGPT. Although there are many commercial models available, such as OpenAI’s GPT models, Anthropic’s Claude, and Google’s Gemini, there are also many open-source models that you can run on your own machine.

There are many ways you can run open-source LLMs locally, but I want to focus on my favorite way — using Ollama. We can either use Ollama’s curated models, or bring in custom models.

Running open-source models has many benefits, including:

  • Running models that are not available as a service elsewhere.
  • Increased privacy. When confidentiality matters, running locally ensures your data never leaves your machine.
  • Reduced costs. You might be able to cut expenses by running a smaller model that can satisfy your requirements and perform well on your tasks.

Open-Source LLMs: Does Size Matter?

In this fast-paced world of AI innovation, we are witnessing a flurry of open-source models, from larger ones (e.g. Llama 2 70B) to much smaller ones. These open-source models are also becoming more and more efficient: some of the small models require less hardware and consume less energy, yet are still considered to perform very well despite their size.

The appeal of a smaller model is that it requires fewer compute resources, consumes less energy, and is generally more efficient. You also control how it scales instead of depending on a provider’s allowances. If such a model also performs well on the tasks you use it for, that is a great win.

A few notable and popular open-source models are:

  • Llama 2 70B: a general-purpose 70 billion parameter model by Meta.
  • Phi-2: a small, 2.7 billion parameter model by Microsoft that is considered powerful relative to its size and is recommended for weaker hardware.
  • Mistral 7B: a small, 7 billion parameter model by Mistral.
  • Mixtral 8x7B: a “mixture-of-experts” model, considered very powerful.

As Mistral themselves put it:

“Mixtral has 46.7B total parameters but only uses 12.9B parameters per token. It, therefore, processes input and generates output at the same speed and for the same cost as a 12.9B model.”

There are a vast number of other open-source models, each trained on different data and using different techniques meant to help it perform well while keeping its size as small as possible.

The Advantages of Running Ollama

Ollama is an extremely popular open-source project that lets you run large language models locally.

I highly recommend it for running LLMs locally because, among other things:

  • It is actively maintained and updated.
  • It has a large and active community, which you can join on Discord.
  • It offers a Docker container that you can run if you prefer not to install it locally.
  • It is simple to use.
  • It supports many models that you can run; new models are added all the time, and you can bring in custom models as well.
  • It has an excellent third-party, open-source front-end called “Ollama Web-UI” that you can use.
  • It supports multi-modal models.

You can find the project on GitHub.

For the front-end, we are going to use Ollama Web-UI.

Running Ollama Locally in 4 Steps

1. Setting Up Ollama

You can easily install Ollama locally on macOS and Linux. A Windows version is available in preview, or you can install it on Windows via WSL2 (Linux on Windows).
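If you do want a native install on Linux, the project ships a one-line install script (check the repo for the up-to-date command; this is the one documented at the time of writing), and on macOS you can simply download the app from the Ollama website:

curl -fsSL https://ollama.com/install.sh | sh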

That said, I recommend running it in a Docker container. This keeps the installation independent of your machine.

Both Ollama and Ollama Web-UI have Docker images that we’re going to use.

You can find the full installation instructions for Ollama in the repo and on the Docker Hub page, and for Ollama Web-UI here.

NOTE: All ollama commands below assume you’re running Ollama in Docker; otherwise, drop the “docker exec -it ollama” prefix.
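For example, these two commands do the same thing, depending on how Ollama is running:

docker exec -it ollama ollama list   # Ollama running in the Docker container named "ollama"
ollama list                          # Ollama installed natively on the host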

Prerequisites

Decide if you want to run Ollama with or without a GPU:

Install Ollama without a GPU

If you want to run using your CPU, which is the simplest way to get started, then run this command:

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

This runs a Docker container with Ollama, maps a volume for the models we’re going to use, and maps port 11434 so that the front-end can connect to it.

Install Ollama with a GPU

If, instead, you want to leverage your GPU (which makes sense if you have a powerful one), then do the following. Note that this assumes the NVIDIA Container Toolkit, which provides the nvidia-ctk command, is already installed:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker # on windows - restart the docker engine from the windows host instead

On Windows, run these commands in the WSL2 terminal. Instead of running the “restart docker” command, simply restart the Docker engine from the host.

  • Start the container (with GPU):
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

When you run the models, you can verify that this works by checking GPU usage during model operation (e.g. in Windows, use Task Manager).
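Another quick sanity check (my own suggestion, not part of the official instructions) is to confirm that Docker itself can see the GPU by running nvidia-smi in a throwaway container:

docker run --rm --gpus=all ubuntu nvidia-smi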

Verifying Ollama is Running

Verify that your container is up by running the help command in Ollama. Note that the first “ollama” is the container name, and the second one is the CLI command:

docker exec -it ollama ollama help
Ollama help command output
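You can also check that the Ollama API itself is reachable on the mapped port; the server answers with a short “Ollama is running” message:

curl http://localhost:11434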

2. Running Ollama Web-UI

According to the documentation, we will run the Ollama Web-UI Docker container and connect it to our instance of Ollama. There are other ways, like using docker-compose, which you can see in the docs.

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v ollama-webui:/app/backend/data --name ollama-webui --restart always ghcr.io/ollama-webui/ollama-webui:main

This command will, among other things, run the Ollama Web-UI container named “ollama-webui”, map container port 8080 to host port 3000, and map a volume to store files persistently.

  • You can test it by going to http://127.0.0.1:3000/
  • Create an account (it’s all local) by clicking “Sign up”, and then log in.
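If the page doesn’t come up, the container logs are the first place to look (using the container name from the command above):

docker logs ollama-webui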

3. Picking a Model to Run

Let’s head over to Ollama’s models library and see what models are available. Each model page has information about the model, including a link to the Hugging Face page with a lot of additional information and resources.

You can think of Hugging Face as the GitHub for LLMs (though it is much more than that).

We are going to run Mixtral 8x7B, which is considered relatively small but powerful:

docker exec -it ollama ollama run mixtral

The first time you run it, it has to pull the model, so it might take some time depending on the size of the model you’re pulling.
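If you would rather download a model without opening an interactive chat session, or see which models are already installed, ollama pull and ollama list cover that (again assuming the Docker setup from above):

docker exec -it ollama ollama pull mixtral
docker exec -it ollama ollama list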

We could also do the same from the UI, by clicking the gear button next to the model selection:

Web-UI model selection

This allows you to manage models:

Ollama Web-UI model settings

About System Requirements

You might ask yourself: what system do I need to run these models?

There isn’t a clear-cut answer to this, as this issue suggests. Each model is different, and larger models are more hardware intensive. You can harness your GPU or your CPU, and you will need enough RAM to load the model.

Some model pages list requirements, such as minimum RAM, but not all of them do. The best way is to give it a try and see for yourself.

4. Using the Model

I selected the “mixtral” model and asked it to write an interesting piece of Python code.

Prompting Mixtral for interesting Python code

Nice! But trying to run it takes forever… Here’s the code, in case you want to see if you can find the problem yourself 😊.

Worry not! Let’s ask Mixtral to fix it for us:

“This code takes forever to run, please fix it”

Prompting Mixtral to fix its interesting Python code

The code can be found here.

And this time it works. How lovely!

Random maze generation output
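As a side note, you don’t have to go through the Web-UI at all: Ollama exposes a REST API on the mapped port 11434, so you can send prompts straight from a terminal. A minimal sketch (adjust the model name and prompt to taste):

curl http://localhost:11434/api/generate -d '{
  "model": "mixtral",
  "prompt": "Write an interesting piece of Python code",
  "stream": false
}'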

A Word on Storage

These models are not light on storage. The volumes that Docker creates are persistent, so if you’re worried about free space, you need to make sure to clean up when you’re done.

You can view storage information by running these commands:

docker system df
docker system df -v
Docker volume information
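To reclaim space, you can remove individual models from Ollama, or remove the Docker volumes entirely once you no longer need them (the names below match the ones used in the docker run commands above):

docker exec -it ollama ollama rm mixtral
docker rm -f ollama ollama-webui        # stop and remove the containers first
docker volume rm ollama ollama-webui    # then delete their volumes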

Using Multi-Modal Models

Ollama multi-modal models

Ollama even supports multi-modal models, such as those with “vision” capabilities, like in the image above 😁.

Let’s give the llava 34b model a go:

docker exec -it ollama ollama run llava:34b
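You can also do this from the command line by including an image path in the prompt. With the Docker setup, the image first needs to be copied into the container (the file name here is just an example):

docker cp ./dog-in-a-box.jpg ollama:/tmp/dog-in-a-box.jpg
docker exec -it ollama ollama run llava:34b "What is in this image? /tmp/dog-in-a-box.jpg"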

Ollama Web-UI also supports multi-modal models. You can even paste the image into the text field!

I’ll refresh the UI, select the “Llava” model, and then ask the model what it sees:

Asking Llava to explain my dog-in-a-box picture

That is so cool!

Indeed it is my dog Gizmo chilling in the box. Too cool.

A note on security: never use your pet’s name as an answer to security questions. That goes against common sense 😊.

Loading Additional Models

You can also load models that are not part of the Ollama model library.

This is done by loading a GGUF format model, as described in the docs. GGUF is a modern language model file format, and you can read more about it here.

One place you can get GGUF-format models is from TheBloke on Hugging Face, who curates thousands of such models. Look for the GGUF versions of the models you are interested in.

You could even convert models to GGUF on your own if you wish, but using TheBloke models saves you the trouble.
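A rough sketch of that workflow when Ollama runs in Docker (the GGUF file and model names below are just examples; copying into the container is needed because ollama create runs inside it):

docker cp ./mistral-7b-instruct.Q4_K_M.gguf ollama:/tmp/model.gguf
printf 'FROM /tmp/model.gguf\n' > Modelfile                         # Modelfile points at the GGUF file
docker cp ./Modelfile ollama:/tmp/Modelfile
docker exec -it ollama ollama create my-custom-model -f /tmp/Modelfile
docker exec -it ollama ollama run my-custom-model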

A note on quantization: if you do look for custom models, you will notice that each model has several versions. This is usually due to different quantization methods, which, in simple terms, store the model’s weights at lower precision. This can result in smaller model versions that are easier to run and are still (hopefully) accurate enough.

Summary: Using Ollama To Run Local LLMs

New open-source models with great capabilities are released constantly. Ollama and Ollama Web-UI allow you to easily run such models on your own machine with your own hardware, without depending on service providers. It’s fun to play with these models, and you also get to run any open-source model you want with full control and privacy over the software you run while doing so.
