Running LLaMA 3 Model with NVIDIA GPU Using Ollama Docker on RHEL 9

Sean Zheng
3 min read · Apr 24, 2024


Harnessing the power of NVIDIA GPUs for AI and machine learning tasks can significantly boost performance. This guide walks you through running the LLaMA 3 model on a Red Hat Enterprise Linux (RHEL) 9 system using the Ollama Docker image, leveraging an NVIDIA GPU for faster inference.

Introduction

With the right setup, including the NVIDIA driver and CUDA toolkit, running large language models (LLMs) on a GPU becomes feasible. This post details how to achieve this on a RHEL 9.3 workstation.

Prerequisites

Ensure your system is ready with the following:

  • NVIDIA driver and CUDA toolkit installed.
  • Docker Engine installed and running.
  • NVIDIA Container Toolkit installed (required to pass the GPU through to containers; see step 2).

Workstation specs:

  • OS: RHEL 9.3
  • RAM: 128 GB
  • CPU: Intel Core i9-12900K (16C/24T)
  • GPU: NVIDIA RTX 4060 Ti 16GB

For this setup, I’ll be accessing my workstation via SSH from a MacBook.

Step-by-Step Guide

1. Preparing the Environment

First, confirm that your system meets the necessary requirements, including the installation of the NVIDIA driver and CUDA toolkit.
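A quick sanity check from the shell confirms both. nvidia-smi should report your GPU (the RTX 4060 Ti here), and nvcc should print the installed CUDA version:

# Verify the NVIDIA driver sees the GPU
nvidia-smi

# Verify the CUDA toolkit is installed
nvcc --version

If either command fails, install or repair the driver and toolkit before continuing.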

2. Running Ollama Docker

The Ollama Docker container can be run in different modes, depending on whether you want to utilize the CPU or GPU. Here’s how to get started:

Starting with the CPU-Only Version

Initially, you might want to run the CPU-only version to ensure everything is set up correctly:

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

This command runs the container in detached mode (-d), mounts a named Docker volume (ollama) for model storage, and publishes port 11434.
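Before moving on, you can confirm the server is up by checking the container logs or hitting the API root, which replies with a plain-text “Ollama is running”:

docker logs ollama
curl http://localhost:11434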

Alternatively, to keep things clean while experimenting, you can run the container interactively with --rm, so no volume is mounted and the container is removed when it exits:

docker run -it --rm -p 11434:11434 --name ollama ollama/ollama

Transitioning to GPU Acceleration

To leverage the GPU for improved performance, modify the Docker run command as follows:

docker run -it --rm --gpus=all -v /home/ollama:/root/.ollama:z -p 11434:11434 --name ollama ollama/ollama

This command gives the container access to all available GPUs and bind-mounts the /home/ollama directory for model storage; the :z suffix relabels the mount so SELinux, which RHEL enforces by default, allows the container to access it.
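Note that --gpus=all only works when the NVIDIA Container Toolkit is installed on the host. If it isn’t yet, the following commands, taken from NVIDIA’s install documentation for RHEL-family systems (the repository URL may change over time), set it up:

# Add NVIDIA's container toolkit repository and install the toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

With the container running, you can verify from a second terminal that it actually sees the GPU:

docker exec -it ollama nvidia-smi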

3. Downloading and Running the Model

With the Ollama Docker container up and running, the next step is to download the LLaMA 3 model:

docker exec -it ollama ollama pull llama3

After downloading, you can list the available models and run the desired one:

docker exec -it ollama ollama list
docker exec -it ollama ollama run llama3
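The run command drops you into an interactive prompt; type a question and enter /bye to exit. A session looks roughly like this (the model output shown is illustrative and will vary):

>>> What is Red Hat Enterprise Linux?
Red Hat Enterprise Linux (RHEL) is a commercial Linux distribution developed by Red Hat...

>>> /bye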

4. Testing the Setup

To test the model, you can use a curl command to send a request to the API:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Please list five Taiwanese delicacies",
  "stream": true,
  "options": {
    "seed": 123,
    "top_k": 20,
    "top_p": 0.9,
    "temperature": 0
  }
}'
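With "stream": true, the API returns one JSON object per generated token. If you would rather receive a single consolidated reply, set "stream": false and read the response field of the returned JSON:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Please list five Taiwanese delicacies",
  "stream": false
}'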

Conclusion

By following these steps, you should now have the LLaMA 3 model running on your RHEL 9 system with NVIDIA GPU acceleration. This setup demonstrates the significant performance gains available when leveraging GPUs for AI and machine learning tasks. Whether through CLI commands or API calls, you’re now equipped to explore further applications and integrations.
