Run LLMs Locally with GPU Acceleration: Step-by-Step Guide to Setting Up NVIDIA GPU Operator, Ollama, and Open WebUI on Kubernetes

Süleyman Kütükoğlu
6 min read · Nov 12, 2024


Hello!

The concept of Large Language Models (LLMs) has been gaining significant traction ever since tools like ChatGPT entered our lives. Many of us are curious about how we can leverage the power of these models in our own environments.

In this post, I’ll walk you through the process of setting up NVIDIA GPU Operator, Ollama, and Open WebUI on a Kubernetes cluster with an NVIDIA GPU.

By the end, you’ll have everything set up and be able to test with a model yourself. The steps may sound a bit technical, but I assure you they are straightforward with the right guidance, and I’ll be here with you every step of the way.

First, let’s briefly look at what Ollama and Open WebUI are.

Ollama: A tool for running and managing LLMs locally. Ollama makes it easy for developers and users to install, configure, and use these models, providing a simple command line and API for running them efficiently in a local environment. Because everything runs on your own infrastructure, your data stays private.

Open WebUI: An open-source web interface for user-friendly interaction with large language models. It lets you experiment with and use LLMs from a web browser, with no command-line knowledge required. Since it is open source, it can be customized and extended.

Prerequisites:

  • One Kubernetes cluster (I’m running version 1.31 in this demo)
  • A worker node in the cluster with an NVIDIA GPU (I’m using an A40 in this demo)
  • Helm
  • Kubectl
  • Optionally, NGINX Ingress Controller and Cert-Manager for ingress access.
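Before starting, you can run a quick sanity check from the terminal you will use for the rest of this guide. This is only a minimal sketch; the exact versions matter less than the cluster being reachable.

kubectl version             # client and server versions; the cluster should respond
helm version                # Helm 3.x
kubectl get nodes -o wide   # the GPU worker node should be listed and Ready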

Step 1: Installing NVIDIA GPU Operator

First, add the official NVIDIA Helm repo using the following command:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

Then, install it in the gpu-operator namespace:

helm install --wait nvidia-gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator
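By default, the GPU Operator also deploys the NVIDIA driver and container toolkit as containers on the GPU nodes. If your worker node already has the NVIDIA driver installed on the host, you can tell the operator to skip the driver with the chart's driver.enabled value; otherwise leave the default command above as is:

helm install --wait nvidia-gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false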

After waiting for a while, all pods listed by kubectl get po -n gpu-operator should be in the Running or Completed state, like this:

Successful GPU Operator Installation

Once the NVIDIA GPU Operator is successfully installed, GPU-related labels and annotations like the ones below are automatically added to the worker node with the NVIDIA GPU.

Node Annotation
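You can also verify from the command line that the GPU is now advertised as an allocatable resource (replace <gpu-node-name> with your worker node's name):

kubectl describe node <gpu-node-name> | grep -i nvidia.com/gpu
# nvidia.com/gpu should appear under Capacity and Allocatable with a count matching the GPUs on the node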

Next, let’s install Ollama.

Step 2: Installing Ollama

First, create a namespace for Ollama:

kubectl create ns ollama

Then, apply the following Deployment and Service YAMLs:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          env:
            - name: PATH
              value: /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
            - name: LD_LIBRARY_PATH
              value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: 1
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    name: ollama
  ports:
    - port: 80
      name: http
      targetPort: http
      protocol: TCP
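Before putting an Ingress in front of Ollama, it is worth confirming that the pod has scheduled onto the GPU node and that the API responds. A quick check via port-forward; the /api/tags endpoint lists the models Ollama has pulled, which is empty at this point:

kubectl -n ollama get pods -o wide
kubectl -n ollama port-forward svc/ollama 11434:80
# in a second terminal:
curl http://localhost:11434/          # expected: "Ollama is running"
curl http://localhost:11434/api/tags  # expected: an empty model list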

To access the Ollama API via HTTPS and a domain, edit and apply the following Ingress according to your setup:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ollama
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx  # match your ingress controller's IngressClass
  tls:
    - hosts:
        - ollama-api.suleyman.academy
      secretName: ollama-tls
  rules:
    - host: ollama-api.suleyman.academy
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama
                port:
                  number: 80

If you see the "Ollama is running" response below when visiting ollama-api.suleyman.academy, then the setup is successful.

Successful Ollama Installation
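The same check works with curl once the DNS record for ollama-api.suleyman.academy points at your ingress controller and cert-manager has issued the certificate:

curl https://ollama-api.suleyman.academy/            # expected: "Ollama is running"
curl https://ollama-api.suleyman.academy/api/version # expected: the running Ollama version as JSON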

With Ollama successfully installed, let’s proceed to the Open WebUI installation for easier usage.

Step 3: Installing Open WebUI

First, add the Open WebUI Helm repository:

helm repo add openwebui https://helm.openwebui.com
helm repo update

Then, customize and run the following Helm install command:

helm install open-webui openwebui/open-webui \
  --set ollama.enabled=false \
  --set ingress.enabled=true \
  --set ingress.class="nginx" \
  --set ingress.annotations."cert-manager\.io/cluster-issuer"="letsencrypt-prod" \
  --set ingress.host="chat.suleyman.academy" \
  --set ingress.tls=true \
  --set ingress.existingSecret="openwebui-tls"
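After the command completes, you can confirm that the release, pods, and ingress were created (note that without an -n flag the chart installs into your current namespace):

helm status open-webui
kubectl get pods
kubectl get ingress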

After waiting for a while, visit https://chat.suleyman.academy and complete the admin registration by clicking the "Sign Up" button.

Successful Open WebUI Installation

Now, let’s connect Ollama to Open WebUI. In the Open WebUI admin settings, open the Connections section and enter your Ollama API address as the Ollama API URL (in my case it is https://ollama-api.suleyman.academy).

If Open WebUI reports that the server connection was verified after saving, the Ollama connection has been successfully established.

To test the setup, let’s download the Mistral 7B model and perform a chat test. You can see other available models at https://github.com/ollama/ollama.

Next, download the Mistral 7B model to the Ollama instance running on the Kubernetes cluster. You can pull it from the model management section of Open WebUI's admin settings, or from the command line as shown below.
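A command-line alternative is to run the pull directly inside the Ollama pod (kubectl exec against the Deployment picks one of its pods):

kubectl -n ollama exec -it deploy/ollama -- ollama pull mistral:7b
kubectl -n ollama exec deploy/ollama -- ollama list   # the model should appear here once downloaded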

After successfully downloading the model, let’s perform a chat test by following the steps below.

Click New Chat, switch the model to mistral:7b, and then type your message.

As you can see, we received a successful response.
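If you would rather test from the command line, the same kind of request can be sent straight to the Ollama API. A minimal, non-streaming example against the /api/generate endpoint; the prompt is just an illustration:

curl https://ollama-api.suleyman.academy/api/generate -d '{
  "model": "mistral:7b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'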

Additionally, the screenshot below shows that a request was made to the Ollama pod running on Kubernetes.
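You can watch those requests arrive yourself by tailing the pod's logs while chatting in Open WebUI:

kubectl -n ollama logs deploy/ollama -f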

Conclusion:

And that’s it!

If you’ve followed along, you should now have a fully functioning setup of NVIDIA GPU Operator, Ollama, and Open WebUI. We even downloaded and tested the Mistral 7B model successfully. Running these models locally gives you more control, keeps your data private, and offers great flexibility — all while making it easy to experiment.

I hope you found this guide useful and feel confident exploring further. If you have any questions or insights to share, feel free to leave a comment below.

Happy experimenting, and enjoy the world of LLMs in your Kubernetes cluster!

