Run LLMs Locally with GPU Acceleration: Step-by-Step Guide to Setting Up NVIDIA GPU Operator, Ollama, and Open WebUI on Kubernetes
Hello!
The concept of Large Language Models (LLMs) has been gaining significant traction ever since tools like ChatGPT entered our lives. Many of us are curious about how we can leverage the power of these models in our own environments.
In this post, I’ll walk you through the process of setting up NVIDIA GPU Operator, Ollama, and Open WebUI on a Kubernetes cluster with an NVIDIA GPU.
By the end, you’ll have everything set up and will be able to test a model yourself. The steps may sound a bit technical, but I assure you they are straightforward with the right guidance, and I’ll be here with you every step of the way.
First, let’s briefly look at what Ollama and Open WebUI are.
Ollama: A tool for running and managing LLMs locally. Ollama makes it easy for developers and users to install, configure, and use these models, providing a simple command line and API for running them efficiently in a local environment. Because everything runs on your own infrastructure, your data stays private.
Open WebUI: An open-source web interface for user-friendly interaction with large language models. Open WebUI lets you work with language models from a web browser, so you can experiment with them without any command-line knowledge. Since it is open source, it can be customized and extended.
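To give a concrete sense of what Ollama’s command line and API look like, here is a minimal sketch on a machine where Ollama is installed locally (the model name is just an example):

ollama pull mistral:7b   # download a model
ollama run mistral:7b    # chat with it interactively
curl http://localhost:11434/api/generate -d '{"model": "mistral:7b", "prompt": "Why is the sky blue?"}'

This same HTTP API is what Open WebUI talks to, and it is exactly how we will wire the two together on Kubernetes.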
Prerequisites:
- One Kubernetes cluster (I’m running version 1.31 in this demo)
- A worker node in the cluster with an NVIDIA GPU (I’m using an A40 in this demo)
- Helm
- kubectl
- Optionally, NGINX Ingress Controller and Cert-Manager for ingress access.
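Before starting, it’s worth a quick sanity check that the tools are in place and the GPU node is visible in the cluster (these commands are just a suggested check):

kubectl get nodes -o wide
helm version --short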
Step 1: Installing NVIDIA GPU Operator
First, add the official NVIDIA Helm repo using the following command:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
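You can confirm the repository was added correctly by searching for the chart:

helm search repo nvidia/gpu-operator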
Then, install it in the gpu-operator namespace:
helm install --wait nvidia-gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator
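A note in case your setup differs: by default the operator also deploys the NVIDIA driver as a container. If the driver is already installed on the node, you can tell the chart to skip it; a hedged example (check the GPU Operator documentation for the values supported by your chart version):

helm install --wait nvidia-gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator \
  --set driver.enabled=false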
After waiting a while, the output of kubectl get po -n gpu-operator should show all of the operator’s pods in the Running or Completed state.
Once the NVIDIA GPU Operator is successfully installed, GPU-related labels (such as nvidia.com/gpu.present=true) are automatically added to the worker node that has an NVIDIA GPU.
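You can verify this yourself by inspecting the node (replace <gpu-node> with your worker node’s name):

kubectl get node <gpu-node> --show-labels | tr ',' '\n' | grep nvidia.com
kubectl describe node <gpu-node> | grep -A 7 "Allocatable"

The Allocatable section should now list nvidia.com/gpu: 1 (or however many GPUs the node has).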
Next, let’s install Ollama.
Step 2: Installing Ollama
First, create a namespace for Ollama:
kubectl create ns ollama
Then, apply the following Deployment and Service YAMLs:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          env:
            - name: PATH
              value: /usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
            - name: LD_LIBRARY_PATH
              value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: 1
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    name: ollama
  ports:
    - port: 80
      name: http
      targetPort: http
      protocol: TCP
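Before moving to the Ingress, you can already check that Ollama is reachable inside the cluster. A simple sketch using kubectl port-forward (the Service listens on port 80 and forwards to the container’s 11434; run the curl commands in a second terminal):

kubectl -n ollama port-forward svc/ollama 11434:80
curl http://localhost:11434          # should answer "Ollama is running"
curl http://localhost:11434/api/tags # lists downloaded models (empty for now)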
To access the Ollama API via HTTPS and a domain, edit and apply the following Ingress according to your setup:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ollama
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
    - hosts:
        - ollama-api.suleyman.academy
      secretName: ollama-tls
  rules:
    - host: ollama-api.suleyman.academy
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama
                port:
                  number: 80
If you see the “Ollama is running” response when visiting ollama-api.suleyman.academy, the setup is successful.
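If you prefer the terminal, the /api/version endpoint gives a quick confirmation through the Ingress as well:

curl https://ollama-api.suleyman.academy/api/version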
With Ollama successfully installed, let’s proceed to the Open WebUI installation for easier usage.
Step 3: Installing Open WebUI
First, add the Open WebUI Helm repository:
helm repo add openwebui https://helm.openwebui.com
helm repo update
Then, customize and run the following Helm install command:
helm install open-webui openwebui/open-webui \
  --set ollama.enabled=false \
  --set ingress.enabled=true \
  --set ingress.class="nginx" \
  --set ingress.annotations."cert-manager\.io/cluster-issuer"="letsencrypt-prod" \
  --set ingress.host="chat.suleyman.academy" \
  --set ingress.tls=true \
  --set ingress.existingSecret="openwebui-tls"
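You can check that the release came up correctly before opening the browser:

helm status open-webui
kubectl get pods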
After waiting for a while, visit https://chat.suleyman.academy and complete the admin registration by clicking the “Sign Up” button.
Now, let’s connect Ollama to Open WebUI. In the admin settings, open the connections section and enter your Ollama API address (in my case, https://ollama-api.suleyman.academy). Once the connection check passes, the Ollama connection has been successfully established.
To test the setup, let’s download the Mistral 7B model and perform a chat test. You can see other available models at https://github.com/ollama/ollama.
Follow the steps below to download the Mistral 7B model to Ollama running on the Kubernetes cluster.
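If you would rather do this from the terminal instead of the UI, the same download can be triggered directly on the Ollama pod or through its API; a minimal sketch:

kubectl -n ollama exec -it deploy/ollama -- ollama pull mistral:7b
# or, through the API exposed by the Ingress:
curl https://ollama-api.suleyman.academy/api/pull -d '{"name": "mistral:7b"}'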
After successfully downloading the model, let’s perform a chat test by following the steps below.
Click “New Chat”, switch the model to mistral:7b, and then type your message.
As you can see, we received a successful response.
Additionally, you can confirm that the request was handled by the Ollama pod running on Kubernetes by checking its logs.
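If you want to reproduce that check, the pod’s logs and a direct API call both make it visible (using the deployment and host names from earlier):

kubectl -n ollama logs deploy/ollama --tail=20
curl https://ollama-api.suleyman.academy/api/chat -d '{
  "model": "mistral:7b",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'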
Conclusion:
And that’s it!
If you’ve followed along, you should now have a fully functioning setup of NVIDIA GPU Operator, Ollama, and Open WebUI. We even downloaded and tested the Mistral 7B model successfully. Running these models locally gives you more control, keeps your data private, and offers great flexibility — all while making it easy to experiment.
I hope you found this guide useful and feel confident exploring further. If you have any questions or insights to share, feel free to leave a comment below.
Happy experimenting, and enjoy the world of LLMs in your Kubernetes cluster!
References:
To dive deeper into the tools discussed and find further documentation, here are some helpful links:
- Ollama GitHub Repository: Explore available models and get more detailed information about Ollama.
- NVIDIA GPU Operator Documentation: Find official documentation to help you better understand the NVIDIA GPU Operator.
- Open WebUI GitHub Repository: Learn more about Open WebUI and see how you can customize or extend its features.