Serving AI models on the edge — Using Nvidia GPU with k3s on AWS — Part 4
Introduction
This is part 4 of the series “Serving AI models on the edge”.
Here we focus on using k3s to serve AI models, since it is a lightweight Kubernetes distribution that is widely used on edge platforms.
Overview
The Nvidia Container Toolkit, together with the Nvidia GPU Operator, enables a Kubernetes/k3s cluster to access the GPU and run CUDA operations. At a high level, the operator installs and manages the software components each cluster node needs to expose its GPUs to pods.
Steps
We use AWS EC2 as an example, but the following steps apply to any Nvidia-enabled Ubuntu VM node on any infrastructure.
Picking an Nvidia-enabled GPU AMI
First, we pick an AMI that already comes with the Nvidia CUDA Toolkit, the GPU driver, and the Nvidia Container Toolkit. This can be done by browsing the AMI catalog and searching for a Deep Learning AMI.
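As an alternative to the console, the AMI catalog can also be searched from the AWS CLI. The name filter and region below are only illustrative and should be adjusted to your setup:
# list Amazon-owned Deep Learning AMIs (illustrative filter and region)
aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=Deep Learning AMI GPU*" \
  --region us-east-1 \
  --query 'Images[].[ImageId,Name]' \
  --output table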
Checking Nvidia and CUDA versions
Next, we verify that the Nvidia driver is working properly by running the nvidia-smi utility. A healthy setup produces output like the one below; if the driver could not communicate with the GPU, it would print an error message instead.
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 32C P0 60W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
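Note that nvidia-smi reports the CUDA version supported by the driver. To also check the version of the CUDA toolkit installed on the AMI, we can use nvcc (assuming the toolkit is on the PATH):
nvcc --version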
Checking Nvidia with Docker
We can check whether the Docker runtime can access the GPU by running:
sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 28C P8 12W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
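As an additional sanity check, we can confirm that the nvidia runtime is registered with Docker (the exact output format varies across Docker versions):
sudo docker info | grep -i runtimes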
Installing k3s
We covered how to install and configure k3s for AWS ECR in earlier articles in this series; the following is a quick command to install k3s:
curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644
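Once the install completes, we can confirm that the single-node cluster is up and the node is Ready:
kubectl get nodes -o wide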
Configuring Nvidia GPU Operator
We will install the Nvidia GPU Operator following the instructions from the operator page, which describes it as follows:
Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, Infiniband adapters and other devices through the device plugin framework. However, configuring and managing nodes with these hardware resources requires configuration of multiple software components such as drivers, container runtimes or other libraries which are difficult and prone to errors. The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPU. These components include the NVIDIA drivers (to enable CUDA), Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, automatic node labelling using GFD, DCGM based monitoring and others.
To install the operator:
# install the helm utility
sudo snap install helm --classic
# add the nvidia helm repo and update it
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
# install the operator
helm install --wait nvidiagpu \
-n gpu-operator --create-namespace \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value=true \
nvidia/gpu-operator
NAME: nvidiagpu
LAST DEPLOYED: Tue Aug 8 00:54:41 2023
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
# TIP: to uninstall
# helm uninstall -n gpu-operator nvidiagpu
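Before running GPU pods, it is worth verifying that the operator components came up and that the node now advertises the GPU as an allocatable resource. A minimal sketch of such checks (pod names and readiness timing will vary; the nvidia RuntimeClass referenced by the pods below is expected to be created by the operator):
# operator, toolkit and device-plugin pods should be Running/Completed
kubectl get pods -n gpu-operator
# the nvidia RuntimeClass should exist
kubectl get runtimeclass nvidia
# the node should now report an allocatable GPU
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'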
Checking nvidia-smi in a Kubernetes pod
We now check whether nvidia-smi can be run inside a Kubernetes pod. We define a pod that runs nvidia-smi below:
# gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.6.2-base-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
Deploy this pod:
kubectl apply -f gpu-pod.yaml
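Note that the image pull can take some time; we can watch the pod until its status shows Completed:
kubectl get pod gpu-pod --watch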
Once the pod has run to completion, check its logs:
kubectl logs gpu-pod
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 30C P8 24W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This shows the GPU being accessed from inside a pod running in Kubernetes.
Checking that pods can run CUDA operations
To test whether a pod can actually run CUDA operations on the GPU, we do the following:
First, we define a pod yaml:
# vec-add-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: vec-add-pod
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
We deploy the pod:
kubectl apply -f vec-add-pod.yaml
Note that it takes some time for the image to be pulled and the pod to run to completion. We then check the logs of the pod:
kubectl logs vec-add-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
This shows that the pod was able to run CUDA operations and used the GPU successfully.
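Optionally, the test pods can now be cleaned up:
kubectl delete -f gpu-pod.yaml -f vec-add-pod.yaml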
Summary
We showed how to deploy the Nvidia GPU Operator to GPU-enable Kubernetes and k3s, and verified from inside a pod that the GPU is accessible with nvidia-smi.
We then deployed a pod that ran CUDA operations, proving that CUDA workloads can run inside Kubernetes and k3s.
Next, we will show how to serve the gpt2 model that we deployed earlier using the GPU rather than the CPU.