Setting up a Kubernetes cluster to run AI applications

Gong Ys
Sep 20, 2023 · 12 min read

LLM training and fine-tuning are popular these days, and there is real demand to stand up a Kubernetes cluster for these tasks quickly. Streamlining the whole process can be challenging, especially if you are new to it. In this article, I break it into a few steps:

- Understand the cluster environment: the number of nodes, what is in each node, and how the nodes are connected.

- Install a base K8s cluster, using a cluster management tool to set it up.

- Set up GPU features, so that K8s can offer the GPUs on each node to its workloads.

- Set up network features: RDMA or RoCE should work for GPU communication if high performance is needed.

1. Understand the environment

The two nodes are set up identically: besides the CPUs, memory, and two 10G NIC cards, each node hosts eight NVIDIA A40 GPU cards and one Mellanox 100G card.

$ nvidia-smi 
Sun Aug 27 13:40:40 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A40 On | 00000000:35:00.0 Off | 0 |
| 0% 30C P8 29W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:36:00.0 Off | 0 |
| 0% 31C P8 29W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A40 On | 00000000:39:00.0 Off | 0 |
| 0% 30C P8 29W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A40 On | 00000000:3D:00.0 Off | 0 |
| 0% 29C P8 28W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A40 On | 00000000:9C:00.0 Off | 0 |
| 0% 28C P8 29W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A40 On | 00000000:9D:00.0 Off | 0 |
| 0% 30C P8 30W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A40 On | 00000000:A0:00.0 Off | 0 |
| 0% 31C P8 31W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A40 On | 00000000:A4:00.0 Off | 0 |
| 0% 31C P8 31W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
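
Besides `nvidia-smi`, a few standard Linux commands help confirm the NIC layout before going further (the interface name `ens121np0` is the RoCE-capable port in this environment; yours will likely differ):

$ lspci | grep -i mellanox
$ ip -br link show
$ ethtool ens121np0 | grep -i speed

`lspci` confirms the Mellanox card is visible on the PCI bus, `ip -br link` lists all interfaces with their state, and `ethtool` reports the link speed of the 100G port.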

2. Install a base K8s cluster with two nodes

I am using kubeclipper, a handy K8s cluster management tool. Please refer to its Get Started guide for a quick start.

2.1 Prepare the OS

2.1.1 Set up passwordless login

The kubeclipper tool needs passwordless SSH access to all nodes. Use the following steps to set it up:

$ sudo bash
# cd
# ssh-keygen
# cd .ssh
# cat id_rsa.pub >> ./authorized_keys
# ssh-copy-id 10.10.10.12
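
A quick check that passwordless access actually works before handing the nodes to kubeclipper (10.10.10.12 is the second node in this setup):

# ssh -o BatchMode=yes 10.10.10.12 hostname

With `BatchMode=yes`, SSH fails immediately instead of prompting, so the command only succeeds if key-based login is in place.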

2.1.2 Disable the GSP-RM

There is a known issue with the K8s GPU operator related to the GSP-RM firmware (see the "Disable the GSP-RM" reference). Use the following steps to work around it:

# echo options nvidia NVreg_EnableGpuFirmware=0 > /etc/modprobe.d/nvidia-gsp.conf
# update-initramfs -u
# reboot
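
After the reboot, you can verify that the module parameter took effect; the exact fields depend on the driver version, so treat this as a sanity check rather than a guaranteed output:

# grep EnableGpuFirmware /proc/driver/nvidia/params
# nvidia-smi -q | grep -i "GSP Firmware"

The first command should report the parameter set to 0, and `nvidia-smi -q` should show the GSP firmware version as N/A once it is disabled.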

2.2. Install the kubeclipper All-in-one

In production settings, kubeclipper itself can run as a cluster; here we use an all-in-one (AIO) installation.

$ sudo bash
# curl -sfL https://oss.kubeclipper.io/get-kubeclipper.sh | KC_REGION=cn KC_VERSION=master bash -
# kcctl deploy --user root --pk-file .ssh/id_rsa --ip-detect interface=ens11f1
# kcctl login -H http://localhost -u admin -p Thinkbig1
# kcctl get node
+--------------------------------------+-------------+---------+-------------+-------------+-----+-----------+
| ID | HOSTNAME | REGION | IP | OS/ARCH | CPU | MEM |
+--------------------------------------+-------------+---------+-------------+-------------+-----+-----------+
| 60e5e75e-4ff7-419e-9c67-e444d8d88f9d | ubuntu10053 | default | 10.10.10.11 | linux/amd64 | 128 | 1031710Mi |
+--------------------------------------+-------------+---------+-------------+-------------+-----+-----------+

As we can see, after `kcctl deploy` runs, the node where kubeclipper is running is registered into the kubeclipper node pool.

2.3. Add the second node

Before creating the K8s cluster, we need to add one more node to kubeclipper. The `kcctl join` command joins nodes easily; just make sure the nodes are reachable from the node where `kcctl` runs.

# kcctl join --agent 10.10.10.12 --pk-file .ssh/id_rsa --ip-detect interface=ens11f1
# kcctl get node
+--------------------------------------+-------------+---------+-------------+-------------+-----+-----------+
| ID | HOSTNAME | REGION | IP | OS/ARCH | CPU | MEM |
+--------------------------------------+-------------+---------+-------------+-------------+-----+-----------+
| 4fea0dea-8467-41a5-bb5e-b1fe2f360cf8 | ubuntu10054 | default | 10.10.10.12 | linux/amd64 | 128 | 1031710Mi |
| 60e5e75e-4ff7-419e-9c67-e444d8d88f9d | ubuntu10053 | default | 10.10.10.11 | linux/amd64 | 128 | 1031710Mi |
+--------------------------------------+-------------+---------+-------------+-------------+-----+-----------+

Since each node has multiple NICs, `--ip-detect interface=ens11f1` specifies which card will be used for communication among K8s components.
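
If you are unsure which interface to pass, check which one carries the node IP used above (the interface name and address are from this environment):

# ip -br addr show ens11f1

It should list the 10.10.10.x address that kubeclipper reported for the node.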

2.4. Deploy the k8s cluster with two nodes

Now we have two nodes available in the kubeclipper node pool, ready to be deployed as K8s nodes. Here, we use `ubuntu10053` as a combined control-plane and worker node, and `ubuntu10054` as another worker node.

# kcctl create cluster --master 10.10.10.11 --worker 10.10.10.12 --cluster-dns-domain ai.testio --untaint-master --name llm-cluster
[2023-08-30T03:40:48Z][INFO] use default containerd version 1.6.4
[2023-08-30T03:40:48Z][INFO] use default calico version v3.26.1
[2023-08-30T03:40:48Z][INFO] use default k8s version v1.27.4
+-------------+-------------------------------+
| NAME | CREATE TIMESTAMP |
+-------------+-------------------------------+
| llm-cluster | 2023-08-30 03:40:48 +0000 UTC |
+-------------+-------------------------------+

The `--untaint-master` parameter enables the master node to work as both a control-plane and a worker node.

Note: only use `--cluster-dns-domain ai.testio` if you are sure the apps you will install on the cluster do not hard-code the default `cluster.local` domain.
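
Once the cluster is up, a quick way to confirm the custom DNS domain is active (the busybox image and pod name here are just for illustration):

# kubectl run dns-check --rm -it --image=busybox --restart=Never -- nslookup kubernetes.default.svc.ai.testio

The in-cluster `kubernetes` service should resolve under `ai.testio` instead of `cluster.local`.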

Use the following command to get the cluster status and installation progress:

# kcctl get cluster -o yaml|grep status -A5
status:
  phase: Installing
  versions:
    apiserver: ""
    controlPlane: ""
    controllerManager: ""

Thanks to the team's effort, the installation takes just a few minutes; in my case, about 6 minutes.
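
Once the phase turns to Running, the cluster can be checked with plain `kubectl` from the control-plane node (assuming the kubeconfig is already in place there):

# kubectl get node -o wide

Both `ubuntu10053` and `ubuntu10054` should show up as Ready.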

As a K8s cluster management tool, kubeclipper makes it easy to add and remove nodes, join nodes to a K8s cluster, remove nodes from a cluster, destroy the whole K8s cluster, and even uninstall itself.

Please refer to its GitHub site for more usage. Besides the `kcctl` command line, it also has a nice GUI.

3. Set up GPU features

GPUs are essential for running AI applications. Here we will use the Nvidia GPU operator to manage them.

3.1. Deploy the node feature discovery

The Nvidia GPU operator can install node feature discovery (NFD) for us, but here we install it independently.

# helm repo add node-feature-discovery https://kubernetes-sigs.github.io/node-feature-discovery/charts
# helm search repo
# helm install nfd node-feature-discovery/node-feature-discovery --set image.repository=k8s.m.daocloud.io/nfd/node-feature-discovery --create-namespace --namespace=gpu-operator

In the above install command, `--set image.repository=k8s.m.daocloud.io/nfd/node-feature-discovery` overrides the image so it is pulled from a mirror registry, which helps when the default registry is hard to reach.
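
To confirm NFD is doing its job, look for the `feature.node.kubernetes.io/...` labels it adds to the nodes; the exact label set depends on the hardware it detects:

# kubectl get node ubuntu10053 --show-labels | tr ',' '\n' | grep feature.node.kubernetes.io | head

On a GPU node you should see, among others, a PCI vendor label for NVIDIA (vendor ID 10de).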

3.2. Deploy the GPU operator

# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
# helm install --namespace gpu-operator --create-namespace gpu-operator nvidia/gpu-operator --set nfd.enabled=false

`--set nfd.enabled=false` is used because NFD has already been installed independently. Once the operator pods are up, the GPUs should be ready for use.
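
Before checking the node resources, it is worth making sure all operator pods have settled (pod names vary; this just lists the namespace):

# kubectl get pod -n gpu-operator

The driver, container-toolkit, device-plugin, DCGM exporter, and validator pods should all be Running or Completed.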

To check whether the GPU devices are reported as allocatable resources:

# kubectl describe node ubuntu10053 | grep Allocatable -A 6
Allocatable:
  cpu:                128
  ephemeral-storage:  850179124660
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056369200Ki
  nvidia.com/gpu:     8

`nvidia.com/gpu: 8` shows us the node has 8 GPUs.

3.3. Test the GPU allocation

We run the `nvidia-smi` tool inside a pod to show that K8s can now allocate GPUs.
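
The article does not show `gpu-operator/k8s-gpu-test.yaml` itself (it lives in my repo); a minimal sketch of what such a test Deployment could look like, with an assumed CUDA base image and each replica requesting all eight GPUs on its node, is:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
spec:
  replicas: 2                      # one pod per node, spread by the scheduler
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
      - name: gpu-test
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed image; any CUDA base image works
        command: ["sh", "-c", "nvidia-smi && sleep infinity"]
        resources:
          limits:
            nvidia.com/gpu: 8      # request all eight GPUs of a node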

# kubectl apply -f gpu-operator/k8s-gpu-test.yaml
# kubectl get pod
NAME READY STATUS RESTARTS AGE
gpu-test-6d5c876688-9xslc 1/1 Running 0 80s
gpu-test-6d5c876688-hlqnx 1/1 Running 0 80s
# kubectl logs gpu-test-6d5c876688-9xslc
Wed Aug 30 09:18:54 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A40 On | 00000000:35:00.0 Off | 0 |
| 0% 31C P8 30W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:36:00.0 Off | 0 |
| 0% 32C P8 30W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A40 On | 00000000:39:00.0 Off | 0 |
| 0% 33C P8 32W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A40 On | 00000000:3D:00.0 Off | 0 |
| 0% 29C P8 28W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A40 On | 00000000:9C:00.0 Off | 0 |
| 0% 29C P8 29W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A40 On | 00000000:9D:00.0 Off | 0 |
| 0% 31C P8 32W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A40 On | 00000000:A0:00.0 Off | 0 |
| 0% 31C P8 30W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A40 On | 00000000:A4:00.0 Off | 0 |
| 0% 32C P8 30W / 300W | 2MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

4. Set up the RoCE network feature

Although the Nvidia network operator can install the NIC driver for us, in my environment I install the Linux InfiniBand drivers and enable the RoCE feature manually before deploying the network operator.
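
I will not repeat the driver installation here; once the driver is in place, the host should expose an RDMA device backed by the Ethernet NIC. Two quick checks (both tools ship with rdma-core/MLNX_OFED):

# ibv_devinfo | grep -E 'hca_id|link_layer'
# rdma link show

`link_layer: Ethernet` on the mlx5 device is what RoCE uses, and `rdma link` shows which netdev (here `ens121np0`) backs the RDMA device.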

4.1. Deploy the network operator

Nvidia network operator deployment is described clearly in the Nvidia Network Operator documentation.

# helm search repo nvidia/network-operator
NAME CHART VERSION APP VERSION DESCRIPTION
nvidia/network-operator 23.7.0 v23.7.0 Nvidia network operator

# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia

Here, we use Network Operator Deployment with a Secondary Network configuration.

# cat > net-values.yaml <<EOF
test:
  pf: ens121np0

nfd:
  enabled: false
sriovNetworkOperator:
  enabled: false

# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: false

nvPeerDriver:
  deploy: false

rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      ifNames: [ens121np0]

sriovDevicePlugin:
  deploy: false

secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: true
EOF

In the above command, we set `nfd.enabled: false` since NFD is installed already.

# helm install network-operator -n nvidia-network-operator --create-namespace ./ -f net-values.yaml
# kubectl get pod -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
cni-plugins-ds-25jpf 1/1 Running 0 14h
cni-plugins-ds-fhfrz 1/1 Running 0 14h
kube-multus-ds-d77pg 1/1 Running 0 14h
kube-multus-ds-dpnjc 1/1 Running 0 14h
network-operator-764bccf98b-4vrxt 1/1 Running 0 14h
rdma-shared-dp-ds-pnxqz 1/1 Running 0 14h
rdma-shared-dp-ds-rvtqk 1/1 Running 0 14h
whereabouts-79zq5 1/1 Running 0 14h
whereabouts-c6fvh 1/1 Running 0 14h

As we can see, the network operator installed several Kubernetes resources according to the previous configuration:

- The CNI plugins daemonset, which copies the CNI binaries into `/opt/cni/bin`.

- The Multus daemonset, which lets K8s attach multiple network interfaces to a pod.

- The network operator itself, which provides CRD APIs for us to create the related resources.

- The RDMA shared device plugin, a device plugin that shares the host's RoCE-enabled card with K8s workloads (we verify the advertised resource right after this list).

- Whereabouts, an IPAM plugin for CNI.
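
As a quick check that the RDMA shared device plugin registered its resource, it should now show up in the node's allocatable resources:

# kubectl describe node ubuntu10053 | grep rdma_shared_device_a

A non-zero count for `rdma/rdma_shared_device_a` means workloads can request the shared RDMA device.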

After deploying the network operator with Helm, run the Helm test with the following command:

# helm test -n nvidia-network-operator network-operator --timeout=5m
NAME: network-operator
LAST DEPLOYED: Sat Sep 2 09:49:25 2023
NAMESPACE: nvidia-network-operator
STATUS: deployed
REVISION: 3
TEST SUITE: network-operator-dev-plugin-test
Last Started: Sat Sep 2 09:50:45 2023
Last Completed: Sat Sep 2 09:50:54 2023
Phase: Succeeded

4.2. Create a Macvlan network

To add a second NIC to a workload, another CNI is needed; here we use the macvlan CNI. Please refer to the MacvlanNetwork CRD documentation for usage.

We use Whereabouts IPAM for IP address management for the second NIC.

Let’s look at the network definition first:

# cat network-operator/macvlan-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: macvlannetwork-1
spec:
  networkNamespace: "nvidia-network-operator"
  master: "ens121np0"
  mode: "bridge"
  mtu: 1500
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.2.0/24",
      "exclude": [
        "192.168.2.0/32",
        "192.168.2.1/32"
      ],
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info",
      "gateway": "192.168.2.1"
    }

It defines a `MacvlanNetwork` resource that is handled by the Nvidia network operator. `master: "ens121np0"` points at the host's RoCE-enabled NIC.

# kubectl apply -f network-operator/macvlan-network.yaml
# kubectl get network-attachment-definitions -n nvidia-network-operator -oyaml
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    creationTimestamp: "2023-08-30T13:47:29Z"
    generation: 1
    labels:
      nvidia.network-operator.state: state-Macvlan-Network
    name: macvlannetwork-1
    namespace: nvidia-network-operator
    ownerReferences:
    - apiVersion: mellanox.com/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: MacvlanNetwork
      name: macvlannetwork-1
      uid: 89a8162d-f3f1-4bcd-a486-dce89f591ffa
    resourceVersion: "70239"
    uid: 8891e852-c8c2-4027-b158-0f4dca48fdda
  spec:
    config: '{ "cniVersion":"0.3.1", "name":"macvlannetwork-1", "type":"macvlan","master":
      "ens121np0","mode" : "bridge","mtu" : 1500,"ipam":{"type":"whereabouts","datastore":"kubernetes","kubernetes":{"kubeconfig":"/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"},"range":"192.168.2.0/24","exclude":["192.168.2.0/32","192.168.2.1/32"],"log_file":"/var/log/whereabouts.log","log_level":"info","gateway":"192.168.2.1"}
      }'
kind: List
metadata:
  resourceVersion: ""

The `MacvlanNetwork` resource is translated into a K8s `NetworkAttachmentDefinition` by the operator.

4.3. Test the network

Here we use the Nvidia MOFED test image to test the RDMA connection. We create a K8s Deployment that places one pod on each node. The resource configuration file looks like this:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rdma-test
spec:
  selector:
    matchLabels:
      app: rdma
  replicas: 2
  template:
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: nvidia-network-operator/macvlannetwork-1
      labels:
        app: rdma
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - rdma
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: rdma-test
        image: mellanox/rping-test
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        resources:
          limits:
            rdma/rdma_shared_device_a: 1
          requests:
            rdma/rdma_shared_device_a: 1
        command:
        - sh
        - -c
        - |
          ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
          sleep 1000000

Each pod gets a second network, defined by the annotation `k8s.v1.cni.cncf.io/networks: nvidia-network-operator/macvlannetwork-1`, and access to an InfiniBand device, requested via `rdma/rdma_shared_device_a: 1` in the spec. The `podAntiAffinity` rule guarantees the two pods are placed on different nodes.

Now we can use normal K8s commands to create the pods and look at their placement:

# kubectl apply -f network-operator/rdma-test.yaml
# kubectl get pod -owide -l app=rdma
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
rdma-test-5f7768c6dd-4nqtx 1/1 Running 0 18m 172.25.201.168 ubuntu10053 <none> <none>
rdma-test-5f7768c6dd-d28l2 1/1 Running 0 18m 172.25.33.26 ubuntu10054 <none> <none>

When the pods are running, the connection test can be run with the `rping` command.

First, let's have a look at the network devices of these two pods. Since the two pods have the same settings, we just check one of them:

$ kubectl logs rdma-test-5f7768c6dd-4nqtx
/dev/infiniband:
total 0
crw------- 1 root root 231, 64 Sep 13 03:52 issm0
crw-rw-rw- 1 root root 10, 57 Sep 13 03:52 rdma_cm
crw------- 1 root root 231, 0 Sep 13 03:52 umad0
crw-rw-rw- 1 root root 231, 192 Sep 13 03:52 uverbs0

/sys/class/infiniband:
total 0
lrwxrwxrwx 1 root root 0 Sep 13 03:52 mlx5_0 -> ../../devices/pci0000:16/0000:16:02.0/0000:17:00.0/infiniband/mlx5_0

/sys/class/net:
total 0
lrwxrwxrwx 1 root root 0 Sep 13 03:52 eth0 -> ../../devices/virtual/net/eth0
lrwxrwxrwx 1 root root 0 Sep 13 03:52 lo -> ../../devices/virtual/net/lo
lrwxrwxrwx 1 root root 0 Sep 13 03:52 net1 -> ../../devices/virtual/net/net1

The output shows the pod has one InfiniBand-class device named `mlx5_0`. The `net1` Ethernet device is the macvlan interface built on top of the host device `ens121np0`. In fact, the `mlx5_0` RDMA device is backed by `ens121np0` as well.

We can get the IP address of the `net1`:

# kubectl exec -ti rdma-test-5f7768c6dd-4nqtx -- ip a show dev net1
4: net1@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether a2:40:36:3b:40:f4 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.2.226/28 brd 192.168.2.239 scope global net1
valid_lft forever preferred_lft forever
inet6 fe80::a040:36ff:fe3b:40f4/64 scope link
valid_lft forever preferred_lft forever

On the first pod, which works as the server:

# kubectl exec -ti rdma-test-5f7768c6dd-4nqtx -- rping -s -p 12345 -a 192.168.2.226 -v

On the second pod, which works as the client:

# kubectl exec -ti rdma-test-5f7768c6dd-d28l2 -- rping -c -C 5 -p 12345 -a 192.168.2.226 -v

Both sides print ping/pong statistics (output omitted here), which shows that RoCE is working.
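
Beyond `rping`, if the image ships the `perftest` tools (not covered in this article, so treat this as an optional extra), a bandwidth test over the same path looks like this:

# kubectl exec -ti rdma-test-5f7768c6dd-4nqtx -- ib_write_bw -d mlx5_0
# kubectl exec -ti rdma-test-5f7768c6dd-d28l2 -- ib_write_bw -d mlx5_0 192.168.2.226

The first command starts the server side; the second connects from the other pod using the server's `net1` address and reports RDMA write bandwidth.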

Enjoy your RoCE-enabled and GPU-enabled Kubernetes cluster!

My Git repo has all the code used in this article.
