GPU time-sharing with multiple workloads in Google Kubernetes Engine

Raj Shah
Opsnetic
Jul 25, 2022

GPU resources are required for graphics rendering, machine learning training, and inference scenarios. If you manage these workloads in Kubernetes, using expensive GPU resources effectively can significantly reduce the cost of your overall infrastructure!

What is Time-Sharing?

Kubernetes enables applications to precisely request the resource amounts they need to function. While you can request fractional CPU units for applications, you can’t request fractional GPU units.

Time-sharing is a GKE (Google Kubernetes Engine) feature that lets multiple containers share a single physical GPU attached to a node. Using GPU time-sharing in GKE lets you more efficiently use your attached GPUs and save running costs. Time-shared GPUs are ideal for running workloads that don’t need to use high amounts of GPU resources all the time.

Limitations to keep in mind

Before using GPUs on GKE, keep in mind the following limitations:

  • You cannot add GPUs to existing node pools.
  • GPU nodes cannot be live migrated during maintenance events.
  • The GPU type you can use depends on the machine series: the A2 machine series supports A100 GPUs, and the N1 machine series supports all GPUs except the A100.
  • GPUs are not supported in Windows Server node pools.
  • The maximum number of containers that can share a single physical GPU is 48.
  • You can enable GPU time-sharing on GKE Standard clusters and node pools running GKE version 1.23.7-gke.1400 and later (see the version check below).
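If you want to confirm which GKE versions are available before enabling time-sharing, you can list them for your zone. A quick check, assuming the us-central1-a zone used later in this post:

gcloud container get-server-config \
--zone=us-central1-a \
--format="yaml(validMasterVersions)"

This prints the master versions you can pass to --cluster-version when creating the cluster.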

Creating a GPU Time-Sharing GKE cluster

Step 1: Create a GKE cluster with the following gcloud command

You can run the gcloud commands in Cloud Shell or in any shell that is authorized to interact with your GCP project.

gcloud container clusters create gpu-time-sharing \
--zone=us-central1-a \
--cluster-version=1.23.5-gke.1503 \
--machine-type=n1-standard-2 \
--accelerator=type=nvidia-tesla-k80,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=48 \
--disk-type "pd-standard" \
--disk-size "50" \
--max-pods-per-node "48" \
--enable-ip-alias \
--default-max-pods-per-node "48" \
--spot \
--num-nodes "1"

The above command creates a GKE cluster with Spot VMs, which can cut the cost of the overall infrastructure by more than half.

You can use GPUs with Spot VMs if your workloads can tolerate frequent node disruptions.

You can look at the GPU platforms available to attach to nodes and choose the one that best fits your needs.
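As a quick check, you can list the accelerator types offered in a zone; the command below assumes the us-central1-a zone used in this post:

gcloud compute accelerator-types list \
--filter="zone:us-central1-a"

The output includes types such as nvidia-tesla-k80 and nvidia-tesla-p4, which are the ones used in the commands in this post.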

Successfully deployed GKE GPU Time-Sharing cluster

Step 2: Create an additional node pool (Optional)

gcloud beta container node-pools create "pool-1" \
--cluster "gpu-time-sharing" \
--zone "us-central1-a" \
--node-version "1.23.5-gke.1503" \
--machine-type "n1-standard-2" \
--accelerator "type=nvidia-tesla-p4,count=1" \
--disk-type "pd-standard" \
--disk-size "50" \
--num-nodes "1" \
--enable-autoupgrade \
--enable-autorepair \
--max-pods-per-node "48" \
--spot

Step 3: Get access to the GKE cluster through kubeconfig

The command below adds credentials for the cluster to your kubeconfig file, so you can switch context to the newly created cluster and run kubectl commands against it.

gcloud container clusters get-credentials gpu-time-sharing --zone=us-central1-a
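To confirm that kubectl now points at the new cluster, you can check the current context:

kubectl config current-context

The context name should include your project, the zone, and the gpu-time-sharing cluster name.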

Step 4: Testing GPU time-sharing functionality

kubectl get nodes
Displaying Nodes on Cluster

As we created the cluster with only one node in the node pool, we see that single node in the kubectl output.

Now, install the GPU device drivers from NVIDIA that manage the time-sharing division of the physical GPUs. To install the drivers, you deploy a GKE installation DaemonSet that sets the drivers up.

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

The above command deploys the installation DaemonSet and installs the default GPU driver version. You can find more information regarding this installation here.
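Before moving on, you can optionally verify that the driver installer pods are running. This is a minimal check, assuming the DaemonSet from the manifest above is named nvidia-driver-installer and lives in the kube-system namespace:

kubectl get daemonset nvidia-driver-installer -n kube-system
kubectl get pods -n kube-system -l k8s-app=nvidia-driver-installer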

kubectl describe nodes gke-gpu-time-sharing-default-pool-6a5e9e79-rh8x
Checking Node Allocatable Capacity

After describing the node, we can verify that 48 allocatable GPUs are reported, which are logical fragments of the single physical GPU we originally attached!
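Rather than scanning the full describe output, you can also read the allocatable GPU count directly with a JSONPath query (using the node name from above; yours will differ):

kubectl get node gke-gpu-time-sharing-default-pool-6a5e9e79-rh8x \
-o jsonpath='{.status.allocatable.nvidia\.com/gpu}'

This should print 48 for a GPU configured with max-shared-clients-per-gpu=48.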

Now, the following Kubernetes manifest defines a Deployment whose containers print the UUID of the GPU attached to them.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-simple
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cuda-simple
  template:
    metadata:
      labels:
        app: cuda-simple
    spec:
      nodeSelector:
        cloud.google.com/gke-gpu-sharing-strategy: time-sharing
        cloud.google.com/gke-max-shared-clients-per-gpu: "48"
      containers:
      - name: cuda-simple
        image: nvidia/cuda:11.0-base
        command:
        - bash
        - -c
        - |
          /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
        resources:
          limits:
            nvidia.com/gpu: 1
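To create the Deployment, save the manifest to a file and apply it; the filename cuda-simple.yaml below is just an assumed name for illustration:

kubectl apply -f cuda-simple.yaml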

After the Deployment is created successfully, we can see that 3 pods are in the Running state.

kubectl get pods
Showing Pods under the Deployment

By printing each pod's logs, we can confirm that a GPU fragment has been allocated to every pod, as shown below.

$ kubectl logs cuda-simple-749bf54c4d-864zv
GPU 0: Tesla K80 (UUID: GPU-c9cbf47c-b630-d1d3-b79f-421ec976fbc5)
$ kubectl logs cuda-simple-749bf54c4d-csq4d
GPU 0: Tesla K80 (UUID: GPU-c9cbf47c-b630-d1d3-b79f-421ec976fbc5)
$ kubectl logs cuda-simple-749bf54c4d-dnxqx
GPU 0: Tesla K80 (UUID: GPU-c9cbf47c-b630-d1d3-b79f-421ec976fbc5)

Since we created a single-node cluster with one GPU attached to the node, we are effectively dealing with only one physical GPU.

All the pods on that node receive logical fragments of that same GPU, and at most 48 pods per node can share it this way.
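To observe this in action, you can scale the Deployment beyond its initial 3 replicas; the command below scales it to the 10 replicas shown in the following screenshot:

kubectl scale deployment cuda-simple --replicas=10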

kubectl get pods
Scaling Pods under the Deployment

Looking at the node description again, we can confirm that 10 logical GPU fragments of our Tesla K80 have been allocated to the running pods (each pod requests 1 GPU resource from the node).

Looking at Allocated resources of the node after rescaling deployment

Conclusion

In this post, we discussed how to achieve GPU time-sharing in a GKE environment, how to utilize GPUs efficiently, and how to gain cost benefits with Spot instances.

If you need help with DevOps practices, or Kubernetes at your company, feel free to reach out to us at Opsnetic.
