Looking for GPU Capacity? DWS Has You Covered!

Guilhem Tesseyre
Zencore Engineering
11 min read · Feb 13, 2024


Unless you have been living under a rock for the past year or so, you have probably heard of the rush to Artificial Intelligence (AI). With the advent of Large Language Models (LLMs), just about every company out there is doing something with AI in some fashion: creating their own solution, training their own models, using someone else's, building Machine Learning (ML) models the "old-fashioned" way… in short, something! Naturally, this craze has put quite a bit of stress on the hardware infrastructure those technologies rely on, i.e. CPUs, GPUs, and TPUs. Graphics Processing Units (GPUs) have been the tool of choice for training models and running inference against them for some time now, and they are part of the standard hardware toolkit for ML. There is one (very) happy company out there (yes, I'm looking at you, Nvidia :)…) that has been in the (unique) driver's seat of the GPU supply, and at this point it can't seem to manufacture them fast enough given the demand across all cloud providers and users. Needless to say, in this context, GPU capacity on the latest and fastest hardware has been scarce.

At Zencore we work with a diverse set of companies operating in the AI space. One of the patterns we have seen emerge over the last couple of years is the use of Kubernetes (more specifically GKE in our case, as we work with Google Cloud) for large-scale computational jobs aimed at training ML models. GKE provides a lot of features around self-healing, auto-scaling, auto-upgrades, and node provisioning that make it relatively straightforward to automate the provisioning of machines with specific hardware attached (e.g. GPUs) to run jobs (e.g. ML training) without having to set up HPC-like schedulers and resources. We have seen GKE become one of the de facto options for HPC-like infrastructure in the cloud, especially for ML work.

One recent piece of GKE news is its integration with a feature called Dynamic Workload Scheduler (DWS) (announcement here), which is designed to help with scheduling the accelerators needed for ML training by leveraging a controlled pool of capacity dedicated to this feature. It has two scheduling modes, Flex Start (currently available) and Calendar (preview in Q1 2024), depending on whether you'd rather schedule your workload to start at a specific future date/time for a defined duration (Calendar mode), or you are OK with letting DWS find and schedule the capacity as your job becomes ready to run (Flex Start mode).

Dynamic Workload Scheduler intelligently persists the request; once the capacity becomes available, it automatically provisions your VMs enabling your workloads to run continuously for the entire duration of the capacity allocation. Dynamic Workload Scheduler supports capacity requests for up to seven days, with no minimum duration requirement. You can request capacity for as little as a few minutes or hours; typically, the scheduler can fulfill shorter requests more quickly than longer ones.

To make a simple analogy, DWS is basically Tetris for GPU capacity requests in Google Cloud. It knows the shape and resources needed for the requests coming in, and it schedules job execution accordingly to maximize GPU usage. Below is a visual illustrating DWS's function, and it very much reminds me of my 8-bit Game Boy days :)

DWS orchestrates the scheduling of the requests based on GPU capacity

I decided to give it a try in combination with Kueue (an open-source job queuing solution that runs on top of Kubernetes), and I will walk you through a few things in the rest of this blog post:

  • How to set up a GKE cluster with DWS enabled node-pool(s)
  • How to install Kueue and use it
  • How to use DWS to get A100 and T4 GPU capacity for job execution

GKE Cluster & node-pool setup

First, let's create and configure a GKE cluster. For the purpose of this demo, I am going to keep things simple and use gcloud to provision my cluster, creating just the node-pools I need on relatively small instances and using Spot instances where I can to lower my cost.

First, let's create our cluster, g-dws-test-cluster:

gcloud container clusters create g-dws-test-cluster \
--region us-central1 \
--release-channel rapid

Then let's create a dedicated node-pool for T4 GPUs with DWS enabled:

gcloud beta container node-pools create dws-t4-nodepool \
--cluster=g-dws-test-cluster \
--region=us-central1 \
--node-locations=us-central1-c \
--num-nodes=0 \
--machine-type=n1-standard-2 \
--enable-autoscaling \
--total-min-nodes=0 \
--total-max-nodes=1 \
--disk-size=20GB \
--enable-queued-provisioning \
--reservation-affinity=none \
--accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=latest

And the same thing for A100 GPUs, with DWS enabled as well:

gcloud beta container node-pools create dws-a100-nodepool \
--cluster=g-dws-test-cluster \
--region=us-central1 \
--node-locations=us-central1-a \
--num-nodes=0 \
--machine-type=a2-highgpu-1g \
--enable-autoscaling \
--total-min-nodes=0 \
--total-max-nodes=1 \
--disk-size=20GB \
--enable-queued-provisioning \
--reservation-affinity=none \
--accelerator type=nvidia-tesla-a100,count=1,gpu-driver-version=latest

For ease of presentation, readability of the results, and cost control via Spot instances, I'm going to delete the default node-pool and create a system-nodepool to run our system pods as well as Kueue.

gcloud beta container node-pools delete default-pool --cluster=g-dws-test-cluster --region=us-central1
gcloud beta container node-pools create system-nodepool \
--cluster=g-dws-test-cluster \
--region=us-central1 \
--node-locations=us-central1-f \
--num-nodes=1 \
--machine-type=e2-small \
--enable-autoscaling \
--total-min-nodes=1 \
--total-max-nodes=3 \
--disk-size=20GB \
--enable-autoupgrade \
--enable-autorepair \
--spot

We now have a working GKE cluster with 3 node-pools, 2 of them with DWS enabled: 1 for A100 GPUs and 1 for T4 GPUs.
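As a quick sanity check (this is just a verification step I'd run, assuming the same cluster and region names as above), listing the node-pools should show all three, with the DWS-enabled pools at zero nodes:

```shell
# List the node-pools of our cluster; dws-t4-nodepool and dws-a100-nodepool
# should exist with autoscaling enabled and no nodes provisioned yet.
gcloud container node-pools list \
  --cluster=g-dws-test-cluster \
  --region=us-central1
```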

Kueue installation & setup

Kueue Components

I used this GitHub repo to get some DWS examples, which I tweaked for my own usage, as well as the manifests to install Kueue:

git clone https://github.com/GoogleCloudPlatform/ai-on-gke
cd ai-on-gke/tutorials-and-examples/workflow-orchestration/dws-examples

You will find the following 3 files in that folder:

  • kueue-manifests.yaml — contains the installation manifests and configuration instructions for Kueue in the cluster
  • dws-queues.yaml — creates a cluster-level queue that wraps the DWS integration, along with a sample namespace-level queue that connects to it
  • job.yaml — a sample job spec that uses DWS

kubectl create -f ./kueue-manifests.yaml
kubectl create -f ./dws-queues.yaml
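For context, here is roughly what the DWS wiring in dws-queues.yaml boils down to. This is a simplified sketch (object names and quota numbers are illustrative; check the repo for the actual manifests): an AdmissionCheck points Kueue at a ProvisioningRequestConfig whose provisioning class is GKE's queued provisioning, and the cluster/local queues are layered on top:

```yaml
# Illustrative sketch only; see dws-queues.yaml in the ai-on-gke repo for the real manifests.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: dws-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: dws-config
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: dws-config
spec:
  provisioningClassName: queued-provisioning.gke.io  # GKE's DWS provisioning class
  managedResources:
  - nvidia.com/gpu
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: dws-cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 10000      # intentionally large; DWS gates the actual capacity
      - name: "memory"
        nominalQuota: 10000Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 10000
  admissionChecks:
  - dws-prov
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: dws-local-queue
  namespace: default
spec:
  clusterQueue: dws-cluster-queue
```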

I'm then going to slightly edit the job YAML to use T4 GPUs for that job. See below: the nodeSelector spec points to the dws-t4-nodepool we created earlier. Note in the metadata the label referencing the local queue (dws-local-queue) created earlier when configuring Kueue; this tells Kueue it is responsible for orchestrating the job. Also note the flag suspend: true, which tells Kubernetes to create the Job resource but hold off on creating its pods. Kueue will flip that flag to false once nodes are ready and the job can be executed.

apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: dws-local-queue
spec:
  parallelism: 1
  completions: 1
  suspend: true
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: dws-t4-nodepool
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:v0.0.3
        args: ["120s"]
        resources:
          requests:
            cpu: "100m"
            memory: "100Mi"
            nvidia.com/gpu: 1
          limits:
            cpu: "100m"
            memory: "100Mi"
            nvidia.com/gpu: 1
      restartPolicy: Never
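With the manifest saved locally (I'm calling the file job-t4.yaml here; that name is my own choice, not from the repo), submitting it and watching Kueue pick it up looks like this:

```shell
# Submit the job; it starts suspended, and Kueue holds it in dws-local-queue
# until DWS can provide the capacity.
kubectl create -f job-t4.yaml

# Watch Kueue's workload object and the resulting provisioning request.
kubectl get workloads -n default -w
kubectl get provisioningrequests -n default
```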

I'll also create another job template, which I'll call sample-job-a100, specifically pointing at the dws-a100-nodepool that we created earlier for the A100 GPUs.

apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job-a100
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: dws-local-queue
spec:
  parallelism: 1
  completions: 1
  suspend: true
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: dws-a100-nodepool
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:v0.0.3
        args: ["120s"]
        resources:
          requests:
            cpu: "100m"
            memory: "100Mi"
            nvidia.com/gpu: 1
          limits:
            cpu: "100m"
            memory: "100Mi"
            nvidia.com/gpu: 1
      restartPolicy: Never

At this point we have a GKE cluster up and running with 2 dedicated node-pools, each waiting for jobs to be executed on specific GPU hardware. We will run some watch commands to observe what happens when we execute the jobs.
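The watch commands themselves are nothing fancy; something along these lines is enough to follow along:

```shell
# Watch nodes appear and disappear as DWS provisions and de-provisions capacity.
kubectl get nodes -w

# Watch the jobs and their pods move from suspended/pending to running.
kubectl get jobs,pods -n default -w

# Watch Kueue's view of the queued workloads.
kubectl get workloads -n default -w
```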

Initial state of the cluster — no nodes provisioned in the DWS enabled node-pools
Picture of the initial running pods — System components and Kueue

Running K8S jobs using DWS

Now that we understand what DWS does and that we have a cluster to test this feature, let’s run our first job, asking for a T4.

The request will be submitted, then accepted by GKE and queued until it is processed.

We can notice the creation of a new CRD type, ProvisioningRequest, which is the resource whose API Kueue uses to process those requests. We can see the provisioning request created for our sample job below; it has been accepted by GKE at this point and is waiting on a node to be provisioned before executing.
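If you want to inspect these objects yourself, list them first and then describe one (the request name is generated, so substitute the one from your list output):

```shell
# List the ProvisioningRequest objects created on behalf of queued jobs.
kubectl get provisioningrequests -n default

# Describe one to see its status conditions and state transitions;
# replace <request-name> with a real name from the list above.
kubectl describe provisioningrequest <request-name> -n default
```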

It takes a few minutes after the request has been accepted for the node to be provisioned
A GKE node with a T4 GPU has been provisioned and can now be used for execution of the job
We can see the T4 node pool now has 1 node provisioned
We can see the addition of a node to the cluster, specifically on the DWS T4 node pool
Our job is being executed on a T4
Since the provisioning request has been addressed, we notice that there are none left in the queue
The execution is now done and the job marked as Completed
Finally from monitoring quotas we can indeed see the allocation of a T4

It took about 5 minutes from asking GKE for a T4 to execute a job to having DWS schedule the job, get us a GPU and a provisioned node, and complete the execution. Granted, we're doing a very small "computation" here, a 2-minute sleep instruction; it's just a stand-in for executing something on a GPU.

Let's now do the same operation with an A100 GPU and see what happens. Those have been in really high demand and at times hard to get. We will run the A100 sample job we described earlier. We can see below the sequence of events regarding the provisioning request.
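As before, this is just a kubectl create against the A100 manifest (again, the file name here is my own):

```shell
# Submit the A100 job and follow its provisioning request through the queue.
kubectl create -f job-a100.yaml
kubectl get provisioningrequests -n default -w
```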

New request expressed — not yet accepted
Request accepted and queued — waiting for provisioning of a node
Node provisioned and request being executed
A new node has been provisioned on the A100 node-pool
Allocation of an A100 GPU on the newly provisioned node
It’s worth noticing the execution of a new job after our previous one
New pods are being deployed for the A100 job
Job using an A100 has been completed in a few minutes
Confirming via quota monitoring the allocation of an A100 GPU
T4 and A100 nodepool scaled back down to 0 after execution of the job

It's interesting to note in the screenshots above the elasticity of this approach. When we requested a specific type of GPU, GKE automatically provisioned new nodes in the respective node-pools, then automatically de-provisioned those nodes after the jobs finished executing. This allows us to save money when jobs aren't running and to release GPUs back into the DWS pool.

Below I wanted to highlight the details of a provisioning request. We can see the Kind used by the GKE cluster to create this object; the job being run is referenced, as well as the queuing mechanism. The current status and transitions reflect where the job stands: currently queued and waiting for a node to be provisioned.

For fun, I had actually created a job requesting an H100; those are in really high demand and low availability, and I was hoping there might be a (magic) loophole to get some without committing to them, but… no… tough luck… my job is still being queued (3 days) as you can see below :). If I had the H100 quota granted and capacity available, it would work the same way as we saw with the other GPUs.

To be honest, we were just able, in a few minutes and without much effort, to get T4 and A100 GPUs. All it took was packaging our jobs in containers and using Kueue as a queuing system to request that GKE provision the required capacity to run them.

Conclusion

When it comes to thinking through how to manage GPU capacity in the cloud, there were already multiple options available and various factors to consider:

  • Dedicated node pools for specific GPU/CPU hardware or machine sizes/shapes
  • Geo-distribution across clusters, regions and zones
  • Usage of spot instances to lower cost as long as capacity is available
  • Backup node pools with reservations to guarantee capacity
  • The ability to convert a job from Spot to standard VMs as a fallback mechanism when capacity runs scarce
  • Committed Use Discounts (CUDs) applied with Reservations to lower the cost of compute
  • Orchestration and queuing mechanism to maximize usage of compute infrastructure for distributed jobs.

DWS now offers another option and completes the portfolio of tricks available to manage capacity. It's a very interesting feature that comes in very handy to maximize GPU usage in the cloud, versus having to build your own orchestration layer for it. It also arrives at a very interesting time, when GPU capacity is becoming harder to come by, offering another avenue to get some of it!

From what I hear, there are even more features coming down the pike around GKE usage for large-scale computational jobs, so stay tuned. I hope you will be able to snag some GPUs through the Tetris mastermind implemented in DWS!

Photo by Aedrian on Unsplash
