Packing workloads on an AI supercomputer in the cloud

Published in CodeFlare · Mar 15, 2023

Authors: Abhishek Malvankar, Alaa Youssef, Asser Tantawi, Olivier Tardieu, and Carlos Costa

Introduction

To cope with the massive compute requirements for training foundation models, IBM has built its first cloud-native supercomputer, Vela. Housed within IBM Cloud, it is currently being used mainly by the IBM Research community. Vela is built using a cloud-native software stack and runs on the Red Hat OpenShift platform, leveraging Kubernetes capabilities for container orchestration.

Fragmentation is a phenomenon where one or more (possibly expensive) resources are used inefficiently across the nodes of a cluster. AI workloads typically need expensive accelerators such as GPUs to run their computation, and wasting such resources leads to low cluster utilization and high costs.

In this article, we describe a resource fragmentation issue we faced and explain the Kubernetes scheduler configuration applied to Vela, a large-scale, GPU-based OpenShift cluster, to drive up utilization. By packing workloads along the GPU dimension, the scheduler can speed up workload execution and yield cost savings in the cloud environment.

In future articles, we will describe more advanced scheduling techniques that address such issues more efficiently.

Configuration

The bin packing problem is an optimization problem in which items of different sizes must be packed into a finite number of bins (containers), each of fixed capacity, in a way that minimizes the number of bins used.

The Kubernetes scheduler is a highly configurable system that provides various extension points to adjust scheduling behavior. The default scoring strategy for the “NodeResourcesFit” plugin is “LeastAllocated”, which spreads pods across nodes. This policy suits microservices that need reliability and high availability, since it ensures that a single node failure cannot take down every replica at once. For AI workloads, especially training workloads, “LeastAllocated” does not work well: spreading a workload instead of packing it increases latency during the communication phase of execution.
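For context, a minimal sketch of where this default lives in a scheduler configuration is shown below; the API version and resource weights are illustrative and depend on the Kubernetes release, and this is not the Vela configuration:

# Sketch of the default spreading behavior of the NodeResourcesFit plugin
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: LeastAllocated        # favors the emptiest nodes, i.e., spreads pods
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1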

Consider the case below, where the spread policy causes underutilization of nodes. The cluster has 2 nodes with 8 GPUs each, and 4 GPUs are already in use on each node. If the next job requests 8 GPUs on a single node, neither node has 8 free GPUs available due to fragmentation, as shown below:

The 8-GPU workload remains pending: the cluster has 8 free GPUs in aggregate, but no single node has 8 free GPUs

The Kubernetes scheduler provides several scoring functions. Scoring is configurable: users can define the score assigned to different utilization rates and thereby shape the scheduler's placement decisions.

Cluster admins on Vela updated the scheduler ConfigMap as shown below:

Figure 1: ConfigMap configuration
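Since Figure 1 is shown as an image, the sketch below illustrates what such a ConfigMap might contain. The API version, data key, resource names, and weights are assumptions for illustration, not the exact values used on Vela:

# Illustrative ConfigMap wrapping a KubeSchedulerConfiguration that packs on the GPU dimension
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-config
  namespace: scheduler-plugins
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: scheduler-plugins-scheduler
      pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: RequestedToCapacityRatio
            requestedToCapacityRatio:
              shape:                      # linear scoring: 0% utilized -> 0, 100% utilized -> 10
              - utilization: 0
                score: 0
              - utilization: 100
                score: 10
            resources:
            - name: nvidia.com/gpu        # weight the GPU dimension most heavily
              weight: 5
            - name: cpu
              weight: 1
            - name: memory
              weight: 1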

With this policy, the scheduler packs along the GPU dimension using a linear scoring function: nodes with higher resource utilization receive higher scores. A utilization rate of 0 yields a score of 0, and a utilization rate of 100 yields a score of 10. The policy is plotted below:

Figure 2: Packing on GPU dimension

Deployment

  • Back up the existing ConfigMap using the command:
oc get configmap scheduler-config -n scheduler-plugins -o yaml > prev-cm-coschd.yaml
  • Edit/patch the scheduler config with the configuration shown in Figure 1 and apply it using the command:
oc apply -f scheduler-configmap-packing.yaml
  • Verify that the updates were applied to the “.data” section of the ConfigMap using the command:
oc get configmap scheduler-config -n scheduler-plugins -o yaml
  • Scale the co-scheduler deployment down to zero using the command:
oc scale deployment scheduler-plugins-scheduler -n scheduler-plugins --replicas=0
  • After the scale-down completes, scale the scheduler deployment back up using the command:
oc scale deployment scheduler-plugins-scheduler -n scheduler-plugins --replicas=1
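Optionally, the restart can be verified with a generic check; the namespace and deployment name below are the same ones used in the steps above:

# Confirm the scheduler pod has been recreated and is Running
oc get pods -n scheduler-plugins
# Inspect the scheduler logs to confirm it started with the updated configuration
oc logs deployment/scheduler-plugins-scheduler -n scheduler-plugins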

Testing

The Scheduler Plugins repository provides various extensions to the default scheduler. We use the co-scheduler from that repository to spawn multiple pods as a group, called a “PodGroup”. This testing can also be done with the default scheduler, but we find the co-scheduler plugin more suitable for AI workloads because it provides gang scheduling, i.e., the group of pods tied to a workload is scheduled together.

Below are the steps executed in the test environment to validate packing:

  • Create a PodGroup named packing-test-1-pg with minMember set to 2, and a batch/v1 Job that spawns 3 pods (completions == parallelism == 3); a sketch of such a manifest appears after this list
  • Submit the first test YAML using the command:
oc apply -f packing-test-1.yaml
Figure 3: packing-test-1.yaml
  • Submit the second job using the command:
oc apply -f packing-test-2.yaml
Figure 4: packing-test-2.yaml
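Since Figures 3 and 4 are shown as images, the sketch below illustrates what a manifest like packing-test-1.yaml might look like, assuming the PodGroup CRD from the Scheduler Plugins repository. The container image, GPU count, and pod-group label key are illustrative assumptions (the label key differs between scheduler-plugins versions), not the exact test manifests:

# Illustrative PodGroup plus a GPU Job scheduled by the co-scheduler
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: packing-test-1-pg
spec:
  minMember: 2                       # gang size: run only when at least 2 pods can be placed
---
apiVersion: batch/v1
kind: Job
metadata:
  name: packing-test-1
spec:
  completions: 3
  parallelism: 3
  template:
    metadata:
      labels:
        scheduling.x-k8s.io/pod-group: packing-test-1-pg   # ties the pods to the PodGroup
    spec:
      schedulerName: scheduler-plugins-scheduler           # use the co-scheduler, not the default
      restartPolicy: Never
      containers:
      - name: worker
        image: nvidia/cuda:11.8.0-base-ubi8                # placeholder image
        command: ["sleep", "300"]
        resources:
          limits:
            nvidia.com/gpu: 1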

Bin packing causes both jobs to land on the same node:

Both test jobs on the same node, named ip-10-0-134-124
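Placement can be verified with the wide pod listing in the namespace where the test jobs were submitted; the NODE column should show the same node for the pods of both jobs:

# List the test pods along with the node each one was scheduled onto
oc get pods -o wide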
