Abhishek Malvankar · Published in CodeFlare · May 1, 2023 · 4 min read

InstaScale: Aggregate node scaler for guaranteed execution of AI workloads

Authors: Abhishek Malvankar, Alaa Youssef, Diana Arroyo, and Olivier Tardieu

Introduction

Kubernetes continues to evolve into a powerhouse for training (batch) workloads. Given the significant resource demands of workloads such as foundation models, many users turn to the cloud to train these resource-hungry models. Resource (node/pod) scaling becomes critical in the cloud, where workloads acquire (many) resources when needed and release them as soon as the workload completes.

Foundation model workloads typically need a group of nodes on which framework clusters such as Spark, Ray, and PyTorch are spawned. Such framework cluster workloads can make progress only if all of the pods' resources (i.e., resources for the head node and N worker nodes) are available at once; let's call this acquiring resources as a gang (all-at-once).
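To make the gang (all-at-once) requirement concrete, here is a minimal sketch of an all-or-nothing admission check. This is illustrative only and not InstaScale's actual code; the function name and CPU-only accounting are assumptions for the example.

```python
# Illustrative sketch (not InstaScale's actual code) of a gang
# (all-at-once) admission check: the framework cluster is admitted
# only if the head node AND all N workers fit in free capacity.

def can_admit_gang(free_cpus, head_cpus, worker_cpus, num_workers):
    """Admit the whole gang or nothing: no partial framework clusters."""
    total_demand = head_cpus + worker_cpus * num_workers
    return total_demand <= free_cpus

# With 10 free CPUs, a head (2 CPUs) plus 2 workers (4 CPUs each) fits,
# but a head plus 3 workers does not, so nothing is admitted for it.
print(can_admit_gang(10, 2, 4, 2))   # True: 2 + 8 <= 10
print(can_admit_gang(10, 2, 4, 3))   # False: 2 + 12 > 10
```

The key property is that the check is over the aggregate demand, never over individual pods.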

When there is not enough capacity to run a framework cluster, additional resources (nodes) need to be acquired in the same gang (all-at-once) fashion. This functionality is needed to avoid starting partial jobs, i.e., framework clusters with only worker nodes or only a head node, which waste resources and drive up cloud costs.

In this blog, we introduce a new controller called InstaScale, which works with our no-code/low-code queuing system, the Multi-Cluster App Dispatcher (MCAD), to acquire aggregated resources for the target framework cluster (user job).

Problem

Training large AI models such as foundation models is experiment-driven: users submit multiple jobs to train a model and finally adapt it to several downstream tasks. Infrastructure engineers or platform teams provide a shared environment with bursting capability in the cloud, where multiple users are onboarded and compete to run such experiments.

Assuming the shared environment is Kubernetes, it operates on pods: Kubernetes drives a workload to its desired state only once its pods are spawned. A similar mechanism is used for scaling the cluster: the cluster autoscaler adds nodes when pods are in a pending state, and the Kubernetes scheduler places those pending pods on the newly added nodes. Competing users' workloads can start partially in such a scenario; let us understand this from the example below.

Partial framework clusters (Jobs) launched for competing users

User #1 (blue) submits a workload with 1 head node and 1 worker node, and user #2 (green) submits another workload that also has 1 head node and 1 worker node. Initially, all 4 pods remain pending, which causes the cluster autoscaler to add nodes sequentially. As nodes become active/available in the Kubernetes cluster, the scheduler places pods on each of them. The result is that both workloads have individual pods running, but neither can make progress, which leads to resource hogging and wastage.
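The deadlock above can be reproduced with a tiny simulation. This is a hypothetical sketch of per-pod scheduling, not real Kubernetes behavior in code; the interleaved binding order is an assumption chosen to match the scenario in the figure.

```python
# Hypothetical simulation of the scenario above (illustrative only):
# two users each need a head + worker gang, but the autoscaler adds
# nodes one at a time and the scheduler binds pods individually, so
# each user ends up with only part of their framework cluster.

from collections import defaultdict

# Pending pods, in the interleaved order the scheduler happens to pick.
pending = [("user1", "head"), ("user2", "head"),
           ("user1", "worker"), ("user2", "worker")]

placed = defaultdict(list)
for _ in range(2):                  # only 2 nodes become available
    user, role = pending.pop(0)     # bind the next pending pod
    placed[user].append(role)

# Each user now holds a head pod but no worker: neither job can start,
# yet both occupy a node.
for user, roles in placed.items():
    print(user, roles)
```

Neither gang is complete, so both nodes are paid for while doing no useful work.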

Solution

InstaScale: Aggregate node scaler for guaranteed execution of AI workloads

InstaScale is a controller under project-codeflare that addresses the issue of partial job launch by bringing in the aggregated resources needed for the entire target job on OpenShift. It provides a guarantee that the target user job can run. The acquired resources are tainted and labeled for the target workload by the InstaScale controller, which allows us to hint the scheduler to place the desired user workload's pods on them.

For the same scenario of two users (blue and green) described in the Problem section, with InstaScale deployed you will observe the behavior below:

InstaScale avoids partial framework clusters (jobs) submitted by competing users

All pods belonging to user #1 (blue) are allocated, while user #2's pods are not created since no resources are available. It was possible to schedule user #1's pods on the acquired resources with the help of labels and taints. User #2's pods will be created as soon as user #1's pods are removed or run to completion. Scaling is performed within quota limits set at the cluster level. Thus, InstaScale avoids resource wastage when both users compete for resources, reduces cloud costs, and scales within cluster quota limits.
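The taint-and-label steering can be sketched as a simple predicate: a pod lands on a reserved node only if it tolerates the node's taints and selects the node's labels. This is a simplified model of Kubernetes semantics for illustration; the taint string format and the `workload=job-blue` key/value are hypothetical.

```python
# Sketch (assumed, simplified semantics) of InstaScale-style node
# reservation: newly acquired nodes are tainted and labeled for one
# target workload, so only that workload's pods can land on them.

def schedulable(pod, node):
    # The pod must tolerate every taint on the node...
    tolerated = all(t in pod["tolerations"] for t in node["taints"])
    # ...and the node must carry every label the pod selects.
    selected = all(node["labels"].get(k) == v
                   for k, v in pod["node_selector"].items())
    return tolerated and selected

# A node acquired for the blue job (hypothetical taint/label values).
node = {"taints": ["workload=job-blue:NoSchedule"],
        "labels": {"workload": "job-blue"}}

blue_pod = {"tolerations": ["workload=job-blue:NoSchedule"],
            "node_selector": {"workload": "job-blue"}}
green_pod = {"tolerations": [], "node_selector": {}}

print(schedulable(blue_pod, node))    # blue pod lands on its node
print(schedulable(green_pod, node))   # green pod is repelled by the taint
```

The taint keeps unrelated pods off the acquired nodes, while the label lets the target workload's pods opt in, so the whole gang lands together.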

Below are some of the key features:

  • Proactive scaling before pending pods are created.
  • Scaling for the entire job at once.
  • Reuse of acquired resources for the next queued job.
  • Aggregated scale-down (down to zero) of resources when no job needs the previously acquired resources.
  • Quota limits per cluster for node scaling.
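Two of these features — all-at-once scale-up bounded by a cluster quota, and aggregate scale-down to zero — can be sketched as decision functions. The function names and the node-count-only accounting are assumptions for illustration, not InstaScale's API.

```python
# Hedged sketch (illustrative names, not InstaScale's API) of
# quota-bounded aggregate scaling decisions.

def scale_up(current_nodes, requested_nodes, quota):
    """Acquire ALL requested nodes at once, or none, within the quota."""
    if current_nodes + requested_nodes <= quota:
        return current_nodes + requested_nodes
    return current_nodes          # gang not grantable: do nothing

def scale_down(current_nodes, nodes_still_needed):
    """Release every node no queued job needs (down to zero)."""
    return nodes_still_needed

print(scale_up(2, 3, 10))   # 5: whole gang fits under the quota
print(scale_up(8, 3, 10))   # 8: would exceed quota, nothing is added
print(scale_down(5, 0))     # 0: no queued job, scale down to zero
```

The all-or-nothing branch in `scale_up` mirrors the gang acquisition described earlier, just at the node level instead of the pod level.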

Summary

We examined some limitations of the existing scaling mechanism and introduced InstaScale, a new controller that works with OpenShift to acquire aggregated nodes for guaranteed workload execution and to avoid the resource wastage that causes high cloud costs. Please reach out to us with any questions on the InstaScale GitHub repository.
