Taming EKS for Machine Learning on a Budget

Taylor Sheneman
Published in Cortico
Apr 23, 2020

Autoscaling-from-0 GPU Spot Instance node groups on Amazon’s Elastic Kubernetes Service, using CloudFormation templates

At Cortico, we’ve maintained two separate computing infrastructures: a Kubernetes cluster on Amazon EKS for our production application and associated services, and a combination of cloud and on-site dedicated GPU machines for machine learning model training and other large, parallelizable batch jobs. This caused some integration headaches whenever we wanted to put any ML models into production, though, and we usually ended up with significantly slower results on the k8s cluster than on the GPU machines to boot. Plus, we’re a nonprofit, and dedicated GPU machines are expensive.

Ideally, we would like to be able to submit CUDA jobs to the cluster from the start, simply requesting GPU resources from the cluster like we would CPU or memory, while minimizing cost as much as possible. In this post, we’ll cover how we added autoscaling-from-0 GPU-enabled EC2 Spot Instances to our existing EKS cluster, and distill some of the lessons learned in the process, both about the options we chose and those we did not.

Anatomy of an EKS cluster

While it is possible to manage a Kubernetes cluster on Amazon Web Services directly (often by using a framework like Kops), Amazon offers a managed Kubernetes solution in the form of Elastic Kubernetes Service (EKS). An EKS cluster’s master node and control logic are managed by AWS, so we trade fine-grained control over (and visibility into) the control plane for guaranteed uptime and less maintenance hassle. Beyond the control plane, however, the rest of the cluster is much more ad hoc, just as it is with unmanaged options: EKS is simply the orchestration layer for hardware and software functionality distributed over a range of other AWS services.

An EKS cluster’s master node controls worker nodes in the form of Elastic Compute Cloud (EC2) instances in one or more node groups (EC2 Auto Scaling Groups) running the Kubelet node agent software. All of this exists within a dedicated Virtual Private Cloud (VPC) to isolate network interfaces and ensure all nodes can communicate with the master (as specified by network traffic rules defined in EC2 Security Groups), and each component’s access to AWS functions is managed through Identity and Access Management (IAM) policies and roles. All of these pieces have traditionally been managed through Amazon CloudFormation stacks, which package together an arbitrary collection of AWS components for simultaneous deployment, updates, and deletion.

These days, however, Amazon’s recommended way to create and manage EKS clusters is through eksctl, a third-party tool that has been adopted as a standard interface. It offers some nice CLI functionality, and abstracts away a lot of the initial complexity of setting up all the moving parts. Under the hood, though, it’s really just a cleaner interface into CloudFormation stacks: eksctl create cluster will generate and execute a CloudFormation template that provisions a VPC, an EKS master node, an EC2 Auto Scaling Group (ASG) with some default number and type of instances, security groups for the master and worker nodes, and a worker node IAM role with a minimal set of permissions. Any or all of these default behaviors can be overridden and configured through command line flags or yaml configuration files. Once that cluster exists, it can be monitored, modified, updated and deleted through similar eksctl interface options.
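For orientation, a minimal eksctl cluster config looks something like the sketch below. The cluster name, region, and nodegroup settings here are placeholders, not values from our setup:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: example-cluster        # placeholder name
  region: us-east-1            # placeholder region

nodeGroups:
  - name: default-workers
    instanceType: m5.large     # placeholder instance type
    desiredCapacity: 2

A file like this is passed to eksctl create cluster with the -f flag, and eksctl translates it into the corresponding CloudFormation stacks.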

Unfortunately, clusters that were created through older means (the console and manually deployed CloudFormation stacks) cannot be managed with eksctl, and our cluster was provisioned in the early days of EKS. However, since we understand what eksctl is doing, we can replicate it in CloudFormation directly. Instead of rebuilding our cluster, we decided to write the new nodegroups as a bare CloudFormation stack.

Mixed-instance Spot nodegroups

CloudFormation templates can be pretty daunting, but we can start with Amazon’s sample templates. Following their guide to EKS with the console will get you pretty close, though I found the sample CloudFormation template in their Spot Instance EKS workshop to be closer to what we wanted (here’s the download link if you want to follow along). Spot Instances are easier on the budget, while having a mix of instance types helps to mitigate some of their uptime and resilience problems.

Most of our modifications are in the AutoScalingGroup definition:
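(The full template is long, so what follows is a trimmed sketch of the shape of that resource rather than a copy of our stack; the parameter and resource names follow the EKS Workshop sample and may differ from yours.)

  NodeGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      DesiredCapacity: !Ref NodeAutoScalingGroupDesiredCapacity   # 0: we only want nodes when a job needs them
      MinSize: !Ref NodeAutoScalingGroupMinSize                   # 0
      MaxSize: !Ref NodeAutoScalingGroupMaxSize
      VPCZoneIdentifier: !Ref Subnets                             # a single AZ, for Cluster Autoscaler's sake
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandBaseCapacity: 0                                 # no always-on On-Demand instances
          OnDemandPercentageAboveBaseCapacity: 0                  # scale-ups are 100% Spot
          SpotAllocationStrategy: lowest-price
          SpotInstancePools: 3                                    # one pool per instance type below
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref NodeLaunchTemplate             # defined later in the stack
            Version: !GetAtt NodeLaunchTemplate.LatestVersionNumber
          Overrides:
            - InstanceType: g3s.xlarge
            - InstanceType: g3.4xlarge
            - InstanceType: p2.xlarge
      Tags:
        - Key: !Sub "kubernetes.io/cluster/${ClusterName}"
          Value: owned
          PropagateAtLaunch: true
        - Key: k8s.io/cluster-autoscaler/enabled
          Value: "true"
          PropagateAtLaunch: true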

The first lines of our Auto Scaling Group definition are reasonably self-explanatory: the group will attempt to maintain DesiredCapacity (zero) running instances in the absence of other instructions, scaling down to MinSize or up to MaxSize as required by any scale-up or scale-down conditions specified in the ASG by the user, or in our case, by the Cluster Autoscaler tool (more on that later). All of these variables except the NodeLaunchTemplate are defined in the parameters section near the beginning of the file (mostly unchanged from the EKS Workshop sample). I’ve noted in comments which parameter values were especially important for this specification, some of which were changed from the sample defaults.

In an ASG resource definition, one of three options must be specified: LaunchConfiguration, LaunchTemplate, or MixedInstancesPolicy. In order to run a Spot-instance based ASG, we need to use the MixedInstancesPolicy, which allows us to specify the base number of On-Demand instances we want (none) and the percentage of additional nodes we want to be On-Demand when a scale-up occurs (also none).

We’ll instead be using 100% Spot Instances, which are organized into “pools”, of which there is one per instance type, per availability zone, for a total of (InstanceTypes * AZs) pools. Generally, it’s recommended to have a large number of available pools to increase the overall resilience and uptime of Spot instance groups, but it depends on the use case (constant uptime isn’t a huge priority for ours). As Cluster Autoscaler doesn’t support nodegroups that span multiple availability zones, the number of available pools for each of our single-AZ nodegroups is equal to the number of instance types listed.

The allocation algorithm will choose SpotInstancePools pools from these options, picking the best ones according to the allocation strategy: “lowest-price” chooses the cheapest instance pools (naively — we’ll revisit this in the next section), while “capacity-optimized” would choose the instance pools in which Amazon currently has the most spare capacity, which helps to minimize interruptions to running Spot instances from AWS reclaiming needed capacity. Unfortunately, at the time of writing, only “lowest-price” is supported by ASGs. MaxPrice simply prevents instances from launching if their price exceeds this value — you always pay the market price for Spot instances regardless, and since the update to Spot pricing in 2018, these prices are much more stable over time, so MaxPrice isn’t as critical to specify.

Somewhat confusingly, while MixedInstancesPolicy is an alternative to specifying a LaunchTemplate, it also requires a Launch Template as a parameter, which is defined later in the sample CloudFormation stack (we didn’t make any notable changes). We specify the name and version number by reference to that definition. Both LaunchConfiguration specifications and LaunchTemplate specifications are defined with a single default InstanceType parameter, but Launch Templates come with a separate option at the point of reference called Overrides, where you can override the default instance type with a list of instance types. Here, we specify g3s.xlarge, g3.4xlarge, and p2.xlarge, three relatively inexpensive single-GPU instance types. We’ll discuss the rationale below.

GPU instances: EC2 options and limitations

AWS offers a few options for GPU-enabled instance types. These are the g- and p- series instances: g2, g3, and g4 types are generalized GPU instance types with additional optimizations for graphics-intensive applications like gaming, while p2 and p3 instances are optimized for CUDA and machine learning applications. They are certainly not limited to their respective intended purposes, though, and we’ve chosen a mixture of instances from the second-latest generations of each series (g3 and p2) for their balance of price, performance and long-term support. We can run the EKS-optimized AMI with GPU support (corresponding to our Kubernetes version and region) on any of these instance types, which comes with Nvidia drivers, nvidia-docker support, and the Nvidia runtime active by default (this will become important later on). Cost per GPU-hour generally increases with the newer generations, though this should be balanced against the newer instance types’ faster time-to-run on finite jobs when estimating total cost. Cost, moreover, is a significant factor when considering GPU instances in general, as they all come with a hefty markup per instance-hour over comparable CPU-only machines.

We’re using only small, cheap instance types, despite the fact that larger instances may be cheaper per GPU-hour, depending on fluctuations. Why? Because of the limitations of the lowest-price allocation strategy and standard Spot ASGs. Resource-based allocation is a feature of Spot Fleet — you specify weights for instance types based on some resource capacity you care about (e.g. CPU cores, memory, GPUs), and the allocation logic will find some combination of instance types from a list of allowed types that satisfies your conditions as cheaply as possible. You could also specify granular weights based on other metrics — for example, if you know that your workload tends to run 25% faster on a g3.4xlarge than a p2.xlarge, you could specify that the g3.4xlarge is 25% more valuable to you per hour.

However, these options are not currently supported in Spot ASGs (which are a separate mechanism from Spot Fleet), so until further notice we’ll need to use the naive allocation algorithm. This means lowest-price allocation will choose the instance pool with the lowest price per instance-hour, no matter the size of the instance. (For example: a 1-GPU instance costs 25¢ per hour, a 4-GPU instance costs 30¢ per hour, and you need 4 GPUs. The allocation algorithm chooses the 1-GPU pool and provisions four 1-GPU instances, because 25¢ is less than 30¢, leaving you paying $1.00 per hour instead of 30¢ for the same four GPUs.) Given these conditions, we chose three small instance types of similar price and functionality, in lieu of more granular allocation strategies.

GPU instances: interface with Kubernetes

While we now have machines with GPU hardware, this functionality is by default not visible to Kubernetes. To make GPU resources discoverable and schedulable by the cluster, we need an additional component: the nvidia-device-plugin daemonset. This will make resources of type nvidia.com/gpu visible to k8s, so they can be scheduled with requests/limits like CPU and memory. This is visible in the output of a kubectl describe call on a running GPU node, as seen below:

Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         4
  ephemeral-storage:           104845292Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      31390692Ki
  nvidia.com/gpu:              1
  pods:                        58
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         3920m
  ephemeral-storage:           95551679124
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      27685860Ki
  nvidia.com/gpu:              1
  pods:                        58
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests    Limits
  --------                    --------    ------
  cpu                         110m (2%)   0 (0%)
  memory                      0 (0%)      0 (0%)
  ephemeral-storage           0 (0%)      0 (0%)
  attachable-volumes-aws-ebs  0           0
  nvidia.com/gpu              0           0

For the daemonset pods to work correctly, the machine must have the Nvidia runtime enabled, which (as noted earlier) the GPU AMI does by default. We deploy the daemonset to the cluster with a nodeSelector condition that ensures it will only run on these GPU instance types, to not waste computing resources on normal nodes.

Finally, we use Kubernetes taints and tolerations to ensure that no normal jobs are scheduled onto the ephemeral and more expensive new nodes. We use the hard-limiting NoSchedule option for our nvidia.com/gpu taint, to entirely prevent non-GPU workloads from running on the GPU nodes, and the soft-limiting PreferNoSchedule option for the spotInstance taint, so otherwise-appropriate workloads will only be scheduled onto them if no other options exist (this isn't relevant unless we later decide to, for example, purchase always-on GPU reserved instances). Taints, like Kubernetes labels, only exist on running nodes, so there's a bit more to do to get this working properly when scaling from 0, as discussed in more depth in the “tags, labels, and scaling from 0" section below.
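Putting the last two paragraphs together, the device plugin spec we deploy looks roughly like the sketch below. The nodeSelector and tolerations entries are the additions on top of the stock manifest, keyed to the labels and taints our GPU nodes get at boot (covered in the scaling-from-0 section); the image tag is just an example and should match your cluster version.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"          # only run on nodes labeled as GPU nodes
      tolerations:
        - key: nvidia.com/gpu           # tolerate the GPU taint described above
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvidia/k8s-device-plugin:1.0.0-beta4   # example tag; match your cluster version
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins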

Note for kube2iam users: make sure the kube2iam daemonset tolerates the nvidia.com/gpu taint on the new gpu nodes! If you want kube2iam pods to run on any and all nodes, you can do this with a wildcard toleration:

tolerations:
- operator: "Exists"

This goes under spec.template.spec in the kube2iam config. It will tolerate the nvidia.com/gpu taint, along with all other taints.

Cluster Autoscaler: permissions

The Kubernetes Cluster Autoscaler is the de-facto standard tool for horizontal autoscaling of nodegroups in response to load. It is not installed by default on the master node, however, instead usually running as a regular Deployment resource on the worker nodes. (In our case, this is the only option, as we do not have direct access to the EKS master.) As it performs cluster-level operations from a worker node, though, there is a bit of setup required to get everything working smoothly.

The simplest workflow to get CA working on EKS involves adding the permissions required for CA to the worker node IAM role. This works on most clusters, since by default, pods running on a node inherit the node’s permission set. However, this isn’t a particularly clean or secure approach, as you’ve now added more (and usually unnecessary) permissions to any running pod on your cluster. A better approach is to set up kube2iam to give each pod only the specific IAM permissions it needs to do its job. As our cluster was already configured with this tool, the only additional steps needed were to create an IAM role (eks-cluster-autoscaler) with the necessary permissions, and to add the annotation iam.amazonaws.com/role:eks-cluster-autoscaler to the Cluster Autoscaler deployment spec under spec.template.metadata.annotations.
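Expressed as a CloudFormation resource, that role looks roughly like the sketch below. The permissions are the standard set from the Cluster Autoscaler documentation; the trust policy follows kube2iam’s usual pattern of letting the worker node role assume the target role, with NodeInstanceRole standing in for whatever your worker node role is called.

  ClusterAutoscalerRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: eks-cluster-autoscaler
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS: !GetAtt NodeInstanceRole.Arn   # kube2iam assumes this role on behalf of pods on the worker nodes
            Action: sts:AssumeRole
      Policies:
        - PolicyName: cluster-autoscaler
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - autoscaling:DescribeAutoScalingGroups
                  - autoscaling:DescribeAutoScalingInstances
                  - autoscaling:DescribeLaunchConfigurations
                  - autoscaling:DescribeTags
                  - autoscaling:SetDesiredCapacity
                  - autoscaling:TerminateInstanceInAutoScalingGroup
                  - ec2:DescribeLaunchTemplateVersions
                Resource: "*"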

Cluster Autoscaler: tags, labels, and scaling from 0

In addition to permissions, CA needs to know which nodegroups it is responsible for monitoring and what types of resources they have. This is achieved through a combination of EC2 tags and Kubernetes labels. EC2 tags apply both to the ASG as a whole and to each individual instance in that ASG (with the PropagateAtLaunch property set to true). For CA to recognize a nodegroup as under its control, the ASG must be tagged with the kubernetes.io/cluster/${ClusterName} key set to owned and the k8s.io/cluster-autoscaler/enabled key set to true. To make the node properties visible to Kubernetes, we need the Kubernetes labels lifecycle=Ec2Spot, nvidia.com/gpu=true, and k8s.amazonaws.com/accelerator=nvidia-tesla*. These are set via the arguments to the AMI's bootstrap.sh Kubelet startup script, which is called in the UserData section of the NodeLaunchTemplate:
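(Again, a sketch rather than our exact template: the UserData in the launch template boils down to a bootstrap.sh call with the labels and taints above passed as kubelet arguments. The accelerator label's exact value depends on your GPU type and cluster version; see the footnote below.)

      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          set -o xtrace
          # Register the node with the labels and taints described above
          # (cfn-signal and other boilerplate from the sample omitted)
          /etc/eks/bootstrap.sh ${ClusterName} \
            --kubelet-extra-args '--node-labels=lifecycle=Ec2Spot,nvidia.com/gpu=true,k8s.amazonaws.com/accelerator=nvidia-tesla --register-with-taints=nvidia.com/gpu=true:NoSchedule,spotInstance=true:PreferNoSchedule'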

Unfortunately, Kubernetes labels only exist on running nodes, and if we want CA to correctly scale from zero, it also needs to be aware of some of these properties while no nodes exist. To this end, we duplicate the nvidia.com/gpu=true label in the EC2 tags, in the form of k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu set to true, along with the GPU taint duplicated similarly (k8s.io/cluster-autoscaler/node-template/taint/dedicated set to nvidia.com/gpu=true). With these tags, Cluster Autoscaler has all the information it needs to schedule GPU jobs appropriately, even when no GPU nodes currently exist.
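In the template, those additions are just two more entries in the ASG’s tag list, alongside the ones shown in the nodegroup sketch earlier:

      Tags:
        # appended to the Tags list of the Auto Scaling Group resource
        - Key: k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu
          Value: "true"
          PropagateAtLaunch: true
        - Key: k8s.io/cluster-autoscaler/node-template/taint/dedicated
          Value: nvidia.com/gpu=true
          PropagateAtLaunch: true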

*This varies depending on the version of your Kubernetes cluster (and the corresponding Cluster Autoscaler version): see the CA documentation

Deploying onto the GPU Spot nodes

With all these pieces in place, we can now schedule Kubernetes resources onto GPU nodes. A resource definition should select the nvidia.com/gpu label and tolerate the nvidia.com/gpu taint (and optionally request some specific number of GPUs for better scheduling), like in this example Deployment:
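(The spec below is a sketch rather than one of our real workloads; the names and image are placeholders.)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-example
  template:
    metadata:
      labels:
        app: gpu-example
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"           # schedule only onto GPU-labeled nodes
      tolerations:
        - key: nvidia.com/gpu            # tolerate the GPU taint
          operator: Exists
          effect: NoSchedule
      containers:
        - name: cuda-container
          image: nvidia/cuda:10.2-base   # placeholder image
          # placeholder workload: print GPU info, then idle
          command: ["sh", "-c", "nvidia-smi && sleep infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1          # request one GPU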

And…that’s all! Ideally, everything should work as follows:

  1. Kubernetes spec submitted, selecting nvidia.com/gpu labeled nodes (and/or requesting nvidia.com/gpu resources)
  2. Cluster Autoscaler detects unschedulable pods and requests a scale-up of an appropriate ASG
  3. EC2 spins up additional instance(s), which load Kubelet software and join the cluster
  4. The nvidia-device-plugin daemonset schedules a pod onto the node to expose the available GPUs
  5. Pods from the submission will be scheduled onto the new nodes, run to completion (or until deletion), then free up their resources
  6. Cluster Autoscaler detects the unused nodes, and after 10 minutes, requests a scale-down
  7. EC2 detects the scale-down and shuts down the instances.

Good luck!
