Hybrid Cloud Setup for Deep Learning on Amazon EKS

GPU Spot Instances on an Amazon Elastic Kubernetes Service Cluster with Scale-from-Zero Autoscaling

Jia Yi Chan
ViTrox-Publication
14 min read · Jan 6, 2022


Introduction

Currently, there is a K8s cluster with 16 GPUs on hand. However, intensive training tasks are exhausting the available GPUs and stalling training progress. There is a need to spill the training workload over into external clusters to keep producing accurate models at pace. The straightforward answer is to deploy on cloud service providers such as AWS, Azure and GCP. However, cost and security concerns immediately surface as major obstacles to cloud deployment.

In this document, I will attempt to solve the model-training overload problem using cloud servers while mitigating the cost and security concerns.

Pre-requisite

Before we proceed, please make sure that you have installed kubectl, the AWS CLI and eksctl. Also, please make sure that your AWS IAM user has sufficient permissions to work with EKS.

eksctl

eksctl is a simple CLI tool that creates and manages clusters on EKS - Amazon's managed Kubernetes service for EC2.

AWS CLI

AWS CLI is a unified tool to manage your AWS services. With just one tool, you can control multiple AWS services from a command line and automate them with scripts.

Auto Scaling Group (ASG)

An Auto Scaling group contains a collection of Amazon EC2 instances that are treated as a logical grouping for the purposes of automatic scaling and management. An Auto Scaling group also enables you to use Amazon EC2 Auto Scaling features such as health check replacements and scaling policies. Maintaining the number of instances in an Auto Scaling group and automatic scaling are the core functionalities of the Amazon EC2 Auto Scaling service.

Spot Instance

A Spot Instance is an instance that uses spare EC2 capacity that is available for less than the On-Demand price (AWS advertises discounts of up to 90%). Because Spot Instances enable you to request unused EC2 instances at steep discounts, you can lower your Amazon EC2 costs significantly. The hourly price for a Spot Instance is called the Spot price. The Spot price of each instance type in each Availability Zone is set by Amazon EC2 and is adjusted gradually based on the long-term supply of and demand for Spot Instances. Your Spot Instance runs whenever capacity is available and the maximum price per hour for your request exceeds the Spot price.

IAM Policies

In order to use Amazon EKS, you need an account with access to several security permissions. Security permissions can be set with AWS Identity and Access Management (IAM). The following are the required policies:

  • EKSFullAccess
  • AWSCloudFormationFullAccess
  • AmazonEC2FullAccess
  • IAMFullAccess
  • AmazonEC2ContainerRegistryReadOnly
  • AmazonEKS_CNI_Policy
  • AmazonS3FullAccess

You can view the detailed guide here, and search for the Security Settings sections.

Working Instruction

Step I: Set Up AWS CLI

AWS CLI version 2

After installation is complete, set up the AWS CLI with the aws configure command, which writes your settings into the config and credentials files. Before starting, make sure you have already acquired the access key ID and secret access key of the IAM user. If you haven't created an IAM user, do refer here.

Below is an example of what the AWS CLI will prompt you with. You are required to substitute your own values for each field.
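A typical interaction looks like this (the key values shown are the placeholder examples from the AWS documentation; replace them with your own credentials and preferred region):

    $ aws configure
    AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
    AWS Secret Access Key [None]: wJalrXUtnFJ/K7MDENG/bPxRfiCYEXAMPLEKEY
    Default region name [None]: us-east-1
    Default output format [None]: json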

Step II: Create VPC with Public and Private Subnets

EKS normally creates a VPC automatically during cluster creation. If you wish to have more control over the cluster resources, it is recommended to provision your VPC manually. For more information on VPC creation, you can refer here. Generally, you would be required to choose one of the following VPC architectures:

  • Public subnets only, or
  • Private subnets only, or
  • Public and private subnets

Amazon EKS recommends running the cluster in a VPC with public and private subnets so that Kubernetes can create public load balancers in the public subnets that load-balance traffic to pods running on nodes in the private subnets. This configuration is not required, however; you can run a cluster in a VPC with only private or only public subnets, depending on your networking and security requirements.

Note: In this configuration, worker nodes are instantiated in the private subnets and NAT Gateways are instantiated in the public subnets. If you prefer building stacks for private-only or public-only subnets, do check out the other available templates here.

VPCs can be created manually using two methods: AWS Console or AWS CLI.

Option I: Using AWS Console:

  1. Open the AWS CloudFormation console.
  2. Select Create Stack, With new resources (standard).
  3. Under Prepare template, make sure that Template is ready is selected and then under Template source, select Amazon S3 URL. Paste the following URL into the text area: https://amazon-eks.s3.us-west-2.amazonaws.com/cloudformation/2020-10-29/amazon-eks-vpc-private-subnets.yaml
  4. Select Next, specify the parameters accordingly on the Specify Details page. Then choose Next.
  5. (Optional) Tag the stack resources. Click Next, then Create Stack.
  6. Your VPC Stacks will be created within 10 minutes. You can view it in the VPC console.

Option II: Using AWS CLI:

  1. If you prefer to have more flexibility over the resources and network configuration, you may skip this step and proceed directly to step 2 to download and edit the configuration template. Otherwise, you can use the AWS template without modifications, which is as easy as running the following command.

Note: You may replace my-eks-vpc-stack with your own stack name.
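A minimal example, using the unmodified AWS template and the stack name above, would be:

    aws cloudformation create-stack \
      --stack-name my-eks-vpc-stack \
      --template-url https://amazon-eks.s3.us-west-2.amazonaws.com/cloudformation/2020-10-29/amazon-eks-vpc-private-subnets.yaml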

2. As an alternative, you may download https://amazon-eks.s3.us-west-2.amazonaws.com/cloudformation/2020-10-29/amazon-eks-vpc-private-subnets.yaml and edit the template. In my case, I would like to add tags to the VPC for billing purposes. Then, run the command below to create the VPC stack:
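For example, assuming the edited template has been saved locally as amazon-eks-vpc-private-subnets.yaml (the tag key and value here are only illustrative; stack-level tags are propagated to the resources the stack creates):

    aws cloudformation create-stack \
      --stack-name my-eks-vpc-stack \
      --template-body file://amazon-eks-vpc-private-subnets.yaml \
      --tags Key=project,Value=dl-training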

Sample Output

The command returns within a few seconds. The sample output will be:

Now, you’re able to view the stack at CloudFormation.

Stack information in the CloudFormation console.

The VPC Dashboard also provides access to other resources like Subnets, NAT Gateways, Route tables, and more. Please take note of the IDs for VPC and all subnets for the network configurations in the next section.

VPC Dashboard

Step III: Config YAML File for EKS Cluster

Create a YAML file with the configurations below. You're required to fill in your own VPC and subnet IDs that were created previously.

Note: This configuration creates an EKS Cluster in a VPC that contains public and private subnets. If you prefer an architecture with a private subnet only, you may refer to the configuration here.
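The full configuration is longer than what can be shown here; below is an abridged sketch of an eksctl ClusterConfig with one CPU Spot node group and one GPU Spot node group that scales from 0. The cluster name, instance types, IDs and label/taint values are placeholders I chose for illustration, the exact schema may vary slightly between eksctl versions, and the line numbers in the notes further below refer to the full original configuration.

    cat > cluster.yaml <<EOF
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig

    metadata:
      name: v3-cluster
      region: us-east-1

    vpc:
      id: "vpc-0xxxxxxxxxxxxxxxx"                 # VPC created in Step II
      subnets:
        public:
          us-east-1a: { id: "subnet-0aaaaaaaaaaaaaaaa" }
          us-east-1b: { id: "subnet-0bbbbbbbbbbbbbbbb" }
        private:
          us-east-1a: { id: "subnet-0cccccccccccccccc" }
          us-east-1b: { id: "subnet-0dddddddddddddddd" }

    nodeGroups:
      - name: spot-cpu
        desiredCapacity: 2
        minSize: 0
        maxSize: 4
        amiFamily: AmazonLinux2
        privateNetworking: true
        availabilityZones: ["us-east-1b"]
        labels:
          intent: apps
        instancesDistribution:
          instanceTypes: ["t3.micro", "t3.small"]
          onDemandPercentageAboveBaseCapacity: 0   # 100% Spot
          spotAllocationStrategy: capacity-optimized
        iam:
          withAddonPolicies:
            autoScaler: true
        tags:
          kubernetes.io/cluster/v3-cluster: "owned"
          k8s.io/cluster-autoscaler/node-template/label/intent: apps

      - name: spot-gpu
        desiredCapacity: 0                         # scales from 0
        minSize: 0
        maxSize: 4
        amiFamily: AmazonLinux2
        privateNetworking: true
        availabilityZones: ["us-east-1b"]
        labels:
          intent: apps
          nvidia.com/gpu: "true"
        taints:
          - key: nvidia.com/gpu
            value: "true"
            effect: NoSchedule
        instancesDistribution:
          instanceTypes: ["g4dn.xlarge", "g4dn.2xlarge"]
          onDemandPercentageAboveBaseCapacity: 0
          spotAllocationStrategy: capacity-optimized
        iam:
          withAddonPolicies:
            autoScaler: true
        tags:
          kubernetes.io/cluster/v3-cluster: "owned"
          k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: "true"
          k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu: "true:NoSchedule"
    EOF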

A new EKS cluster will be created with 2 Spot t3.micro instances, plus a GPU node group that starts at 0 nodes.

Here are a few points to consider:

  • Line 15–24: Both the VPC and subnet IDs that were provisioned in the previous section have to be specified.
  • Line 33–35, 61–63, 94–96: The node groups are marked so that an auto-scaler can control them.
  • Line 41, 100: spotAllocationStrategy: capacity-optimized is one of the allocation strategies that minimize the chance of Spot Instance interruptions. This strategy examines the available capacity pool for each instance type in every Availability Zone and launches the instance types with the deepest spare capacity, which means they are less exposed to price fluctuations and remain active until the spare capacity level changes. Thus, the number of interruptions to your instances and service will be minimal. Alternatively, you can use spotAllocationStrategy: capacity-optimized-prioritized to specify a priority order for your preferred instance types. Or you may use spotAllocationStrategy: lowest-price so that the lowest-priced instances are always chosen; remember that price fluctuations then affect you more often, which results in a higher rate of interruptions.
  • Line 53, 76, 110: the kubernetes.io/cluster/<name> tag is set to owned. The Cluster Autoscaler requires this tag on your Auto Scaling groups so that they can be auto-discovered during scale-up or scale-down.
  • Line 44: the intent label allows you to deploy specific applications on labelled nodes. For example, intent: apps allows you to deploy stateless applications on nodes that have been labelled apps, or you can use the label intent: control-apps to deploy control applications on nodes labelled control-apps.
  • Line 55, 88: amiFamily: AmazonLinux2 indicates the use of the EKS AMI based on Amazon Linux 2. You may specify another OS based on your preference by referring here.
  • Line 85, 119: privateNetworking: true. When placing node groups inside a private subnet, privateNetworking must be set to true on the node groups. This ensures that only private IP addresses are assigned to the EC2 instances.
  • Line 83, 117: The GPU node group has a taint that prevents cluster infra services from getting scheduled on its nodes.
  • Line 118–123: If you are using nodeSelector and taints, you need to tag the ASG with the node-template keys k8s.io/cluster-autoscaler/node-template/label/ and k8s.io/cluster-autoscaler/node-template/taint/. When scaling from 0, the node taints and labels are not visible in the Launch Configuration, hence tagging the resources enables the Cluster Autoscaler to discover the available instance types from zero. More information on ASG tags can be found here.
  • Lines 86, 120: availabilityZones: ["us-east-1b"] specifies the AZ for the node groups to run in. Running in the same zone is important to avoid cross-AZ network charges (e.g. when a Spot Instance is interrupted and a new instance is spawned in a different AZ), which can be significant. If you prefer running a node type in a different AZ, you may duplicate the whole configuration of the corresponding group type and specify another AZ for it.

Step IV: Create Cluster

To create an EKS cluster from the configuration file, execute the following command:
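Assuming the configuration was saved as cluster.yaml, as in the sketch above:

    eksctl create cluster -f cluster.yaml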

Creating a cluster takes 15–25 minutes, and you’ll see many lines of output. The last one looks like the following:

Sample Output

Step V: (Optional) Deploy kube-ops-view

Kubernetes Operational View (kube-ops-view) is a simple web page to visualize the operations of a Kubernetes cluster. During cluster auto-scaling operations, you can visually observe how scale-outs and scale-ins occur, though it is not intended as a monitoring or operations-management tool.

  1. Follow the guide to install Helm.
  2. Install kube-ops-view.
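One way to install it (the approach used in the EKS workshop at the time; the stable Helm chart repository has since been deprecated, so you may need to source the chart elsewhere):

    helm repo add stable https://charts.helm.sh/stable
    helm install kube-ops-view stable/kube-ops-view \
      --set service.type=LoadBalancer \
      --set rbac.create=True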

3. From the result below, copy the domain name shown under EXTERNAL-IP and paste it into your web browser.
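The address can be retrieved from the service created by the chart, for example:

    kubectl get svc kube-ops-view

The EXTERNAL-IP column contains the load balancer's DNS name.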

4. You can browse the URL and check the status of the cluster you have currently deployed.

Step VI: Create Auto Scaler

  1. Create an IAM OIDC provider for your cluster. Replace v3-cluster with your own cluster name.
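A sketch of the command, assuming the cluster is named v3-cluster and runs in us-east-1:

    eksctl utils associate-iam-oidc-provider \
      --cluster v3-cluster \
      --region us-east-1 \
      --approve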

A JSON Web Token (JWT) is used by the OpenID Connect (OIDC) layer to share security information between a client and a server. Support for OIDC JSON web tokens was added in Kubernetes version 1.12. Now, EKS hosts a public OIDC discovery endpoint per cluster, enabling a third party such as IAM to validate end users and receive their basic information.

2. Create IAM Policy and Role

a. Save the following contents to a file that’s named cluster-autoscaler-policy.json. If your existing node groups were created with eksctl and you used the --asg-access option, then this policy already exists and you can skip to Step 2-b.

  • { "Version": "2012-10-17", "Statement": [ { "Action": [ "autoscaling:DescribeAutoScalingGroups", "autoscaling:DescribeAutoScalingInstances", "autoscaling:DescribeLaunchConfigurations", "autoscaling:DescribeTags", "autoscaling:SetDesiredCapacity", "autoscaling:TerminateInstanceInAutoScalingGroup", "ec2:DescribeLaunchTemplateVersions" ], "Resource": "*", "Effect": "Allow" } ] }

b. Create the policy with the following command. You can change the value for policy-name.
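For example (the policy name below is the one used in the EKS documentation):

    aws iam create-policy \
      --policy-name AmazonEKSClusterAutoscalerPolicy \
      --policy-document file://cluster-autoscaler-policy.json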

Take note of the Amazon Resource Name (ARN) that’s returned in the output. You need to use it in a later step.
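In the EKS documentation, that ARN is attached to an IAM role for the cluster-autoscaler service account via the OIDC provider created earlier; a sketch of that step (the account ID and policy name are placeholders) looks like:

    eksctl create iamserviceaccount \
      --cluster=v3-cluster \
      --namespace=kube-system \
      --name=cluster-autoscaler \
      --attach-policy-arn=arn:aws:iam::<ACCOUNT_ID>:policy/AmazonEKSClusterAutoscalerPolicy \
      --override-existing-serviceaccounts \
      --approve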

3. Set Up Node Auto Scaling with the Kubernetes Cluster Autoscaler. Follow the commands below to install the Cluster Autoscaler:
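A sketch of the installation, following the EKS documentation (the deployment's container command must then be edited to reference your own cluster name and the flags discussed below):

    kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

    # Edit the cluster-autoscaler container command to point at your cluster
    # and add the two flags discussed below, e.g.:
    #   --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/v3-cluster
    #   --balance-similar-node-groups
    #   --skip-nodes-with-system-pods=false
    kubectl -n kube-system edit deployment cluster-autoscaler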

When using separate node groups per zone, the --balance-similar-node-groups flag will keep nodes balanced across zones for workloads that do not require topological scheduling (source). The --skip-nodes-with-system-pods=false flag allows the autoscaler to scale down nodes that are running kube-system pods, so that the cluster can shrink to the fewest nodes possible.

4. Then, check the existing running pods to ensure the auto-scaler has been deployed smoothly.
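For example:

    kubectl get pods -n kube-system | grep cluster-autoscaler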

Autoscaler pod is running in the cluster node.

You can also view the pods running in kube-ops-view.

View the running pods of autoscaler using Kube-Ops-View.

Step VII: Scaling GPU Nodes from 0 with Kubernetes Deployment

The code in this step references EKS GPU Cluster from Zero to Hero [1].

Now, we are going to test the scaling of GPU nodes from 0. Before starting, list the worker nodes that are currently running in the cluster.
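For example:

    kubectl get nodes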

Then, create a vector-add-dpl.yml file. The following is a sample Deployment that rolls out a ReplicaSet of 10 pod replicas, each with a single attached GPU resource (an nvidia.com/gpu: 1 limit).
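A sketch of such a manifest, borrowing the CUDA vector-add image from the Kubernetes GPU scheduling example and assuming the GPU node group labels and taints from the Step III sketch:

    cat > vector-add-dpl.yml <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cuda-vector-add
      labels:
        app: cuda-vector-add
    spec:
      replicas: 10
      selector:
        matchLabels:
          app: cuda-vector-add
      template:
        metadata:
          labels:
            app: cuda-vector-add
        spec:
          nodeSelector:
            nvidia.com/gpu: "true"          # matches the GPU node group label
          tolerations:
            - key: nvidia.com/gpu           # tolerate the GPU node group taint
              operator: Exists
              effect: NoSchedule
          containers:
            - name: cuda-vector-add
              image: k8s.gcr.io/cuda-vector-add:v0.1
              resources:
                limits:
                  nvidia.com/gpu: 1         # one GPU per pod
    EOF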

Apply the manifest file to the cluster by following the command below:
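Using the file name from above:

    kubectl apply -f vector-add-dpl.yml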

You can observe the Auto Scaler operations through the kube-ops-view:

Before scaling, the pods are waiting for a node to launch:

The pod is waiting to be placed into a GPU node.

Or, you can also check the real-time auto-scaling logs using kubectl:
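For instance, with the deployment name used by the upstream manifest:

    kubectl -n kube-system logs -f deployment/cluster-autoscaler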

The scale-up logs will be shown as below:

Instances will be scaled up within minutes. After scaling, you will see that the pod has been placed onto a node:

A GPU node launches from 0.

Pod information can be viewed through the following commands:
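For example (replace the pod name with one of yours):

    kubectl get pods -o wide
    kubectl describe pod <pod-name>
    kubectl get nodes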

Sample Output of Pod Events:

Pod Events

Sample Output of Running Nodes:

You may also test the scaling by adjusting the replicas and checking the results:
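For instance, assuming the Deployment from the sketch above is named cuda-vector-add:

    kubectl scale deployment cuda-vector-add --replicas=20
    kubectl get pods
    kubectl -n kube-system logs -f deployment/cluster-autoscaler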

Step VIII: Deploy AWS Node Termination Handler

With the AWS Node Termination Handler (NTH), the Kubernetes control plane can react appropriately to all kinds of events that may cause your EC2 instances to become unavailable, such as EC2 maintenance events, EC2 Spot interruptions, ASG scale-in, ASG AZ rebalancing, and EC2 instance termination via the API or Console. Without it, the application code may fail to stop gracefully, take longer to recover full availability, or accidentally schedule work onto nodes that are going down.

Note: You may skip this step if you are using managed node groups, as interruption handling is taken care of automatically by EKS.
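A minimal installation sketch using the official eks-charts Helm repository (all chart values are left at their defaults here):

    helm repo add eks https://aws.github.io/eks-charts
    helm install aws-node-termination-handler \
      --namespace kube-system \
      eks/aws-node-termination-handler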

Deployment Output:

Deployment output.

Then, you will see a new running pod in the kube-system namespace.

Last Step: Clean Up Cluster

Last but not least, make sure you delete all the resources created during this tutorial to avoid an unpleasant surprise when you receive your AWS bill at the end of the month.
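Assuming the names used earlier in this walkthrough, the clean-up would look roughly like:

    kubectl delete -f vector-add-dpl.yml
    helm uninstall aws-node-termination-handler --namespace kube-system
    helm uninstall kube-ops-view
    eksctl delete cluster -f cluster.yaml
    aws cloudformation delete-stack --stack-name my-eks-vpc-stack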

Sample Output:

References

[1] Nullius. (2019). EKS GPU Cluster from Zero to Hero. https://alexei-led.github.io/.

[2] Taber. (2020). De-mystifying cluster networking for Amazon EKS worker nodes. AWS Blogs.

[3] Iglesias, Claudia, Sunil. (2021). 23-kubeflow-spot-instance Configurations. GitHub.

[4] Sunil, A. (2020). How we Reduced our ML Training Costs by 78%!!. Medium.

[5] Pinhasi, A. (2021). A Step by Step Guide to Building A Distributed, Spot-based Training Platform on AWS Using TorchElastic and Kubernetes. Medium.

[6] Autoscaling. (2019). Amazon EKS User Guide Documentation.
