Hybrid Cloud Setup for Deep Learning on Amazon EKS

GPU Spot Instances on an Amazon Elastic Kubernetes Service (EKS) Cluster with Scale-from-Zero Autoscaling

Jia Yi Chan
Published in ViTrox-Publication, Jan 6, 2022


Introduction

Currently, we have a Kubernetes cluster with 16 GPUs on hand. However, intensive training tasks have exhausted the available GPUs and are stalling training progress. There is a need to spill the excess training workload over into external clusters so that accurate models can still be produced at pace. The straightforward answer is to deploy on cloud service providers such as AWS, Azure and GCP. However, cost and security concerns immediately surface as major obstacles to cloud deployment.

In this document, I attempt to solve the model-training overload problem using cloud servers while mitigating the cost and security concerns.

Pre-requisite

Before we proceed, please make sure that you have installed kubectl, AWS CLI and eksctl. Also, please make sure that the AWS IAM user has sufficient permission to work with EKS.

eksctl

eksctl is a simple CLI tool that creates and manages clusters on EKS - Amazon's managed Kubernetes service for EC2.

AWS CLI

AWS CLI is a unified tool to manage your AWS services. With just one tool, you can control multiple AWS services from a command line and automate them with scripts.
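
As a quick sanity check, you can confirm that all three tools are installed and on your PATH (a minimal sketch; the exact version output will differ on your machine):

$ kubectl version --client
$ eksctl version
$ aws --version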

Auto Scaling Group (ASG)

An Auto Scaling group contains a collection of Amazon EC2 instances that are treated as a logical grouping for the purposes of automatic scaling and management. An Auto Scaling group also enables you to use Amazon EC2 Auto Scaling features such as health check replacements and scaling policies. Maintaining the number of instances in an Auto Scaling group and automatic scaling are the core functionalities of the Amazon EC2 Auto Scaling service.

Spot Instance

A Spot Instance is an instance that uses spare EC2 capacity and is offered at a steep discount compared with the On-Demand price (AWS advertises discounts of up to 90%). Because Spot Instances enable you to request unused EC2 capacity at these discounts, you can lower your Amazon EC2 costs significantly. The hourly price for a Spot Instance is called the Spot price. The Spot price of each instance type in each Availability Zone is set by Amazon EC2 and is adjusted gradually based on the long-term supply of and demand for Spot Instances. Your Spot Instance runs whenever capacity is available and the maximum price per hour for your request exceeds the Spot price.
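
If you want a feel for current Spot pricing before committing to an instance type, one option (a sketch, assuming the us-east-1 region that is configured later in this guide) is to query the recent Spot price history from the AWS CLI:

# Show the five most recent Spot price points for p3.2xlarge Linux instances
aws ec2 describe-spot-price-history \
  --instance-types p3.2xlarge \
  --product-descriptions "Linux/UNIX" \
  --region us-east-1 \
  --max-items 5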

IAM Policies

In order to use Amazon EKS, your account needs several permissions, which can be granted with AWS Identity and Access Management (IAM) policies. The following are the required policies:

  • EKSFullAccess
  • AWSCloudFormationFullAccess
  • AmazonEC2FullAccess
  • IAMFullAccess
  • AmazonEC2ContainerRegistryReadOnly
  • AmazonEKS_CNI_Policy
  • AmazonS3FullAccess

You can view the detailed guide here and search for the Security Settings section.
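
If you manage the IAM user from the CLI, attaching the AWS-managed policies above looks roughly like the following (a sketch; the user name is a placeholder, and a custom policy such as EKSFullAccess must first be created in your account and then attached by its own ARN):

# Attach one of the AWS-managed policies to the IAM user (repeat per policy)
aws iam attach-user-policy \
  --user-name <your-iam-user> \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess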

Working Instruction

Step I: Set Up AWS CLI

AWS CLI version 2

After installation is complete, set up the AWS CLI with the aws configure command, which creates or updates the config and credentials files. Before starting, make sure you have already acquired the access key ID and secret access key of the IAM user. If you haven't created an IAM user, do refer here.

Below is an example of what the AWS CLI will prompt you for. You are required to substitute your own value for each field.

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalxxxxxxEMI/K7Mxxxx/bPxRfiCxxxxxMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: yaml
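
To confirm that the credentials are picked up correctly, a quick check is to ask AWS who you are; the account ID and user ARN in the output will be your own:

$ aws sts get-caller-identity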

Step II: Create VPC with Public and Private Subnets

EKS normally creates a VPC automatically during cluster creation. If you wish to have more control over the cluster resources, it is recommended to provision your VPC manually. For more information on VPC creation, you can refer here. Generally, you would be required to choose one of the following VPC architectures:

  • Public Subnet only, or
  • Private Subnet only, or
  • Public and Private Subnets

Amazon EKS recommends running the cluster in a VPC with both public and private subnets, so that Kubernetes can create public load balancers in the public subnets that route traffic to pods running on nodes in the private subnets. This configuration is not required; you can run a cluster in a VPC with only private or only public subnets, depending on your networking and security requirements.

Note: In this configuration, worker nodes are instantiated in the private subnets and NAT Gateways are instantiated in the public subnets. If you prefer building stacks for private-only or public-only subnets, do check out the other available templates here.

VPCs can be created manually using two methods: AWS Console or AWS CLI.

Option I: Using AWS Console:

  1. Open the AWS CloudFormation console.
  2. Select Create Stack, With new resources (standard).
  3. Under Prepare template, make sure that Template is ready is selected and then under Template source, select Amazon S3 URL. Paste the following URL into the text area: https://amazon-eks.s3.us-west-2.amazonaws.com/cloudformation/2020-10-29/amazon-eks-vpc-private-subnets.yaml
  4. Select Next, specify the parameters accordingly on the Specify Details page. Then choose Next.
  5. (Optional) Tag the stack resources. Click Next, then Create Stack.
  6. Your VPC stack will be created within about 10 minutes. You can then view the new VPC in the VPC console.

Option II: Using AWS CLI:

  1. If you prefer more flexibility over the resources and network configuration, you may skip this step and proceed directly to step 2 to download and edit the configuration template. Otherwise, you can use the AWS template without modification, which is as easy as running the following command.

Note: You may replace my-eks-vpc-stack with your own stack name.

aws cloudformation create-stack \
  --stack-name my-eks-vpc-stack \
  --template-url https://s3.us-west-2.amazonaws.com/amazon-eks/cloudformation/2020-10-29/amazon-eks-vpc-private-subnets.yaml

2. As an alternative, you may download https://amazon-eks.s3.us-west-2.amazonaws.com/cloudformation/2020-10-29/amazon-eks-vpc-private-subnets.yaml and edit the template. In my case, I would like to add tags to the VPC for billing purposes. Then, run the command below to create the VPC stack:

aws cloudformation create-stack \
  --stack-name my-eks-vpc-stack \
  --template-body file://aws/amazon-eks-vpc-private-subnets.yaml

Sample Output

The stacks will be created in a few seconds. The sample output will be:

StackId: arn:aws:cloudformation:us-east-1:xxxxxxxxxxxx:stack/my-eks-vpc-stack/9fxxxxxx-63c7-11ec-xxxx-0eaaxxx16x42b

Now, you’re able to view the stack in the CloudFormation console.

Stacks information on the CloudFormation.

The VPC Dashboard also provides access to other resources like Subnets, NAT Gateways, Route tables, and more. Please take note of the IDs for VPC and all subnets for the network configurations in the next section.
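
If you prefer to grab those IDs from the CLI rather than the console, one way (a sketch, assuming the stack name used above; the exact output keys depend on the template) is to read the CloudFormation stack outputs:

aws cloudformation describe-stacks \
  --stack-name my-eks-vpc-stack \
  --query "Stacks[0].Outputs" \
  --output table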

VPC Dashboard

Step III: Config YAML File for EKS Cluster

Create a YAML file with the configuration below. You’re required to attach your own VPC and subnet IDs that were created previously.

Note: This configuration creates an EKS Cluster in a VPC that contains public and private subnets. If you prefer an architecture with a private subnet only, you may refer to the configuration here.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: v1-cluster
  region: us-east-1
  tags:
    department: CoE    # resource tagging
    environment: dev   # resource tagging

vpc:
  nat:
    gateway: HighlyAvailable # other options: Disable, Single
  id: vpc-0361b7xxxxxxxxxxx
  subnets:
    public:
      us-east-1a:
        id: subnet-0e4731ea7bxxxxxx78
      us-east-1b:
        id: subnet-0899xxxxxxxxxxxxxb
    private:
      us-east-1a:
        id: subnet-0aaxxxxxxxxxxxxxx
      us-east-1b:
        id: subnet-0xxxxxxxxxxxxxxxx

nodeGroups:
  - name: cpu-spot
    minSize: 0
    maxSize: 10
    desiredCapacity: 1
    privateNetworking: true
    iam:
      withAddonPolicies:
        autoScaler: true
        albIngress: true
    instancesDistribution:
      instanceTypes:
        - t3.small
      onDemandBaseCapacity: 0                # Spot-only node group
      onDemandPercentageAboveBaseCapacity: 0 # Spot-only node group
      spotAllocationStrategy: capacity-optimized
    labels:
      v1-cluster/capacityType: SPOT
      intent: apps
      type: self-managed-spot
    taints:
      spotInstance: "true:PreferNoSchedule"
    tags:
      k8s.io/cluster-autoscaler/node-template/label/v1-cluster/capacityType: SPOT
      k8s.io/cluster-autoscaler/node-template/label/intent: apps
      k8s.io/cluster-autoscaler/node-template/label/type: self-managed-spot
      k8s.io/cluster-autoscaler/node-template/taint/spotInstance: "true:PreferNoSchedule"
      kubernetes.io/cluster/v1-cluster: owned
  - name: gpu-spot-1
    amiFamily: AmazonLinux2
    instanceType: mixed
    desiredCapacity: 0
    minSize: 0
    maxSize: 5
    iam:
      withAddonPolicies:
        autoScaler: true
        albIngress: true
    instancesDistribution:
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      instanceTypes:
        - p3.2xlarge
        - p3.8xlarge
        - p2.8xlarge
      spotInstancePools: 5
    tags:
      k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu: "true:NoSchedule"
      k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: "true" # match pod's nodeSelector
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/v1-cluster: "owned"
      k8s.amazonaws.com/accelerator: nvidia-tesla
      kubernetes.io/cluster/v1-cluster: owned
    labels:
      lifecycle: Ec2Spot
      nvidia.com/gpu: "true"
      k8s.amazonaws.com/accelerator: nvidia-tesla
    taints:
      nvidia.com/gpu: "true:NoSchedule"
    privateNetworking: true
    availabilityZones: ["us-east-1b"] # prevent AZ mismatch
  - name: gpu-spot-2
    amiFamily: AmazonLinux2
    instanceType: mixed
    desiredCapacity: 0
    minSize: 0
    maxSize: 5
    iam:
      withAddonPolicies:
        autoScaler: true
        albIngress: true
    instancesDistribution:
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      #spotAllocationStrategy: capacity-optimized
      instanceTypes:
        - p3.2xlarge
        - p3.8xlarge
        - p2.8xlarge
      spotInstancePools: 5
    tags:
      k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu: "true:NoSchedule"
      k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: "true" # match pod's nodeSelector
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/v1-cluster: "owned"
      k8s.amazonaws.com/accelerator: nvidia-tesla
      kubernetes.io/cluster/v1-cluster: owned
    labels:
      lifecycle: Ec2Spot
      nvidia.com/gpu: "true"
      k8s.amazonaws.com/accelerator: nvidia-tesla
    taints:
      nvidia.com/gpu: "true:NoSchedule"
    privateNetworking: true
    availabilityZones: ["us-east-1a"] # prevent AZ mismatch

When this configuration is applied (Step IV), eksctl will create a new EKS cluster whose CPU Spot node group starts with a single t3.small instance, while both GPU Spot node groups start scaled to zero.

Here are a few points to consider:

  • Line 15–24: Both the VPC ID and the subnet IDs provisioned in the previous section have to be specified:
subnets:
  public:
    us-east-1a:
      id: subnet-0e4731ea7bxxxxxx78
    us-east-1b:
      id: subnet-0899xxxxxxxxxxxxxb
  private:
    us-east-1a:
      id: subnet-0aaxxxxxxxxxxxxxx
    us-east-1b:
      id: subnet-0xxxxxxxxxxxxxxxx
  • Line 33–35, 61–63, 94–96: The node groups are marked so that the autoscaler can control them.
withAddonPolicies:
  autoScaler: true
  • Line 41, 100: spotAllocationStrategy: capacity-optimized is one of the allocation strategies that minimizes the chance of Spot Instance interruptions. This strategy examines the available capacity pool for each instance type in every Availability Zone and launches the instance types with the deepest spare capacity, so they are less likely to be reclaimed and tend to remain active until the capacity level changes. As a result, interruptions to your instances and services are minimal. Alternatively, you can use spotAllocationStrategy: capacity-optimized-prioritized to specify a priority order for your preferred instance types. You may also use spotAllocationStrategy: lowest-price so that the lowest-priced instances are always chosen; bear in mind that price fluctuations then matter more, which typically results in a higher rate of interruptions.
  • Line 53, 76, 110: The kubernetes.io/cluster/<name> tag is set to owned, marking the Auto Scaling groups as belonging to this cluster; together with the k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<name> tags, this lets the Cluster Autoscaler auto-discover the groups during scale-up or scale-down.
  • Line 44: The intent label lets you deploy specific applications on labelled nodes. For example, intent: apps allows you to deploy stateless applications on nodes labelled apps, while intent: control-apps would target control applications on nodes carrying that label.
  • Line 55, 88: amiFamily: AmazonLinux2 indicates the use of the EKS AMI based on Amazon Linux 2. You may specify another OS based on your preference by referring to the documentation here.
  • Line 85, 119: privateNetworking: true. When placing the node groups inside a private subnet, privateNetworking must be set to true on the node groups. This ensures that only private IP addresses are assigned to the EC2 instances.
  • Line 83, 117: The GPU node group has a taint that prevents cluster infra services from getting scheduled on them.
  • Line 118–123: If you are using nodeSelector and taints, you need to tag the ASG with the node-template keys k8s.io/cluster-autoscaler/node-template/label/ and k8s.io/cluster-autoscaler/node-template/taint/. When scaling from 0, the node taints and labels are not visible in the Launch Configuration, so tagging the resources is what enables the Cluster Autoscaler to discover suitable instance types from zero. More information on ASG tags can be read here; a verification sketch follows this list.
  • Lines 86, 120: availabilityZones: ["us-east-1b"] specifies the AZ the node group runs in. Keeping a workload's nodes in the same zone matters because cross-AZ network charges (for example, when a Spot Instance is interrupted and a replacement is spawned in a different AZ) can be significant. If you prefer running a node type in a different AZ, you may duplicate the whole configuration of the corresponding node group and specify another AZ for it.
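
Once the cluster and node groups have been created (Step IV), it is worth verifying that these node-template tags actually landed on the Auto Scaling group, since scale-from-zero quietly fails without them. A quick check might look like this (a sketch; the ASG name is a placeholder you can copy from the EC2 Auto Scaling console or from eksctl's CloudFormation output):

# List the tags attached to a node group's Auto Scaling group
aws autoscaling describe-tags \
  --filters "Name=auto-scaling-group,Values=<your-gpu-asg-name>"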

Step IV: Create Cluster

To create an EKS cluster from the configuration file, execute the following command:

eksctl create cluster -f aws/v4-cluster.yml

Creating a cluster takes 15–25 minutes, and you’ll see many lines of output. The last one looks like the following:

Sample Output

....
[✓] EKS cluster "your-cluster-name" in "region-code" region is ready
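
Once the cluster reports ready, it is worth confirming that kubectl points at the new cluster and that the node groups were registered (a quick check; the names will match whatever you used in the config file):

$ kubectl get nodes -o wide
$ eksctl get nodegroup --cluster v1-cluster --region us-east-1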

Step V: (Optional) Deploy kube-ops-view

Kubernetes Operational View (kube-ops-view) is a simple web page that visualizes the operations of a Kubernetes cluster. During cluster auto-scaling you can visually observe how scale-outs and scale-ins occur, although it is not intended as a full monitoring or operations-management tool.

  1. Follow the guide to install Helm.
  2. Install kube-ops-view.
helm install kube-ops-view \
stable/kube-ops-view \
--set service.type=LoadBalancer \
--set rbac.create=True

3. Run the command below, then copy the domain shown in the EXTERNAL-IP column and paste it into your web browser.

kubectl get svc kube-ops-view | tail -n 1 | awk '{ print "Kube-ops-view URL = http://"$4 }'

4. You can browse the URL and check the status of the cluster you have currently deployed.

Step VI: Create Auto Scaler

  1. Create an IAM OIDC provider for your cluster. Replace <your cluster name> with your own cluster name.
eksctl utils associate-iam-oidc-provider --cluster <your cluster name> --approve

The OpenID Connect (OIDC) layer uses JSON Web Tokens (JWT) to share security information between a client and a server. Support for OIDC-compatible service account tokens was added in Kubernetes 1.12, and EKS now hosts a public OIDC discovery endpoint per cluster, enabling a third party such as IAM to validate end-users and receive their basic information.
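
To double-check that the OIDC provider was associated, one option (a sketch; replace the cluster name with your own) is to compare the cluster's issuer URL against the providers registered in IAM:

# Issuer URL exposed by the EKS cluster
aws eks describe-cluster --name <your cluster name> \
  --query "cluster.identity.oidc.issuer" --output text

# OIDC providers currently registered in IAM (the issuer ID should appear here)
aws iam list-open-id-connect-providers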

2. Create IAM Policy and Role

a. Save the following contents to a file that’s named cluster-autoscaler-policy.json. If your existing node groups were created with eksctl and you used the --asg-access option, then this policy already exists and you can skip to Step 2-b.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeLaunchTemplateVersions"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
  • { "Version": "2012-10-17", "Statement": [ { "Action": [ "autoscaling:DescribeAutoScalingGroups", "autoscaling:DescribeAutoScalingInstances", "autoscaling:DescribeLaunchConfigurations", "autoscaling:DescribeTags", "autoscaling:SetDesiredCapacity", "autoscaling:TerminateInstanceInAutoScalingGroup", "ec2:DescribeLaunchTemplateVersions" ], "Resource": "*", "Effect": "Allow" } ] }

b. Create the policy with the following command. You can change the value for policy-name.

aws iam create-policy \
  --policy-name AmazonEKSClusterAutoscalerPolicy \
  --policy-document file://cluster-autoscaler-policy.json

Take note of the Amazon Resource Name (ARN) returned in the output; you need it in the next command. Then create an IAM role and the cluster-autoscaler Kubernetes service account, attaching the policy by its ARN:

eksctl create iamserviceaccount \
--cluster=<my-cluster> \
--namespace=kube-system \
--name=cluster-autoscaler \
--attach-policy-arn=arn:aws:iam::<AWS_ACCOUNT_ID>:policy/<AmazonEKSClusterAutoscalerPolicy> \
--override-existing-serviceaccounts \
--approve
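
You can verify that the service account was created and annotated with the IAM role (a quick check; the role ARN in the annotation will be specific to your account):

$ eksctl get iamserviceaccount --cluster <my-cluster> --namespace kube-system
$ kubectl -n kube-system get serviceaccount cluster-autoscaler -o yaml | grep role-arn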

3. Set up node auto scaling with the Kubernetes Cluster Autoscaler. Run the commands below to install and configure it:

# install cluster autoscaler
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
# Edit deployment
kubectl -n kube-system edit deployment.apps/cluster-autoscaler
# Update autoscaler deployment flags
# Replace <Your Cluster Name> with cluster name in the script
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<Your Cluster Name>
- --balance-similar-node-groups
- --skip-nodes-with-system-pods=false

When using separate node groups per zone, the --balance-similar-node-groups flag keeps nodes balanced across zones for workloads that do not require topological scheduling (source). The --skip-nodes-with-system-pods=false flag allows the autoscaler to scale down nodes that are running kube-system pods, so the system pods can be consolidated onto as few nodes as possible.
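
In addition to the two flags above, the EKS and upstream Cluster Autoscaler guidance also suggests preventing the autoscaler from evicting its own pod during scale-down. A sketch of that extra step using kubectl patch:

# Keep the cluster-autoscaler pod from being evicted during scale-down
kubectl patch deployment cluster-autoscaler \
  -n kube-system \
  -p '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}}}}}'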

4. Then, check the running pods to ensure the autoscaler has been deployed successfully.

kubectl get pods --all-namespaces
Autoscaler pod is running in the cluster node.

You can also view the pods running in kube-ops-view.

View the running pods of autoscaler using Kube-Ops-View.

Step VII: Scaling GPU Nodes from 0 with Kubernetes Deployment

The code in this step references EKS GPU Cluster from Zero to Hero [1].

Now, we are going to test the scaling of GPU nodes from 0. Before starting, list the worker nodes that are currently running in the cluster.

kubectl get nodes --output="custom-columns=NAME:.metadata.name,ID:.spec.providerID,TYPE:.metadata.labels.beta\.kubernetes\.io/instance-type"

Then, create a vector-add-dpl.yml file. The following is a sample Deployment that rolls out a ReplicaSet of cuda-vector-add Pods, each with an nvidia.com/gpu: 1 limit so that a single GPU is attached per Pod; it starts with one replica.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-vector-add
  labels:
    app: cuda-vector-add
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cuda-vector-add
  template:
    metadata:
      name: cuda-vector-add
      labels:
        app: cuda-vector-add
    spec:
      nodeSelector: # refer to the node groups' label
        nvidia.com/gpu: "true"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: cuda-vector-add
          # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
          image: "k8s.gcr.io/cuda-vector-add:v0.1"
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU

Apply the manifest file to the cluster by following the command below:

kubectl create -f aws/vector-add-dpl.yml

You can observe the Auto Scaler operations through the kube-ops-view:

Before scaling, the pods are waiting for a node to launch:

The pod is waiting to be placed into a GPU node.

Or, you can also check the real-time auto-scaling logs using kubectl:

kubectl logs deployment/cluster-autoscaler -n kube-system -f

The scale-up logs will be shown as below:

Instances are scaled up within about 15 minutes. After scaling, you will see that the pod has been placed onto a node:

A GPU node launches from 0.

Pod information can be viewed through the following commands:

kubectl describe pod <your-pod-id>

Sample Output of Pod Events:

Pod Events

Sample Output of Running Nodes:

$ kubectl get nodes --output="custom-columns=NAME:.metadata.name,ID:.spec.providerID,TYPE:.metadata.labels.beta\.kubernetes\.io/instance-type,NODEGROUP:.metadata.labels.alpha\.eksctl\.io/nodegroup-name"
NAME                              ID                                      TYPE         NODEGROUP
ip-192-168-176-139.ec2.internal   aws:///us-east-1a/i-0c2a077d4f6c76ee3   t3.small     ng-spot-cpu-4
ip-192-168-199-152.ec2.internal   aws:///us-east-1b/i-0801904aa3f9a0682   p3.2xlarge   gpu-spot-ng-4

You may also test the scaling by adjusting the replicas and checking the results:

# Adjust pod number
kubectl scale --replicas=3 deployment/cuda-vector-add
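
While the replica count changes, you can watch the pods and nodes react in a second terminal (a quick check using the app label from the manifest above):

$ kubectl get pods -l app=cuda-vector-add -o wide --watch
$ kubectl get nodes --watch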

Step VIII: Deploy AWS Node Termination Handler

With the AWS Node Termination Handler (NTH), the Kubernetes control plane can react appropriately to events that may make your EC2 instances unavailable, such as EC2 maintenance events, EC2 Spot interruptions, ASG scale-in, ASG AZ rebalancing, and EC2 instance termination via the API or Console. If these events are not handled, your application code may fail to stop gracefully, take longer to recover full availability, or accidentally schedule work onto nodes that are going away.

Note: you may skip this step if you are using managed node groups, as interruption handling is taken care of automatically by EKS.

helm repo add eks https://aws.github.io/eks-charts
helm install aws-node-termination-handler \
  --namespace kube-system \
  --version 0.15.4 \
  --set nodeSelector.type=self-managed-spot \
  eks/aws-node-termination-handler

Deployment Output:

Deployment output.

Then, you will see a new pod running in the kube-system namespace.

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE
kube-system   aws-node-termination-handler-wtd4k   1/1     Running   0          23s

Last Step: Clean Up Cluster

Last but not least, make sure you delete all the resources created at the end of the tutorial to avoid being shocked when you receive your AWS bill at the end of the month.

$ eksctl delete cluster --region=us-east-1 --name=<your-cluster-name>
$ aws cloudformation delete-stack --stack-name my-eks-vpc-stack --region us-east-1

Sample Output:
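
To confirm nothing billable is left behind, a final check (a sketch; the cluster list should come back empty, and the stack lookup should eventually report that the stack no longer exists):

$ eksctl get cluster --region us-east-1
$ aws cloudformation describe-stacks --stack-name my-eks-vpc-stack --region us-east-1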

References

[1] Nullius. (2019). EKS GPU Cluster from Zero to Hero. https://alexei-led.github.io/.

[2] Taber. (2020). De-mystifying cluster networking for Amazon EKS worker nodes. AWS Blogs.

[3] Iglesias, Claudia, Sunil. (2021). 23-kubeflow-spot-instance Configurations. GitHub.

[4] Sunil, A. (2020). How we Reduced our ML Training Costs by 78%!!. Medium.

[5] Pinhasi, A. (2021). A Step by Step Guide to Building A Distributed, Spot-based Training Platform on AWS Using TorchElastic and Kubernetes. Medium.

[6] Autoscaling. (2019). Amazon EKS User Guide Documentation.
