Hybrid Cloud Setup for Deep Learning on Amazon EKS
GPU Spot Instances on an Amazon Elastic Kubernetes Service Cluster with Autoscaling from Zero
Introduction
Currently, there is a K8s cluster with 16 GPUs on hand. However, intensive training tasks are exhausting the available GPUs and stalling training progress. There is a need to spill the training workload over into external clusters so that accurate models can still be produced at pace. The straightforward answer is to deploy on cloud service providers such as AWS, Azure, or GCP. However, cost and security concerns immediately surface as major obstacles to cloud deployment.
In this document, I attempt to solve the training-overload problem using cloud servers while mitigating the cost and security concerns.
Prerequisites
Before we proceed, please make sure that you have installed kubectl, the AWS CLI, and eksctl. Also, please make sure that your AWS IAM user has sufficient permissions to work with EKS.
eksctl
eksctl is a simple CLI tool that creates and manages clusters on EKS - Amazon's managed Kubernetes service for EC2.
AWS CLI
AWS CLI is a unified tool to manage your AWS services. With just one tool, you can control multiple AWS services from a command line and automate them with scripts.
Auto Scaling Group (ASG)
An Auto Scaling group contains a collection of Amazon EC2 instances that are treated as a logical grouping for the purposes of automatic scaling and management. An Auto Scaling group also enables you to use Amazon EC2 Auto Scaling features such as health check replacements and scaling policies. Maintaining the number of instances in an Auto Scaling group and automatic scaling are the core functionalities of the Amazon EC2 Auto Scaling service.
Spot Instance
A Spot Instance is an instance that uses spare EC2 capacity and is available for less than the On-Demand price, often at a discount of up to 90%. Because Spot Instances enable you to request unused EC2 instances at steep discounts, you can lower your Amazon EC2 costs significantly. The hourly price for a Spot Instance is called the Spot price. The Spot price of each instance type in each Availability Zone is set by Amazon EC2 and is adjusted gradually based on the long-term supply of and demand for Spot Instances. Your Spot Instance runs whenever capacity is available and the maximum price per hour for your request exceeds the Spot price.
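If you want a feel for current Spot pricing before committing to instance types, the AWS CLI can query the recent Spot price history. A minimal sketch (the instance type, region, and start time below are illustrative, not values from this walkthrough):
aws ec2 describe-spot-price-history \
    --region us-east-1 \
    --instance-types p3.2xlarge \
    --product-descriptions "Linux/UNIX" \
    --start-time 2021-12-01T00:00:00 \
    --max-items 5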
IAM Policies
In order to use Amazon EKS, you need an account with access to several security permissions. Security permissions can be set with AWS Identity and Access Management (IAM). The following are the required policies:
EKSFullAccess
AWSCloudFormationFullAccess
AmazonEC2FullAccess
IAMFullAccess
AmazonEC2ContainerRegistryReadOnly
AmazonEKS_CNI_Policy
AmazonS3FullAccess
You can view the detailed guide here and search for the Security Settings section.
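As a rough illustration only (the user name is a placeholder, and some of the policies listed above, such as EKSFullAccess, may be custom policies in your account rather than AWS managed ones), attaching a managed policy to an IAM user from the CLI looks like this:
# attach one managed policy to the IAM user that will run eksctl
aws iam attach-user-policy \
    --user-name <your-iam-user> \
    --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess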
Working Instruction
Step I: Set Up AWS CLI
AWS CLI version 2
After the installation is complete, set up the AWS CLI with the aws configure command, which writes your settings to the config and credentials files. Before starting, make sure you have already acquired the access key ID and secret access key of the IAM user. If you haven't created one, do refer here.
Below is an example of what the AWS CLI will prompt you for. You are required to substitute your own values for each field.
$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalxxxxxxEMI/K7Mxxxx/bPxRfiCxxxxxMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: yaml
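To confirm the credentials work before moving on, you can ask STS which identity the CLI is using. This is a quick sanity check rather than a required step:
# prints the account ID and IAM user ARN of the configured credentials
aws sts get-caller-identity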
Step II: Create VPC with Public and Private Subnets
EKS normally creates a VPC automatically during cluster creation. If you wish to have more control over the cluster resources, it is recommended to provision your VPC manually. For more information on VPC creation, you can refer here. Generally, you are required to choose one of the following VPC architectures:
- Public subnets only, or
- Private subnets only, or
- Public and private subnets
Amazon EKS recommends running the cluster in a VPC with both public and private subnets, so that Kubernetes can create public load balancers in the public subnets that load-balance traffic to pods running on nodes in the private subnets. This configuration is not required, however; you can run a cluster in a VPC with only private or only public subnets, depending on your networking and security requirements.
Note: In this configuration, worker nodes are instantiated in the private subnets and NAT Gateways are instantiated in the public subnets. If you prefer building stacks for private-only or public-only subnets, do check out the other available templates here.
VPCs can be created manually using two methods: AWS Console or AWS CLI.
Option I: Using AWS Console:
- Open the AWS CloudFormation console.
- Select Create stack, With new resources (standard).
- Under Prepare template, make sure that Template is ready is selected and then under Template source, select Amazon S3 URL. Paste the following URL into the text area: https://amazon-eks.s3.us-west-2.amazonaws.com/cloudformation/2020-10-29/amazon-eks-vpc-private-subnets.yaml
- Select Next, specify the parameters accordingly on the Specify Details page. Then choose Next.
- (Optional) Tag the stack resources. Choose Next, then Create stack.
- Your VPC stack will be created within about 10 minutes. You can view the resulting resources in the VPC console.
Option II: Using AWS CLI:
1. If you prefer more flexibility over the resource and network configuration, you may skip this step and proceed directly to step 2 to download and edit the configuration template. Otherwise, you can use the AWS template without modification, which is as easy as running the following command.
Note: You may replace my-eks-vpc-stack with your own stack name.
aws cloudformation create-stack \
    --stack-name my-eks-vpc-stack \
    --template-url https://s3.us-west-2.amazonaws.com/amazon-eks/cloudformation/2020-10-29/amazon-eks-vpc-private-subnets.yaml
2. As an alternative, you may download https://amazon-eks.s3.us-west-2.amazonaws.com/cloudformation/2020-10-29/amazon-eks-vpc-private-subnets.yaml and edit the template. In my case, I would like to add tags to the VPC for billing purposes. Then, apply the command below to create a VPC stack:
aws cloudformation create-stack \
    --stack-name my-eks-vpc-stack \
    --template-body file://aws/amazon-eks-vpc-private-subnets.yaml
Sample Output
The create-stack call returns within a few seconds. The sample output will be:
StackId: arn:aws:cloudformation:us-east-1:xxxxxxxxxxxx:stack/my-eks-vpc-stack/9fxxxxxx-63c7-11ec-xxxx-0eaaxxx16x42b
Now, you’re able to view the stack at CloudFormation.
The VPC Dashboard also provides access to other resources like Subnets, NAT Gateways, Route tables, and more. Please take note of the IDs for VPC and all subnets for the network configurations in the next section.
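Rather than copying the VPC and subnet IDs from the console, you can also pull them from the stack outputs with the CLI. A sketch, assuming the stack name used above and that the template exposes its outputs (output keys may differ between template versions):
# list the outputs (VPC ID, subnet IDs, security group) of the stack
aws cloudformation describe-stacks \
    --stack-name my-eks-vpc-stack \
    --query "Stacks[0].Outputs" \
    --output table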
Step III: Config YAML File for EKS Cluster
Create a YAML file with the configurations below. You’re required to attach your own VPC and subnets IDs that were created previously.
Note: This configuration creates an EKS Cluster in a VPC that contains public and private subnets. If you prefer an architecture with a private subnet only, you may refer to the configuration here.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: v1-cluster
  region: us-east-1
  tags:
    department: CoE # resource tagging
    environment: dev # resource tagging
vpc:
  nat:
    gateway: HighlyAvailable # other options: Disable, Single
  id: vpc-0361b7xxxxxxxxxxx
  subnets:
    public:
      us-east-1a:
        id: subnet-0e4731ea7bxxxxxx78
      us-east-1b:
        id: subnet-0899xxxxxxxxxxxxxb
    private:
      us-east-1a:
        id: subnet-0aaxxxxxxxxxxxxxx
      us-east-1b:
        id: subnet-0xxxxxxxxxxxxxxxx
nodeGroups:
  - name: cpu-spot
    minSize: 0
    maxSize: 10
    desiredCapacity: 1
    privateNetworking: true
    iam:
      withAddonPolicies:
        autoScaler: true
        albIngress: true
    instancesDistribution:
      instanceTypes:
        - t3.small
      onDemandBaseCapacity: 0                 # Spot only
      onDemandPercentageAboveBaseCapacity: 0  # Spot only
      spotAllocationStrategy: capacity-optimized
    labels:
      v1-cluster/capacityType: SPOT
      intent: apps
      type: self-managed-spot
    taints:
      spotInstance: "true:PreferNoSchedule"
    tags:
      k8s.io/cluster-autoscaler/node-template/label/v1-cluster/capacityType: SPOT
      k8s.io/cluster-autoscaler/node-template/label/intent: apps
      k8s.io/cluster-autoscaler/node-template/label/type: self-managed-spot
      k8s.io/cluster-autoscaler/node-template/taint/spotInstance: "true:PreferNoSchedule"
      kubernetes.io/cluster/v1-cluster: owned
  - name: gpu-spot-1
    amiFamily: AmazonLinux2
    instanceType: mixed
    desiredCapacity: 0
    minSize: 0
    maxSize: 5
    iam:
      withAddonPolicies:
        autoScaler: true
        albIngress: true
    instancesDistribution:
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      instanceTypes:
        - p3.2xlarge
        - p3.8xlarge
        - p2.8xlarge
      spotInstancePools: 5
    tags:
      k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu: "true:NoSchedule"
      k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: "true" # match pod's nodeSelector
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/v1-cluster: "owned"
      k8s.amazonaws.com/accelerator: nvidia-tesla
      kubernetes.io/cluster/v1-cluster: owned
    labels:
      lifecycle: Ec2Spot
      nvidia.com/gpu: "true"
      k8s.amazonaws.com/accelerator: nvidia-tesla
    taints:
      nvidia.com/gpu: "true:NoSchedule"
    privateNetworking: true
    availabilityZones: ["us-east-1b"] # prevent AZ mismatch
  - name: gpu-spot-2
    amiFamily: AmazonLinux2
    instanceType: mixed
    desiredCapacity: 0
    minSize: 0
    maxSize: 5
    iam:
      withAddonPolicies:
        autoScaler: true
        albIngress: true
    instancesDistribution:
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      # spotAllocationStrategy: capacity-optimized
      instanceTypes:
        - p3.2xlarge
        - p3.8xlarge
        - p2.8xlarge
      spotInstancePools: 5
    tags:
      k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu: "true:NoSchedule"
      k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: "true" # match pod's nodeSelector
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/v1-cluster: "owned"
      k8s.amazonaws.com/accelerator: nvidia-tesla
      kubernetes.io/cluster/v1-cluster: owned
    labels:
      lifecycle: Ec2Spot
      nvidia.com/gpu: "true"
      k8s.amazonaws.com/accelerator: nvidia-tesla
    taints:
      nvidia.com/gpu: "true:NoSchedule"
    privateNetworking: true
    availabilityZones: ["us-east-1a"] # prevent AZ mismatch
A new EKS cluster will be created with a single Spot t3.small instance in the cpu-spot node group; both GPU node groups start with zero nodes.
Here are a few points to consider:
- Line 15–24: Both the VPC ID and the subnet IDs provisioned in the previous section must be specified.
subnets:
  public:
    us-east-1a:
      id: subnet-0e4731ea7bxxxxxx78
    us-east-1b:
      id: subnet-0899xxxxxxxxxxxxxb
  private:
    us-east-1a:
      id: subnet-0aaxxxxxxxxxxxxxx
    us-east-1b:
      id: subnet-0xxxxxxxxxxxxxxxx
- Line 33–35, 61–63, 94–96: The node groups are marked so that the Cluster Autoscaler can control them.
iam:
  withAddonPolicies:
    autoScaler: true
- Line 41, 100: spotAllocationStrategy: capacity-optimized is one of the allocation strategies that minimize the chance of Spot Instance interruptions. This strategy examines the capacity pool for each instance type in every Availability Zone and launches instances from the pools with the most spare capacity, so they are least likely to be reclaimed and remain active until the free capacity level changes. As a result, interruptions to your instances and services are kept to a minimum. Alternatively, you can use spotAllocationStrategy: capacity-optimized-prioritized to specify a priority order for your preferred instance types (see the snippet after this list). You may also use the lowest-price strategy so that the lowest-priced instance is always chosen; remember that price fluctuations happen more frequently with this strategy, which results in a higher rate of interruptions.
- Line 53, 76, 110: The kubernetes.io/cluster/<name> tag is set to owned. The Cluster Autoscaler requires this tag on your Auto Scaling groups so that they can be auto-discovered during scale-up or scale-down.
- Line 44: The intent label allows you to deploy specific applications on labelled nodes. For example, intent: apps lets you deploy stateless applications on nodes labelled with apps; likewise, intent: control-apps can be used to deploy control applications on nodes labelled with control-apps.
- Line 55, 88: amiFamily: AmazonLinux2 indicates the use of the EKS AMI based on Amazon Linux 2. You may specify another OS based on your preference by referencing here.
- Line 85, 119: When placing node groups inside a private subnet, privateNetworking must be set to true on the node groups. This ensures that only private IP addresses are assigned to the EC2 instances.
- Line 83, 117: The GPU node groups have a taint that prevents cluster infra services from being scheduled on them.
- Line 118–123: If you are using nodeSelector and taints, you need to tag the ASG with the node-template keys k8s.io/cluster-autoscaler/node-template/label/ and k8s.io/cluster-autoscaler/node-template/taint/. When scaling from 0, the node taints and labels are not visible in the Launch Configuration, so tagging the resources enables the Cluster Autoscaler to discover suitable node groups even when they have no running instances. More information on ASG tags can be read here.
- Lines 86, 120: availabilityZones: ["us-east-1b"] pins a node group to a single AZ. Running each group in one zone is important to avoid cross-AZ network charges (for example, when a Spot Instance is interrupted and a replacement is spawned in a different AZ), which can be significant. If you prefer running a node type in a different AZ, duplicate the whole configuration of the corresponding group and specify another AZ for it.
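For reference, here is a minimal sketch of a prioritized Spot distribution (the values are illustrative and not part of the cluster configuration above); with capacity-optimized-prioritized, the order of the instanceTypes list is treated as the priority order:
instancesDistribution:
  instanceTypes: # highest priority first
    - p3.2xlarge
    - p2.8xlarge
  onDemandBaseCapacity: 0
  onDemandPercentageAboveBaseCapacity: 0
  spotAllocationStrategy: capacity-optimized-prioritized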
Step IV: Create Cluster
To create an EKS cluster from the configuration file, execute the following command:
eksctl create cluster -f aws/v4-cluster.yml
Creating a cluster takes 15–25 minutes, and you’ll see many lines of output. The last one looks like the following:
Sample Output
....
[✓] EKS cluster "your-cluster-name" in "region-code" region is ready
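Once the cluster is reported ready, eksctl has already updated your kubeconfig, so a quick check that the control plane is reachable and the initial Spot node has joined can look like this:
# list worker nodes and the system pods running on them
kubectl get nodes -o wide
kubectl get pods -n kube-system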
Step V: (Optional) Deploy kube-ops-view
Kubernetes Operational View (kube-ops-view) is a simple web page that visualizes the operations of a Kubernetes cluster. During cluster auto-scaling, you can visually observe how scale-outs and scale-ins occur, though it is not intended for monitoring or operations management.
1. Follow the guide to install Helm.
2. Install kube-ops-view.
helm install kube-ops-view \
stable/kube-ops-view \
--set service.type=LoadBalancer \
--set rbac.create=True
3. From the result below, copy the domain name under EXTERNAL-IP and paste it into your web browser.
kubectl get svc kube-ops-view | tail -n 1 | awk '{ print "Kube-ops-view URL = http://"$4 }'
4. You can browse the URL and check the status of the cluster you have currently deployed.
Step VI: Create Auto Scaler
1. Create an IAM OIDC provider for your cluster. Replace <your cluster name> with your own cluster name.
eksctl utils associate-iam-oidc-provider --cluster <your cluster name> --approve
A JSON Web Token (JWT) is used by the OpenID Connect (OIDC) layer to share security information between a client and a server. OIDC JSON web tokens were introduced in Kubernetes version 1.12. EKS now hosts a public OIDC discovery endpoint per cluster, enabling a third party such as IAM to validate end users and receive their basic information.
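If you are unsure whether the provider was created, you can look up the cluster's OIDC issuer and compare it against the providers registered in IAM; both calls below are standard AWS CLI commands:
# print the cluster's OIDC issuer URL
aws eks describe-cluster --name <your cluster name> \
    --query "cluster.identity.oidc.issuer" --output text
# list the OIDC providers registered in your account
aws iam list-open-id-connect-providers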
2. Create IAM Policy and Role
a. Save the following contents to a file named cluster-autoscaler-policy.json. If your existing node groups were created with eksctl and you used the --asg-access option, then this policy already exists and you can skip to step 2-b.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeTags",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "ec2:DescribeLaunchTemplateVersions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
{ "Version": "2012-10-17", "Statement": [ { "Action": [ "autoscaling:DescribeAutoScalingGroups", "autoscaling:DescribeAutoScalingInstances", "autoscaling:DescribeLaunchConfigurations", "autoscaling:DescribeTags", "autoscaling:SetDesiredCapacity", "autoscaling:TerminateInstanceInAutoScalingGroup", "ec2:DescribeLaunchTemplateVersions" ], "Resource": "*", "Effect": "Allow" } ] }
b. Create the policy with the following command. You can change the value for policy-name.
aws iam create-policy \
    --policy-name AmazonEKSClusterAutoscalerPolicy \
    --policy-document file://cluster-autoscaler-policy.json
Take note of the Amazon Resource Name (ARN) that's returned in the output; you will need it in the next step.
c. Create an IAM role and a Kubernetes service account for the Cluster Autoscaler, attaching the policy created above (substitute your cluster name, AWS account ID, and policy name):
eksctl create iamserviceaccount \
--cluster=<my-cluster> \
--namespace=kube-system \
--name=cluster-autoscaler \
--attach-policy-arn=arn:aws:iam::<AWS_ACCOUNT_ID>:policy/<AmazonEKSClusterAutoscalerPolicy> \
--override-existing-serviceaccounts \
--approve
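To verify that the service account was created and annotated with the IAM role ARN (this annotation is what lets the autoscaler pods assume the role), you can inspect it with kubectl:
# look for the eks.amazonaws.com/role-arn annotation in the output
kubectl -n kube-system get serviceaccount cluster-autoscaler -o yaml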
3. Set up node auto scaling with the Kubernetes Cluster Autoscaler. Run the commands below to install the Cluster Autoscaler:
# install cluster autoscaler
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
# Edit deployment
kubectl -n kube-system edit deployment.apps/cluster-autoscaler
# Update autoscaler deployment flags
# Replace <Your Cluster Name> with cluster name in the script
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<Your Cluster Name>
- --balance-similar-node-groups
- --skip-nodes-with-system-pods=false
When using separate node groups per zone, the --balance-similar-node-groups flag will keep nodes balanced across zones for workloads that do not require topological scheduling. (source) The --skip-nodes-with-system-pods=false flag allows the autoscaler to scale down nodes that run kube-system pods, so the cluster can shrink to the fewest possible nodes.
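The EKS documentation also suggests protecting the autoscaler pod itself from eviction and pinning the image to a release that matches your cluster's Kubernetes minor version. A hedged sketch (the image registry and tag below are placeholders; pick the one matching your cluster):
# prevent the autoscaler from evicting its own pod
kubectl patch deployment cluster-autoscaler \
  -n kube-system \
  -p '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}}}}}'
# pin the autoscaler image to a version matching your cluster
kubectl set image deployment cluster-autoscaler \
  -n kube-system \
  cluster-autoscaler=k8s.gcr.io/autoscaling/cluster-autoscaler:<version matching your cluster>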
4. Then, check the running pods to ensure the autoscaler has been deployed successfully.
kubectl get pods --all-namespaces
You can also view the pods running in kube-ops-view.
Step VII: Scaling GPU Nodes from 0 with Kubernetes Deployment
Code reference: EKS GPU Cluster from Zero to Hero [1].
Now, we are going to test the scaling of GPU nodes from 0. Before starting, list the worker nodes that are currently running in the cluster.
kubectl get nodes --output="custom-columns=NAME:.metadata.name,ID:.spec.providerID,TYPE:.metadata.labels.beta\.kubernetes\.io\/instance-type"
Then, create a vector-add-dpl.yml file. The following is a sample Deployment that rolls out a ReplicaSet with a single cuda-vector-add Pod, each Pod requesting one GPU via an nvidia.com/gpu: 1 resource limit.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-vector-add
  labels:
    app: cuda-vector-add
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cuda-vector-add
  template:
    metadata:
      name: cuda-vector-add
      labels:
        app: cuda-vector-add
    spec:
      nodeSelector: # refer to the node group's label
        nvidia.com/gpu: "true"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: cuda-vector-add
          # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
          image: "k8s.gcr.io/cuda-vector-add:v0.1"
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU
Apply the manifest file to the cluster with the following command:
kubectl create -f aws/vector-add-dpl.yml
You can observe the Auto Scaler operations through the kube-ops-view:
Before scaling, the pods are pending, waiting for a node to launch:
Or, you can also check the real-time auto-scaling logs using kubectl
:
kubectl logs deployment/cluster-autoscaler -n kube-system -f
The scale-up logs will be shown as below:
Instances are scaled up in roughly 15 minutes. After scaling, you will see that the pod has been placed onto a node:
Pod information can be viewed through the following commands:
kubectl describe pod <your-pod-id>
Sample Output of Pod Events:
Sample Output of Running Nodes:
$ kubectl get nodes --output="custom-columns=NAME:.metadata.name,ID:.spec.providerID,TYPE:.metadata.labels.beta\.kubernetes\.io\/instance-type,NODEGROUP:.metadata.labels.alpha\.eksctl\.io\/nodegroup-name"
NAME                              ID                                      TYPE         NODEGROUP
ip-192-168-176-139.ec2.internal   aws:///us-east-1a/i-0c2a077d4f6c76ee3   t3.small     ng-spot-cpu-4
ip-192-168-199-152.ec2.internal   aws:///us-east-1b/i-0801904aa3f9a0682   p3.2xlarge   gpu-spot-ng-4
You may also test the scaling by adjusting the replicas and checking the results:
# Adjust pod number
kubectl scale --replicas=3 deployment/cuda-vector-add
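To watch the scale-out from the workload side, you can leave a watch running in another terminal; pods stay Pending until a new GPU node registers, then move to Running:
# watch the cuda-vector-add pods and the node list as the autoscaler reacts
kubectl get pods -l app=cuda-vector-add -w
kubectl get nodes -w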
Step VIII: Deploy AWS Node Termination Handler
With the AWS Node Termination Handler (NTH), Kubernetes can react appropriately to the kinds of events that may cause your EC2 instances to become unavailable, such as EC2 maintenance events, EC2 Spot interruptions, ASG scale-in, ASG AZ rebalancing, and EC2 instance termination via the API or console. If these events are not handled, your application code may fail to stop gracefully, take longer to recover full availability, or accidentally schedule work onto nodes that are going down.
Note: You may skip this step if you are using managed node groups, as interruption handling is performed automatically by EKS.
helm repo add eks https://aws.github.io/eks-charts
helm install aws-node-termination-handler \
  --namespace kube-system \
  --version 0.15.4 \
  --set nodeSelector.type=self-managed-spot \
  eks/aws-node-termination-handler
Deployment Output:
Then, you will see a new pod running in the kube-system namespace.
$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE
kube-system   aws-node-termination-handler-wtd4k   1/1     Running   0          23s
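If you want to confirm that the handler is watching for interruption notices, its logs can be tailed with a label selector; the selector below assumes the default labels applied by the eks/aws-node-termination-handler chart:
# tail the termination handler logs
kubectl -n kube-system logs -l app.kubernetes.io/name=aws-node-termination-handler --tail=20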
Last Step: Clean Up Cluster
Last but not least, make sure you delete all the resources created at the end of the tutorial to avoid being shocked when you receive your AWS bill at the end of the month.
$ eksctl delete cluster --region=us-east-1 --name=<your-cluster-name>
$ aws cloudformation delete-stack --stack-name my-eks-vpc-stack --region us-east-1
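Both delete commands return immediately. If you want to block until the VPC stack is actually gone (so nothing keeps billing), the CloudFormation waiter can be used with the same stack name as above:
# wait until the stack deletion has completed
aws cloudformation wait stack-delete-complete --stack-name my-eks-vpc-stack --region us-east-1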
References
[1] Nullius. (2019). EKS GPU Cluster from Zero to Hero. https://alexei-led.github.io/.
[2] Taber. (2020). De-mystifying cluster networking for Amazon EKS worker nodes. Amazon Web Services Blog.
[3] Iglesias, Claudia, Sunil. (2021). 23-kubeflow-spot-instance Configurations. GitHub.
[4] Sunil, A. (2020). How we Reduced our ML Training Costs by 78%!!. Medium.
[5] Pinhasi, A. (2021). A Step by Step Guide to Building a Distributed, Spot-based Training Platform on AWS Using TorchElastic and Kubernetes. Medium.
[6] Autoscaling. (2019). Amazon EKS User Guide Documentation.