Automated AWS EKS Upgrade using GitHub Actions (Kubernetes)

Aditya Bhangle
KPMG UK Engineering
5 min read · Jan 9, 2024

Introduction — What is EKS?

  • Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes service to run Kubernetes in the AWS cloud.
  • Amazon EKS automatically manages the availability and scalability of the Kubernetes control plane nodes responsible for scheduling containers, managing application availability, storing cluster data, and other key tasks.
  • With Amazon EKS, you can take advantage of all the performance, scale, reliability, and availability of AWS infrastructure, as well as integrations with AWS networking and security services.
AWS EKS Architecture Overview

About the Implementation

This implementation aims to automate the entire AWS EKS (Kubernetes) upgrade process and reduce human intervention to a bare minimum. Per the AWS documentation, upgrading an EKS cluster along with its dependencies (the different types of compute in use, add-ons, etc.) involves at least 9–10 steps that the user needs to follow, which takes considerable effort and time.

Also, to do the same for multiple clusters, the entire set of steps needs to be repeated, which invites errors and misconfigurations along the way.

With this implementation, we have reduced both the time taken and the human effort needed to upgrade the AWS EKS cluster and all its dependencies. And since the entire upgrade process is automated, mainly using Terraform scripts for deployment plus Python/Bash scripts for the upgrade steps, it is easily repeatable and reusable across multiple environments and projects with minimal changes.

The entire automation is executed through GitHub Action workflows and is triggered via Cron Jobs in a pre-defined order of environments.
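As an illustrative sketch (the environment names and cron schedules below are assumptions, not the project's actual configuration), the pre-defined order of environments might be encoded like this, with each environment's cron trigger spaced out after the previous one:

```python
# Illustrative only: environment names and schedules are assumptions,
# not this project's real configuration.

# Pre-defined upgrade order: lower environments first, production last.
UPGRADE_ORDER = ["dev", "test-1", "test-2", "preprod", "prod"]

# One cron trigger per environment, staggered across the week so that
# only one upgrade runs at a time, with a gap before the next one.
CRON_SCHEDULES = {
    "dev":     "0 2 * * 1",  # Monday 02:00
    "test-1":  "0 2 * * 2",  # Tuesday 02:00
    "test-2":  "0 2 * * 3",
    "preprod": "0 2 * * 4",
    "prod":    "0 2 * * 5",  # Friday 02:00, after all lower envs
}

def schedule_day(env: str) -> int:
    """Return the day-of-week field of an environment's cron expression."""
    return int(CRON_SCHEDULES[env].split()[4])

# Sanity check: each environment is scheduled after the previous one.
assert all(
    schedule_day(a) < schedule_day(b)
    for a, b in zip(UPGRADE_ORDER, UPGRADE_ORDER[1:])
)
```

In GitHub Actions, each of these cron expressions would appear under a workflow's `on.schedule` trigger.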

Reusable

Designed so that it is easily reusable across environments and projects. It consists of parameterised Terraform and Python/Bash scripts for the upgrade steps, which can easily be plugged into a new environment.

Fully Automated Process

The entire AWS EKS upgrade process is fully automated and requires minimal human intervention. The GitHub Actions workflows are triggered via cron jobs at predefined intervals as per the project/environment requirements.

Checks in place so failures do not cascade

Designed such that, if one of the upgrades fails, the next one in the order does not get executed. This ensures cluster upgrades are done safely, so multiple environments are never broken at the same time.

Challenges

  • AWS lets you upgrade EKS through the console, but a few things still have to be done manually, such as:
    – Upgrading the Fargate nodes
    – Upgrading the launch template to reflect the upgraded-version AMIs and then linking it to the Auto Scaling groups
    – Upgrading the add-ons present on the AWS EKS cluster
  • Cluster/infra deployment, along with all its dependencies, is automated and handled by Terraform scripts; this becomes difficult to keep consistent if the entire process is manual
  • Steps as per the AWS documentation — https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html
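The documented sequence can be summarised as an ordered plan per cluster. This is a minimal sketch: the step names are an illustrative summary of the AWS upgrade guide, not an official API or this project's actual script.

```python
def build_upgrade_plan(has_fargate: bool, has_managed_nodes: bool,
                       has_self_managed_nodes: bool) -> list:
    """Return the ordered upgrade steps for one cluster.

    Step names are illustrative labels summarising the AWS upgrade
    guide, not real commands or API calls.
    """
    plan = ["upgrade-control-plane"]          # control plane always first
    if has_managed_nodes:
        plan.append("upgrade-managed-node-groups")
    if has_self_managed_nodes:
        # New AMI goes into the launch template, then the ASG is rolled.
        plan += ["update-launch-template-ami", "roll-auto-scaling-group"]
    if has_fargate:
        # Fargate pods pick up the new version when redeployed.
        plan.append("restart-fargate-profiles")
    plan.append("upgrade-addons")             # vpc-cni, kube-proxy, coredns, ...
    return plan
```

For example, a cluster with managed node groups and Fargate (but no self-managed workers) yields a four-step plan ending with the add-on upgrades.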

Design Considerations

To ensure smooth upgrades, we designed the GitHub Actions workflows in such a way that:

• All the workflows are automated and executed using periodic cron jobs

• The workflows can handle EKS clusters in multiple AWS accounts and multiple environments, e.g. dev, test-1, test-2, preprod, prod, etc.

• The order in which these environments get upgraded is pre-defined, and only one upgrade runs at a time, with sufficient time gaps before the next one kicks in

• There are checks in place such that, if one of the upgrades fails for any reason, the next one in the order does not get executed. This ensures cluster upgrades are done safely, so multiple environments are never broken at the same time. Only when the failed workflow for a cluster is fixed and executed successfully will the next cluster in the order go for its upgrade

• The upgrades perform a rolling update to ensure all services keep running at all times

• The status of each upgrade is alerted to users via Teams channels
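The failure gate described above can be sketched as a simple loop. Here `run_upgrade` is a hypothetical callable standing in for triggering one cluster's upgrade workflow; the real implementation lives in the GitHub Actions workflows themselves.

```python
# Illustrative gating logic: run cluster upgrades strictly in order and
# stop at the first failure, so later environments are never touched
# while an earlier one is broken.

def upgrade_in_order(environments, run_upgrade):
    """Upgrade each environment in turn.

    `run_upgrade(env)` is a hypothetical stand-in returning True on a
    successful workflow run. Returns (succeeded, skipped): environments
    upgraded successfully, and those skipped after the first failure.
    """
    succeeded, skipped = [], []
    for i, env in enumerate(environments):
        if not run_upgrade(env):             # this upgrade failed
            skipped = environments[i + 1:]   # do not touch anything later
            break
        succeeded.append(env)
    return succeeded, list(skipped)
```

Only once the failed environment's workflow is fixed and re-run successfully would a subsequent scheduled run proceed past it.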

Compute Components — Key Differences

  • AWS Managed Node Groups
    – AWS-managed EC2 instances
    – The user chooses predefined AMI types given by AWS (Linux/Windows), instance type, disk size, and min–max nodes
    – Infra “maintained” by AWS — creates a predefined Auto Scaling group + launch template
    – Simpler EC2 setup
  • Self Managed Worker Groups
    – Self-managed EC2 instances
    – These are part of an Auto Scaling group. A launch template is defined with the preferred configuration, such as instance type, AMI (image), volume size, etc.
    – The user also defines the minimum and maximum number of instances required. Depending on load/traffic/CPU/memory, the number of instances either increases or decreases
    – It is the user’s responsibility to build and maintain the infra
  • Fargate Nodes
    – AWS-managed, right-sized compute. AWS deploys the selected applications on AWS-managed containers and handles scaling
    – The user doesn’t have to define a launch configuration. AWS deploys the right-sized Fargate containers required to run the application

Workflow — Features

AWS EKS Upgrade Workflow Features

AWS EKS Upgrade Workflow in GitHub Actions

EKS Upgrade with GitHub Actions

Conclusion

If all goes well, at the end of the pipeline execution, you have:

  • EKS cluster upgraded to the next available version (it upgrades by one version at a time, in case the cluster was behind by multiple versions)
  • Fargate node profiles upgraded to the current EKS version after step 1
  • Custom node groups (EC2s) running AMIs of the current EKS version after step 1
  • EKS add-ons upgraded to their next versions (upgraded by one minor version at a time, in case an add-on was behind by multiple versions)
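The one-version-at-a-time behaviour can be illustrated with a small helper. This is a sketch assuming versions of the form `major.minor` (as EKS uses, e.g. `1.27`); a cluster several versions behind converges over multiple scheduled runs.

```python
# Illustrative one-step version bump: clusters and add-ons only move one
# minor version per pipeline run, never jumping straight to the latest.

def next_minor(current: str, latest: str) -> str:
    """Return the next minor version toward `latest`, one step at a time."""
    cur_major, cur_minor = map(int, current.split("."))
    lat_major, lat_minor = map(int, latest.split("."))
    if (cur_major, cur_minor) >= (lat_major, lat_minor):
        return current                        # already up to date
    return f"{cur_major}.{cur_minor + 1}"     # bump exactly one minor version
```

So a cluster on `1.25` with `1.28` available is upgraded to `1.26` on the first run, `1.27` on the next, and so on.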
