Our evolving EKS environments

Published in Funding Circle, Jun 22, 2022

Funding Circle’s evolving Kubernetes platform… introducing Managed Node Groups

A few months ago I published a couple of blog posts detailing our Kubernetes platform here at Funding Circle. As a follow-on from those introductory posts, this next post in the Kubernetes series takes a deep dive into one of the evolutions of our Kubernetes environment: the use of EKS Managed Node Groups.

Old world

When we first started using Kubernetes and EKS, we took the same approach to our Kubernetes worker nodes as we had with our Apache Mesos nodes: AWS EC2 Auto Scaling Groups (ASGs). ASGs gave us a mechanism that integrated with the Kubernetes cluster-autoscaler so that we could automatically scale our cluster based on load.

The issue we had with ASGs was the management of the worker nodes, especially when updating the EC2 Launch Configuration, for example as part of an EKS cluster version upgrade. This process involved a lot of manual “rotation” of the worker nodes to pick up new changes. Our goal was to simplify, and as much as possible automate, cluster upgrades.

EKS Managed Node Groups

We evaluated a number of possible solutions, including:

Developing our own tooling to automate our manual process with ASGs
Using a third-party / open source tool to handle upgrades
Using the new EKS Managed Node Groups feature, an AWS-native way to manage our worker nodes

We completed some investigations into the different solutions and decided to move forward with Managed Node Groups because it:

is an AWS native solution, with integration with the cloud provider
automatically updates after EKS cluster upgrades
requires no maintenance of bespoke tooling
doesn’t introduce a dependency on 3rd party tooling

Migration Process

The first stage of the migration process was to migrate the EC2 Launch Configurations in Terraform to Launch Templates, along with any user data for the nodes.

The second stage was to introduce new Terraform aws_eks_node_group resources matching the configuration of the existing aws_autoscaling_group resources. Some simplifications were possible, though. For example, Managed Node Groups support setting node labels and taints through the AWS EKS API and the Terraform resource, instead of in user data. We moved these values to the new aws_eks_node_group resources.
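
To illustrate what moving labels and taints out of user data looks like at the API level, here is a minimal Python sketch that builds the arguments for the EKS `create_nodegroup` call as exposed by boto3. It is not our Terraform configuration: the helper `build_nodegroup_request` and every value passed to it (cluster name, role ARN, subnet IDs, label and taint keys) are hypothetical and exist only for this example.

```python
from __future__ import annotations

from typing import Any

# Taint effects accepted by the EKS API (spelled differently from Kubernetes' own).
EKS_TAINT_EFFECTS = {"NO_SCHEDULE", "NO_EXECUTE", "PREFER_NO_SCHEDULE"}


def build_nodegroup_request(
    cluster: str,
    name: str,
    node_role_arn: str,
    subnets: list[str],
    launch_template_name: str,
    launch_template_version: str,
    labels: dict[str, str],
    taints: list[dict[str, str]],
    min_size: int,
    max_size: int,
    desired_size: int,
) -> dict[str, Any]:
    """Build keyword arguments for boto3's EKS create_nodegroup call."""
    for taint in taints:
        if taint.get("effect") not in EKS_TAINT_EFFECTS:
            raise ValueError(f"unsupported taint effect: {taint.get('effect')!r}")
    if not min_size <= desired_size <= max_size:
        raise ValueError("desired_size must be between min_size and max_size")
    return {
        "clusterName": cluster,
        "nodegroupName": name,
        "nodeRole": node_role_arn,
        "subnets": subnets,
        # The launch template still carries the AMI, user data and so on.
        "launchTemplate": {
            "name": launch_template_name,
            "version": launch_template_version,
        },
        "scalingConfig": {
            "minSize": min_size,
            "maxSize": max_size,
            "desiredSize": desired_size,
        },
        # Labels and taints are first-class fields, not kubelet flags in user data.
        "labels": labels,
        "taints": taints,
    }


# Hypothetical values, for illustration only.
request = build_nodegroup_request(
    cluster="example-cluster",
    name="example-workers",
    node_role_arn="arn:aws:iam::123456789012:role/example-node-role",
    subnets=["subnet-aaaa", "subnet-bbbb"],
    launch_template_name="example-workers",
    launch_template_version="3",
    labels={"workload": "general"},
    taints=[{"key": "dedicated", "value": "ci", "effect": "NO_SCHEDULE"}],
    min_size=1,
    max_size=5,
    desired_size=2,
)
```

With AWS credentials configured, the resulting dictionary could be passed as `boto3.client("eks").create_nodegroup(**request)`. In Terraform the equivalent fields live directly on the aws_eks_node_group resource.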

When we initially introduced the Managed Node Groups, they ran alongside the existing ASGs. We then manually drained the ASG nodes, causing all pods to move to the new Managed Node Group nodes.
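
The drain step was manual for us, but it could be scripted roughly as below. This sketch assumes that the old ASG nodes can be told apart by the absence of the `eks.amazonaws.com/nodegroup` label, which EKS applies to Managed Node Group nodes; that selector, the ten-minute timeout and the function names are assumptions for illustration, so check them against your own cluster before relying on them.

```python
from __future__ import annotations

import subprocess

# Label that EKS sets on nodes belonging to a Managed Node Group.
MANAGED_LABEL = "eks.amazonaws.com/nodegroup"


def list_unmanaged_nodes_cmd() -> list[str]:
    """kubectl command listing nodes that do NOT carry the managed label."""
    return ["kubectl", "get", "nodes", "-l", f"!{MANAGED_LABEL}", "-o", "name"]


def drain_cmd(node: str) -> list[str]:
    """kubectl command evicting all evictable pods from one node."""
    return [
        "kubectl", "drain", node,
        "--ignore-daemonsets",
        "--delete-emptydir-data",
        "--timeout=10m",
    ]


def drain_unmanaged_nodes(run=subprocess.run) -> list[str]:
    """Cordon every unmanaged node, then drain them one at a time."""
    result = run(list_unmanaged_nodes_cmd(), check=True, capture_output=True, text=True)
    # "-o name" prints "node/<name>"; keep only the name part.
    nodes = [line.split("/", 1)[-1] for line in result.stdout.split()]
    # Cordon everything first so evicted pods cannot land on another old node.
    for node in nodes:
        run(["kubectl", "cordon", node], check=True)
    for node in nodes:
        run(drain_cmd(node), check=True)
    return nodes
```

Calling `drain_unmanaged_nodes()` with no arguments runs the commands against whatever cluster the current kubectl context points at. The `run` parameter exists so the sequence can be dry-run or tested with a stand-in.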

Finally, the ASGs were scaled down to 0 using the Terraform configuration, and eventually removed once we were confident the migration had been successful.

Issues

The migration has been largely successful, with a few issues:

Automatic updating following a cluster upgrade has been disabled. We experienced issues with the Launch Templates automatically updating following a cluster upgrade: the update would time out before it completed and revert to the old launch template version. After some investigation and experimentation, we created a script that triggers the Managed Node Group updates in small batches of node groups rather than all at once. This seems to have alleviated the failures when updating the launch template version. It also means the nodes are rotated in a slower, more controlled way, so pods are less likely to be rescheduled multiple times. This is especially useful for production clusters.
Jenkins agents drained during builds. Our clusters host a Jenkins build agent. During the upgrade process this agent could be evicted mid-build, causing the build to fail. We have reduced the impact of this by taking the Jenkins agent offline during an upgrade, so builds are queued until the agent has been moved.
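
The batching idea behind the update script can be sketched as follows. This is not the script itself: the batch size, polling interval and function names are assumptions chosen for illustration. The `eks` argument is expected to be a boto3 EKS client, or anything else that provides `update_nodegroup_version` and `describe_update` with the same keyword arguments.

```python
from __future__ import annotations

import time
from typing import Any, Callable, Iterator


def batches(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def update_in_batches(
    eks: Any,
    cluster: str,
    nodegroups: list[str],
    launch_template_name: str,
    launch_template_version: str,
    batch_size: int = 2,
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Update node groups a few at a time, waiting for each batch to finish."""
    for group in batches(nodegroups, batch_size):
        # Start the update for every node group in this batch.
        pending: dict[str, str] = {}
        for nodegroup in group:
            response = eks.update_nodegroup_version(
                clusterName=cluster,
                nodegroupName=nodegroup,
                launchTemplate={
                    "name": launch_template_name,
                    "version": launch_template_version,
                },
            )
            pending[nodegroup] = response["update"]["id"]

        # Poll until the whole batch has succeeded; stop on the first failure.
        while pending:
            for nodegroup, update_id in list(pending.items()):
                status = eks.describe_update(
                    name=cluster,
                    updateId=update_id,
                    nodegroupName=nodegroup,
                )["update"]["status"]
                if status == "Successful":
                    del pending[nodegroup]
                elif status in ("Failed", "Cancelled"):
                    raise RuntimeError(f"update of {nodegroup} ended as {status}")
            if pending:
                sleep(poll_seconds)
```

A call might look like `update_in_batches(boto3.client("eks"), "example-cluster", ["ng-a", "ng-b", "ng-c"], "example-workers", "4")`, where all of the names are placeholders. Stopping on the first failed update leaves the remaining node groups untouched, which keeps the blast radius small on production clusters.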

Conclusion

Overall, the migration has greatly improved our process for upgrading the EKS clusters, while still maintaining support for the cluster autoscaler.

We have seen a decrease in the time it takes to apply changes to a cluster. With manual rotation of nodes, the process would take up to half a day. The Managed Node Group process takes 30–60 minutes per cluster, depending on the number of node groups and nodes, and much of that time is automated.

Managing Managed Node Groups still requires some manual steps, but we are working on further lightweight automation to remove the toil from this process.