Streamlining Machine Image Updates: Automating Node Rotation in Kubernetes Clusters

Vlad Losev
Published in Sage Ai
6 min read · Jun 18, 2024

The Problem

The Sage AI infrastructure team is responsible for the maintenance and stability of our services and systems in every environment. This means that sometimes we need to update the nodes in our Kubernetes clusters with a new machine image within a relatively short time frame (and with minimal impact on service uptime), for example when a security vulnerability is discovered in our current machine image. We use the Amazon cloud, so it's Amazon Machine Images (AMIs) that we're dealing with, but both the problem and the solution apply to Kubernetes clusters in any cloud context.

In our case, applying such an update means rolling out a new AMI ID; we then need to make sure the nodes are rotated to pick up the new AMI within the required time frame.

In the past, we used a custom shell script that would detect nodes with obsolete AMIs and drain them of existing workloads (the pods running on them need to be terminated in a controlled manner in order to prevent disruptions to the services they provide). This approach had a number of downsides. First, someone had to invoke and monitor the script manually. Second, we use an authenticating proxy to access the cluster, and draining the node the proxy was running on would cause the script to lose its connection to the server and error out, sometimes several times in a row. Needless to say, running a node rotation would cost the engineer in charge of the task a significant amount of time, attention, and manual effort.

Designing the solution

We decided to find a solution that would drastically reduce the load on the engineers responsible for updating the AMIs in our clusters. We knew we needed a tool that could automatically rotate the nodes running outdated AMIs, but that is a broad ask. Unpacking it a bit more, we came up with some requirements for the project:

Requirements

  • We wanted a tool that was easy to operate, ideally with zero supervision. Engineers' time is valuable, and it makes little sense to waste it babysitting a cranky tool.
  • We wanted a tool that was simple to maintain and update when necessary. We use a lot of third-party tools and periodically have to upgrade essentially all of them; some are much harder to update than others.
  • We wanted a solution that let us use the well-established monitoring and observability processes we have set up for workloads running in our Kubernetes clusters.

Our solution

We decided to design our system to work in two separate stages: identifying the nodes that need to be cycled out, and then draining those nodes of their workloads.

In a nutshell, draining a node means marking it so that Kubernetes schedules no new workloads onto it and then evicting the pods running on it in a controlled manner; replacement capacity is brought up elsewhere in the cluster to absorb the evicted workloads.
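
To make that concrete, here is a minimal client-go sketch of what a drain amounts to, assuming an already configured clientset. The names and error handling are illustrative only; a production drainer (including ours) also skips DaemonSet and mirror pods and enforces timeouts:

    package drainer

    import (
        "context"
        "fmt"

        policyv1 "k8s.io/api/policy/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // drainNode cordons a node and then asks the API server to evict each of
    // its pods, which honors PodDisruptionBudgets and lets controllers bring
    // up replacement pods on other nodes.
    func drainNode(ctx context.Context, clientset *kubernetes.Clientset, nodeName string) error {
        // Cordon: mark the node unschedulable so nothing new lands on it.
        node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
        if err != nil {
            return err
        }
        node.Spec.Unschedulable = true
        if _, err := clientset.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
            return err
        }

        // Evict: request a graceful eviction for every pod on the node.
        // NOTE: a production drainer also skips DaemonSet and mirror pods.
        pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
            FieldSelector: "spec.nodeName=" + nodeName,
        })
        if err != nil {
            return err
        }
        for _, pod := range pods.Items {
            eviction := &policyv1.Eviction{
                ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
            }
            if err := clientset.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
                return fmt.Errorf("evicting %s/%s: %w", pod.Namespace, pod.Name, err)
            }
        }
        return nil
    }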

There were a few reasons for splitting the system into these two stages:

  • It allows us to evolve and maintain the two distinct components independently.
  • The components have different lifecycle expectations: the node selection component may need to change from time to time as our selection criteria or the underlying infrastructure change, while the component responsible for draining the nodes remains relatively stable.
  • The node selection component interacts primarily with the AWS API while the node drainer is a Kubernetes controller.

One final reason for developing the components separately is that they are written in different languages, by both necessity and design. The component responsible for selecting nodes benefits from the ease and brevity of Bash combined with mature tools such as the AWS CLI and jq for JSON manipulation. The Go ecosystem, on the other hand, has extensive support for writing Kubernetes controllers in the form of libraries and documentation, so it made sense to write the component responsible for draining nodes in Go.

Next, we had to decide how the two parts would communicate. The usual way to mark a node for some operation is to label it, but in this case we decided to use a taint. The advantage is that once a node is tainted, no new pods can be scheduled onto it, which prevents evicted pods from landing right back on a node that is about to be cycled out. Thus, the node selection task taints the nodes, and the draining task identifies nodes carrying that taint and drains them of pods. Subsequent termination of the drained nodes is left to the cluster autoscaler.
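
For illustration, this is roughly what applying such a taint looks like with client-go. The taint key here is a placeholder rather than the key our tool actually uses (that is configurable), and in our setup it is the Bash selection script that applies the taint, using the kubectl equivalent of this call:

    package drainer

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // drainTaintKey is a placeholder; the actual taint key the drainer
    // watches for is configurable.
    const drainTaintKey = "example.com/drain-me"

    // taintNodeForDraining marks a node for rotation by adding a NoSchedule
    // taint, so no new pods (including ones evicted from it) can be
    // scheduled onto it.
    func taintNodeForDraining(ctx context.Context, clientset *kubernetes.Clientset, nodeName string) error {
        node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
        if err != nil {
            return err
        }
        for _, t := range node.Spec.Taints {
            if t.Key == drainTaintKey {
                return nil // already marked for draining
            }
        }
        node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
            Key:    drainTaintKey,
            Effect: corev1.TaintEffectNoSchedule,
        })
        _, err = clientset.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
        return err
    }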

Rejected approaches

There are multiple implementations of AWS Lambda based node drainers (such as https://github.com/aws-samples/amazon-k8s-node-drainer), but we wanted a tool designed to run inside the cluster so that we could use the monitoring capabilities we have already established.

Projects already exist that implement node draining based on the AWS Auto Scaling group instance refresh process, https://github.com/rebuy-de/node-drainer being a prime example. We considered it, but that project has not released any updates since March 2023, so adopting it would be risky: if a serious vulnerability were discovered in it or its dependencies, we would not be able to obtain an updated image and would have to start maintaining our own fork.

Implementation

The second part of the solution is implemented as a Kubernetes controller that watches nodes and starts draining a node of pods when it sees the specified draining taint applied to it. It proceeds with draining the node until it finishes or a timeout passes (sometimes pods just don't wanna leave), in which case it retries that node after a back-off interval. The program always serializes node draining, and after finishing with a node it waits until there are no pending pods in the cluster before proceeding. This is done to minimize capacity issues in the cluster: if all the nodes are tainted for draining, it will be impossible to launch new pods until the cluster autoscaler brings up a new node.
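
Heavily simplified, and reusing the helpers from the sketches above, that policy looks something like the following; the real controller is driven by watch events rather than polling and tracks per-node timeouts and back-offs:

    package drainer

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // waitForNoPendingPods blocks until the cluster has no Pending pods, so
    // that we do not start draining the next node while the autoscaler is
    // still bringing up capacity for pods evicted from the previous one.
    func waitForNoPendingPods(ctx context.Context, clientset *kubernetes.Clientset) error {
        for {
            pending, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
                FieldSelector: "status.phase=Pending",
            })
            if err != nil {
                return err
            }
            if len(pending.Items) == 0 {
                return nil
            }
            time.Sleep(30 * time.Second)
        }
    }

    // rotateTaintedNodes drains nodes carrying the draining taint one at a
    // time, pausing between nodes until no pods are pending.
    func rotateTaintedNodes(ctx context.Context, clientset *kubernetes.Clientset) error {
        nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
        if err != nil {
            return err
        }
        for _, node := range nodes.Items {
            for _, t := range node.Spec.Taints {
                if t.Key != drainTaintKey {
                    continue
                }
                // The real controller enforces a per-node timeout and retries
                // failed drains after a back-off instead of bailing out.
                if err := drainNode(ctx, clientset, node.Name); err != nil {
                    return err
                }
                if err := waitForNoPendingPods(ctx, clientset); err != nil {
                    return err
                }
                break
            }
        }
        return nil
    }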

Update sequence diagram

Choosing to release as our first OSS contribution

Sage AI’s teams decided to identify internally developed projects that we believed would benefit the community at large. Our infrastructure team has done a lot of great work in the MLOps space, and we are proud to announce that our automation for the workflow described above has been released to the open source community!

Our approach has been in use internally for over a year now, and the code can now be found here: https://github.com/sageailabs/ektopistis.

Please feel free to take a look, use it in your environment if you find it useful, and contribute feedback or code!

You can install it in your cluster using Helm by following the instructions in the Installation section of the project README. The node selection component is entirely up to you: it can be built around whatever selection criteria and tainting scheme you need. As an example, we have a script that marks all nodes in the cluster that are launched from AWS auto-scaling groups and whose configuration is not up to date with the ASG's launch template.
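
Our actual selection script is written in Bash on top of the AWS CLI and jq, but to illustrate the kind of check it performs, here is a rough Go sketch using the AWS SDK. It is deliberately simplified: it does not resolve $Latest or $Default template versions and skips groups that use launch configurations:

    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/autoscaling"
    )

    func main() {
        ctx := context.Background()
        cfg, err := config.LoadDefaultConfig(ctx)
        if err != nil {
            log.Fatal(err)
        }
        client := autoscaling.NewFromConfig(cfg)

        // Walk every auto-scaling group and flag instances whose launch
        // template version differs from the group's configured version.
        paginator := autoscaling.NewDescribeAutoScalingGroupsPaginator(
            client, &autoscaling.DescribeAutoScalingGroupsInput{})
        for paginator.HasMorePages() {
            page, err := paginator.NextPage(ctx)
            if err != nil {
                log.Fatal(err)
            }
            for _, group := range page.AutoScalingGroups {
                if group.LaunchTemplate == nil {
                    continue // simplification: skip launch-configuration groups
                }
                want := aws.ToString(group.LaunchTemplate.Version)
                for _, inst := range group.Instances {
                    if inst.LaunchTemplate == nil ||
                        aws.ToString(inst.LaunchTemplate.Version) != want {
                        // Rotation candidate: its node would get the draining taint.
                        fmt.Println(aws.ToString(inst.InstanceId))
                    }
                }
            }
        }
    }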

We believe that sharing our tools with the community will lead to better solutions for everyone, and we look forward to your feedback and contributions.

A note on planned future work

Our node drainer implementation suspends its operation while there are pending pods in the cluster to avoid disrupting updates for important workloads, but this has a significant downside: the larger the cluster and the more activity in it, the higher the pod turnover. That means potentially long periods during which pending pods are present, freezing our current version of the node drainer. We have plans to refine this standby policy to speed up the process.
