From Blue to Green: Optimizing AWS EKS Cluster Upgrades with the Blue/Green Strategy

Kareem Mohamed
Published in OneFootball Tech · 6 min read · Jan 16, 2024

Get ready for an inside look at how the platform team here at OneFootball is leveling up our EKS cluster game! Instead of sticking to the traditional in-place upgrade, we’re diving into the world of the Blue/Green strategy.

Now, we’re not here to overwhelm you with lots of fancy tools (although we might sneak a peek at a few that come in very handy).

What we’re excited to share are the lessons we’ve learned and the cool philosophy behind our approach. Imagine managing the tech side of things for over 100 million passionate football fans around the world at OneFootball; it’s quite a ride! So, buckle up for a tech adventure where we break down the why and how without getting too bogged down in the nitty-gritty details!

What prompted the decision to adopt the blue/green strategy over an in-place upgrade?

In the dynamic landscape of our ever-evolving ecosystem, where frequent software releases and security patches abound, our primary concern is ensuring a seamless transition when upgrading the underlying layer.

The goal is to safeguard the user experience and maintain the reliability of our infrastructure amidst these continuous changes. Several challenges loom large in this process, and we aim to shed light on some key considerations.

Firstly, the intricate dance of backward compatibility between infrastructure upgrades and workload resource versions is a delicate balance. Additionally, an AWS EKS cluster upgrade is irreversible: once promoted, there is no way to roll back to the previous version, which raises significant concerns. Embracing the blue/green approach minimizes the risks associated with experimentation in a production environment.
Swift resolution of issues becomes achievable through simple routing changes, enabling a quick return to a stable production state in case of unforeseen complications. This approach fosters a more robust testing environment for new cluster functionalities, providing a secure space for production-like testing without the anxiety of disrupting operations.

Furthermore, with the blue/green approach, we can provision an EKS cluster of any version, independent of where our current production EKS is. For instance, we can promote EKS from 1.24 straight to 1.29, bypassing the need for a gradual, version-by-version upgrade process. This not only saves valuable time but also streamlines the effort required in the upgrade process.

Beyond the technical realm, a notable aspect influencing our decision is the confidence gained from being able to seamlessly destroy and recreate an entire production cluster; think of it from a disaster-recovery perspective.
We view this practice as a healthy one and appreciate the opportunity it presents to start afresh without downtime, contributing to the overall resilience and adaptability of our infrastructure.

How did we do it and what were the challenges?

To begin with, there is no prescription for how to achieve an EKS upgrade following Blue/Green. There are few to no established practices across the industry, so you have to understand the EKS ecosystem very well, gather its limitations, blockers, and prerequisites, and then pave your own way through it.

Now, here’s the key: no sophistication. We aim to keep things simple. The goal is progress, not perfection. We make things work, and then we keep making them better bit by bit.

To start with, we faced a concurrency challenge. We have special workloads (we call them workers) that cannot run concurrently, which means we cannot have the same workers running in two different clusters at the same time. So we had to find a way to make sure that the workers only ever run in one place until the end of the blue/green upgrade.

Let’s break it down in simpler terms! So, here’s the deal (it’s time for a little tech talk): we’ve got a gigantic monorepo where we keep all our application Helm charts, and we use ArgoCD for continuous delivery from this monorepo into our EKS clusters.

Our ArgoCD is set up to deploy charts as soon as a new cluster (i.e., Green) comes online. But we don’t want the workers in the new cluster yet (that’s the concurrency challenge). Tricky situation for us!
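Before we get to the fix, here is a rough sketch of how that kind of ArgoCD wiring can look, using an ApplicationSet with a cluster generator. The repository URL, chart path, and names below are placeholders for illustration, not our actual monorepo layout:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: app-charts            # hypothetical name
  namespace: argocd
spec:
  generators:
    # The cluster generator emits one Application per cluster registered in ArgoCD,
    # which is why charts start rolling out as soon as the Green cluster is added.
    - clusters: {}
  template:
    metadata:
      name: "my-app-{{name}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/helm-monorepo   # placeholder monorepo
        targetRevision: main
        path: charts/my-app                                     # placeholder chart path
      destination:
        server: "{{server}}"
        namespace: my-app
      syncPolicy:
        automated: {}
```

With something like this in place, simply registering the Green cluster in ArgoCD is enough to kick off deployments there, including the workers we want to hold back.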

So, what do we do? We take a different route! We temporarily stop some workloads from going to the new “Green” EKS cluster. It’s like pressing pause on a specific group of tasks. Once we’ve completed the traffic switch using DNS, we hit pause on the old “Blue” setup and unpause those tasks on the new “Green” EKS.

Confusing, right? Let’s check the diagram for a better understanding.

“Temporarily Blocking Workers for Deployment on the ‘Green’ EKS Cluster”

As we can see, whenever we’re switching the traffic to the “Green” cluster, we use Kyverno policies (Kyverno is an admission controller that can limit requests to create, delete, and modify objects based on a defined policy) to put a temporary hold on the sync between ArgoCD and that cluster. Once we’ve completely shifted all the traffic to the “Green” cluster, we lift this hold and let the sync resume. But here’s the twist: while we’re doing that, we also put a hold on the sync to the “Blue” cluster, and as we mentioned, this only affects the workers.

“Permanently Blocking Workers for Deployment on the ‘Blue’ EKS Cluster”

Since we talked about Kyverno, here’s a quick peek into how we set up Kyverno policies to handle the blocking process.

Simply put, we use a ClusterPolicy to match resources that follow a specific naming convention, starting with “worker-*”. When it finds a match, it stops those resources from being deployed into the cluster.
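Here’s a minimal sketch of what such a policy can look like. The policy name, resource kinds, and exact wildcard are illustrative assumptions rather than our production policy:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: block-worker-workloads          # hypothetical name
spec:
  validationFailureAction: Enforce      # reject matching requests instead of just reporting them
  background: false
  rules:
    - name: deny-worker-resources
      match:
        any:
          - resources:
              kinds:                    # the kinds listed here are assumptions for this sketch
                - Deployment
                - StatefulSet
              names:
                - "worker-*"            # match resources that follow the worker naming convention
      validate:
        message: "Worker workloads are blocked on this cluster during the blue/green upgrade."
        deny:
          conditions:
            any:
              - key: "{{ request.operation }}"
                operator: AnyIn
                value:                  # block creates and updates of matching resources
                  - CREATE
                  - UPDATE
```

With a policy like this applied, ArgoCD’s sync of the worker charts gets rejected at admission time, which is exactly the “hold” described above; removing (or relaxing) the policy lets the sync go through again.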

Now, let’s talk about shifting the traffic between the two clusters. Since we have both the “Blue” and “Green” clusters, we want to move the traffic gradually from the “Blue” to the “Green”. To manage this, we use the AWS Route53 weighted routing feature. It helps us control the percentage of requests going to each cluster. By keeping an eye on metrics and the number of unsuccessful requests, we can understand how well the “Green” cluster is doing. If things go wrong, we can easily switch back to the “Blue” cluster to avoid any serious issues in the system.
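To make that concrete, here’s a small sketch of the weighted records in CloudFormation-style YAML, assuming each cluster exposes its own load balancer hostname. The hosted zone, record name, and hostnames are placeholders:

```yaml
# Two weighted records for the same DNS name; Route53 splits traffic
# according to the Weight values (here 90% Blue / 10% Green).
Resources:
  BlueRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.               # placeholder hosted zone
      Name: api.example.com.                     # placeholder record name
      Type: CNAME
      TTL: "60"
      SetIdentifier: blue-cluster
      Weight: 90
      ResourceRecords:
        - blue-ingress.example.com               # placeholder LB hostname of the Blue cluster
  GreenRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: api.example.com.
      Type: CNAME
      TTL: "60"
      SetIdentifier: green-cluster
      Weight: 10
      ResourceRecords:
        - green-ingress.example.com              # placeholder LB hostname of the Green cluster
```

Shifting traffic is then just a matter of adjusting the two weights step by step (for example 90/10, then 50/50, then 0/100), and rolling back simply means setting the Blue weight back to 100.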

After discussing various aspects like applying policies and shifting traffic, let’s dive into how we tackled the actual work — did we handle it manually or leverage some automated magic?

From the start, our focus was on something other than achieving perfection. Instead, we aimed to evaluate if this approach was viable or if we needed to explore alternative solutions. So, we kicked things off by embracing imperfection and delving into manual tasks. Why?

Well, working manually helped us identify the toil tasks — those repetitive and time-consuming jobs. By getting our hands dirty with manual work, we gained insights into what needed attention and where we could later introduce automation to eliminate toil.

Let’s dive into our journey and see how things have changed.

Initially, we dealt with a bunch of complicated pull requests. These were created manually, and they involved a large number of files.

Then, we decided to level up our approach. While the number of files stayed the same, we stepped up our game by using a GitHub Action. This helped us create some of those pull requests automatically, which means less manual work, making our process more efficient and introducing a first layer of automation.
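As a rough illustration, a workflow along these lines can generate the repetitive file changes and open the pull request for us. The script, paths, and inputs are hypothetical, and the community peter-evans/create-pull-request action is just one possible building block:

```yaml
name: prepare-green-cluster
on:
  workflow_dispatch:
    inputs:
      cluster_name:
        description: "Name of the new (Green) EKS cluster"    # hypothetical input
        required: true
      kubernetes_version:
        description: "Target EKS version for the new cluster" # hypothetical input
        required: true

jobs:
  open-pr:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Generate cluster manifests
        # Hypothetical script that templates the repetitive files for the new cluster
        run: ./scripts/generate-cluster.sh "${{ inputs.cluster_name }}" "${{ inputs.kubernetes_version }}"
      - name: Open a pull request with the generated changes
        uses: peter-evans/create-pull-request@v6
        with:
          branch: "add-${{ inputs.cluster_name }}"
          title: "Add ${{ inputs.cluster_name }} EKS cluster"
          commit-message: "Add ${{ inputs.cluster_name }} EKS cluster"
```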

Now, things have gotten even smoother. We use Backstage, a popular Internal Developer Platform (IDP) tool, to create new clusters. With just a few settings, we can effortlessly spin up a brand-new EKS cluster.
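For a feel of what the Backstage side can look like, here’s a trimmed-down scaffolder template sketch. The template name, parameters, and target repository are assumptions for illustration, and the steps that actually render the cluster files are omitted for brevity:

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: eks-blue-green-cluster          # hypothetical template name
  title: Provision a new EKS cluster
spec:
  owner: platform-team                  # hypothetical owner
  type: infrastructure
  parameters:
    - title: Cluster settings
      required: [clusterName, kubernetesVersion]
      properties:
        clusterName:
          type: string
          description: Name of the new (Green) cluster
        kubernetesVersion:
          type: string
          description: Target EKS version
  steps:
    # A fetch:template step that renders the cluster files would normally come first.
    - id: open-pr
      name: Open pull request with the cluster definition
      action: publish:github:pull-request
      input:
        repoUrl: github.com?owner=example-org&repo=infrastructure   # placeholder repository
        branchName: add-${{ parameters.clusterName }}
        title: Add ${{ parameters.clusterName }} EKS cluster
        description: Generated by the Backstage EKS cluster template
```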

Conclusion

What we’ve achieved so far is like a beta version of what we’re aiming for. There’s still work to be done and more improvements to make. Along the way, we learned that there’s no need to be scared of changes, because we have control over the whole process. This boosts our confidence, especially when upgrading critical components like API gateways, where we need a safe and efficient strategy to keep our system running smoothly.

However, it’s essential to note that while this approach works well for our setup, there are extra considerations, especially when dealing with clusters that run stateful applications. In such cases, handling scenarios like data replication between the blue and green clusters becomes a crucial factor to consider.
