Achieve a smaller blast radius with Highly Available Kubernetes clusters

Tanat Lokejaroenlarb
Published in Adevinta Tech Blog
7 min read · Jul 7, 2022

Delve into the nitty-gritty of crafting highly available Kubernetes clusters. Discover not only the “how” but also the “why” behind the configurations, tools and practices that empower these clusters to be the robust backbone of countless services globally.

At Adevinta, we operate an internal Platform as a Service (PaaS): The Common Platform, where Software Engineers from across Adevinta marketplaces can build, deploy and manage their microservices. These microservices run atop SCHIP, a platform built on top of Kubernetes.

In this article, we explain what SCHIP is, how we achieve high availability whilst operating 30 clusters serving 60k requests per second to more than 20 customers, and what we learned from this experience.

What’s SCHIP?

SCHIP is Adevinta’s multi-tenant Kubernetes distribution, extending vanilla Kubernetes by providing:

  • Role-Based Access Control (RBAC)
  • Observability
  • Managed DNS
  • Out-of-the-box TLS capabilities

We deploy SCHIP in several regions, mainly in Europe, depending on the tenants’ needs.

SCHIP extends Kubernetes with additional capabilities

How can we support highly-available workloads in multiple Kubernetes clusters?

That was the question we asked ourselves at the beginning of 2020, when we onboarded one of the most significant users of the Common Platform. Our focus was to reduce the blast radius of our clusters.

Why do we need highly available traffic?

High availability allows us to:

  • Ensure our tenants always have a reliable service running on our clusters, even when the underlying clusters are undergoing maintenance
  • Provide service redundancy on demand for tenants

We envision clusters as cattle: clusters should be able to scale out or in just like pods do, and we must be able to move our tenants’ microservices freely amongst clusters without impacting our tenants.

The benefits of multiple clusters

The first step towards high availability is to have the services deployed in more than one cluster, ideally in different regions, to prevent one degraded cluster from disrupting the service. Luckily, there are many tools available that can deploy services to multiple destinations.

Deploy services in more than 1 cluster

After the services are deployed to more than one cluster, we need to ensure that traffic is distributed to each cluster where the services are deployed. There are several ways to achieve this. In our case, we use DNS-based load balancing with the help of External-DNS.

What is External-DNS and why do we use it?

External-DNS is an operator running in the Kubernetes cluster that publishes DNS records for our cluster resources to a public DNS provider, for example AWS Route 53.

External-DNS can watch and create records from several kinds of resources: an Ingress object, a Service object, or even a custom resource defined by a CustomResourceDefinition (CRD).
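For illustration, an Ingress like the following (the hostname, namespace and backend names are made up) is enough for External-DNS to create a matching record in Route 53:

```yaml
# Illustrative only: hostname, namespace and service names are hypothetical
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service
  namespace: my-team
spec:
  rules:
    - host: my-service.schip.com      # External-DNS creates a DNS record for this host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 80
```

External-DNS resolves the record’s target from the Ingress status, typically the load balancer provisioned by the ingress controller.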

After setting up External-DNS with the DNS provider (AWS Route 53), we align the URL pattern of the services across the clusters.

The External-DNS in each cluster is now able to create a DNS record for each service.

With the same domain name, traffic is load balanced between clusters
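A minimal sketch of the External-DNS configuration each cluster might run, assuming the AWS provider and the Ingress source (flag values are illustrative):

```yaml
# Fragment of the External-DNS Deployment in each cluster (values are illustrative)
containers:
  - name: external-dns
    image: k8s.gcr.io/external-dns/external-dns:v0.12.0
    args:
      - --provider=aws               # publish records to Route 53
      - --source=ingress             # watch Ingress objects for hostnames
      - --domain-filter=schip.com    # only manage records in this zone
      - --policy=upsert-only         # never delete records this instance doesn't own
      - --txt-owner-id=cluster-a     # unique per cluster, so several clusters can share the zone
```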

At this point, the services may or may not already be highly available:

  • Highly available: in the normal scenario, traffic for ‘my-service.schip.com’ is load balanced between the clusters where the service is deployed.
  • Not yet highly available: if ‘my-service.schip.com’ is degraded in one of the clusters, or worse, one of the two clusters is completely compromised, traffic is still routed to the broken cluster.

In addition to those two scenarios, we operate more than a few clusters, and there are many occasions where we need to perform risky cluster maintenance. Although this might not impact our tenants’ services every time, it would be safer to gradually migrate the traffic to a cluster that is not undergoing maintenance.

So, from the above requirements, we need two additional capabilities:

  • The ability to route traffic only to clusters that have a healthy service
  • The ability to gradually migrate traffic away from a cluster under maintenance

Why do we use weighted records?

For the DNS server to meet those requirements, we use a feature of Route 53 called weighted records.

Weighted records

Weighted records allow us to specify the proportion of traffic to distribute to each destination of the same domain name, which lets us control the traffic as we see fit. For example, a weight of 100 represents a cluster working normally, and if a cluster goes into maintenance, we can set its weight to 0 so it receives no traffic.
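As a sketch of how such a weight can be attached to an individual record, External-DNS supports Route 53 weighted routing through its aws-weight and set-identifier annotations (the values below are made up):

```yaml
# Illustrative annotations on the Ingress in each cluster (values are hypothetical)
metadata:
  name: my-service
  annotations:
    external-dns.alpha.kubernetes.io/set-identifier: cluster-a   # distinguishes this cluster's record set
    external-dns.alpha.kubernetes.io/aws-weight: "100"           # 100 = normal traffic, 0 = drained
```

Route 53 then splits traffic for ‘my-service.schip.com’ across the record sets in proportion to their weights.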

However, configuring this for a large number of records may require human intervention and isn’t very scalable.

Creating a traffic-controller operator

To have the flexibility to manage traffic and solve any problems we may face, we created a Kubernetes Operator. It reconciles the DNS records to the proper state, taking into account whether the service is still healthy and whether a cluster is under maintenance.

How do we know which cluster is under maintenance?

To make the Kubernetes Operator aware of cluster maintenance, we created a table in AWS DynamoDB listing our clusters.

Table of clusters with their desired weight

This table can either be updated manually, when an SRE performs an operation that is likely to affect the stability of a given cluster, or automatically by relevant events, for example when a cluster shuts down.
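As a rough illustration (cluster names and attribute names are hypothetical), each item only needs the cluster name and its desired weight:

```yaml
# Hypothetical shape of the items in the DynamoDB table, one per cluster
- cluster: cluster-eu-west-1a
  desired_weight: 100   # healthy, receives normal traffic
- cluster: cluster-eu-west-1b
  desired_weight: 0     # under maintenance, traffic drained
```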

How does the traffic controller work?

The traffic controller watches this DynamoDB table for the desired weight configured at the cluster level and ensures that the weight of the services in the cluster is set accordingly. It watches for changes to Ingress objects in the cluster and creates or updates custom resources of kind DNSEndpoint that represent the DNS records we want.

As mentioned above, External-DNS can manage our DNS records not only from Ingress objects but also from a CRD source. In this case, we configure it to use DNSEndpoint resources as the source for our DNS records instead of Ingress objects. Usefully, this hides the weight interface from our tenants: when they look at their Ingresses, they see nothing about the weights at all.
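A hedged sketch of a DNSEndpoint the traffic controller might create (the target and identifiers are made up; the weight is carried as a provider-specific property that External-DNS turns into a Route 53 weighted record):

```yaml
# Illustrative DNSEndpoint produced by the traffic controller (values are hypothetical)
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: my-service
  namespace: my-team
spec:
  endpoints:
    - dnsName: my-service.schip.com
      recordType: CNAME
      recordTTL: 60
      targets:
        - abc123.eu-west-1.elb.amazonaws.com   # this cluster's ingress load balancer
      setIdentifier: cluster-a                 # one record set per cluster
      providerSpecific:
        - name: aws/weight
          value: "100"                         # taken from the desired weight in DynamoDB
```

External-DNS is then run with the CRD source (for example, --source=crd with --crd-source-apiversion=externaldns.k8s.io/v1alpha1 and --crd-source-kind=DNSEndpoint) so that tenants’ Ingress objects stay untouched.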

Weight is set per cluster according to the desired value

When we perform a full cluster rebuild or risky operations such as networking fixes, we can migrate the traffic to a safer cluster easily by changing the value in the DynamoDB table.

However, this is still not enough to fulfil all of our requirements: we also want traffic to be routed to other clusters while the service in one cluster is degraded.

Taking advantage of the flexibility we gained from developing our own operator, we enhanced the traffic controller to also watch the underlying pods of each service and verify whether any healthy pods remain behind the service. If there are none, we set the weight for the affected service in this cluster to 0, even though the cluster itself is healthy.

Despite the weight being 100 at the cluster level, it’s set to 0 if the service is degraded
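To illustrate the override, the DNSEndpoint on the degraded cluster would simply carry a zero weight for the affected service, even though the cluster-level desired weight is still 100 (values are again hypothetical):

```yaml
# Fragment of the DNSEndpoint for the degraded service on an otherwise healthy cluster
providerSpecific:
  - name: aws/weight
    value: "0"   # per-service override: drains traffic for this service only
```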

Now, the traffic of the service will be directed to a healthier cluster.

What did we learn?

  • This approach allows us to perform risky operations on our clusters, for example, full cluster rebuilds where we need to change every node in the cluster, with greater peace of mind.
  • We reduce the risk of impacting tenants’ traffic, and tenants don’t even notice the underlying changes to the infrastructure. This removes the need to plan and announce scheduled maintenance to our tenants.
  • Developing our own operator opens the door to opportunities for us to keep evolving the platform. We could hook it up to different event sources, allowing us to manage traffic per event automatically. For example, we could subscribe to a cluster deployment event with a cluster rebuild flag, then automatically switch the traffic to another cluster.
  • Having the ability to reduce the traffic weight for an unhealthy service is really useful, but it’s not the only thing we need for a highly available service. We also need the development team to implement robust readiness and liveness probes so the operator can control the traffic more effectively.

Thanks for reading and I hope this article provides useful insight to those facing similar challenges.
