EXPEDIA GROUP TECHNOLOGY — ENGINEERING
An Exercise in Software Resiliency on Kubernetes
Describing the challenges involved in building resilient software systems
At Expedia Group™️, we strive to make resilience to failures a prime objective. As part of this, we have been building a new on-road compute platform that consolidates the best practices learnt over the years and offers application owners a reliable, feature-rich, easy-to-use, out-of-the-box platform where they can deploy and run their applications.
This platform is based on Kubernetes, and in the last few months, our Reliability Engineering team has been working on identifying and remediating reliability concerns in these Kubernetes clusters using Chaos Engineering.
In this blog post, I want to get into the details of the challenges in this space, how we approached them, and what we did to mitigate them.
Background
Our Kubernetes clusters use GitOps/FluxCD to manage our infrastructure and applications. The benefits of doing so, as stated in the GitOps docs:
… so that whole system is described declaratively and version controlled (most likely in a Git repository), and having an automated process that ensures that the deployed environment matches the state specified in a repository
One of the key components of FluxCD is the Source Controller. This is the component that pulls artefacts from source repositories (Git/Helm, etc.) for the other tools in GitOps to work with.
Any failures in Source Controller’s ability to pull artefacts could severely impact the deployment/upgrade workflow in these Kubernetes clusters. That means if the source controller is down in a cluster, new applications can no longer be deployed onto these clusters, and existing applications can no longer be upgraded. Hence, the focus of the chaos experiment was:
- to understand the source controller’s response to any connectivity failures* to the source repository
- evaluate if the response is acceptable and
- work on mitigations, if it is not
* Connectivity failures are not the only ones that can impact the retrieval of artefacts but are one of the most likely failures to expect in production. So, this was chosen as the first variable in my chaos experiment
Chaos experiment
Initial setup
The initial setup of the Source Controller looked like this:
There were 2 pods running in a leader-follower model. The container in the leader pod was in the “Ready” state and the container in the follower was marked “Not Ready”. This meant that any artefact retrieval was done only via the leader pod.
Hypothesis
One of the first steps in any chaos experiment is to hypothesise about the expected response to a given failure. For this particular experiment, our hypothesis was: if the leader loses connectivity to the source repository, the follower will take over as leader and artefact retrieval will continue uninterrupted.
Failure simulation
There are a few ways to simulate a connectivity failure — You could:
- block ingress/egress/all network traffic to/from the host where the leader is running
- block ingress/egress/all network traffic to/from the leader pod
But these approaches tend to have a wider blast radius, meaning — they affect connectivity not only to the container registry but also to/from other network dependencies. Hence, we chose an approach that simulates the connectivity failure via a DNS disruption.
Tools used
We used Pumba to simulate the connectivity failure during this chaos experiment.
# Add a 30-second delay, for 60 minutes, to traffic between the
# source-controller container and 172.20.0.10 (the cluster DNS service IP),
# simulating a DNS/connectivity failure for that container only
$ pumba_linux_amd64 --log-level=debug \
    netem --tc-image gaiadocker/iproute2:latest --pull-image=false \
    --duration 60m \
    --target 172.20.0.10 \
    delay --time 30000 \
    "re2:^k8s_manager_source-controller"
Experiment result and findings
The result: artefact retrieval stopped during the simulated connectivity failure and never recovered, even though a follower pod was available. This was not expected! When you see 2 replicas running in a leader-follower model, it is reasonable to expect that when the leader fails, the follower will automatically be elected as the leader and the service will continue uninterrupted.
But, why didn’t this happen?
Result analysis
An excerpt from the Kubernetes documentation on simple leader election:
Once the election is won, the leader continually “heartbeats” to renew their position as the leader, and the other candidates periodically make new attempts to become the leader. This ensures that a new leader is identified quickly, if the current leader fails for some reason.
The important part to focus on here is “if the current leader fails”.
So, when does the leader fail?
The leader fails when it no longer sends out a “heartbeat” to renew its position.
Does the leader fail during the connectivity failure to the upstream container registry?
No.
Here’s what happens during the chaos experiment: the leader’s heartbeat is a lease renewal against the Kubernetes API, and it does not depend on connectivity to the upstream repository. So, while artefact pulls fail, the leader keeps renewing its position, and the follower never takes over. The primary reason the service never recovers from the connectivity failure is that the connectivity failure does not impact the leader election process.
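To see why, it helps to model the election itself. Below is a toy sketch (names, timings and logic are assumptions for illustration, not Flux’s actual code) showing that leadership changes only when lease renewals to the API server stop; upstream registry reachability never enters the decision:

```python
# Toy model of Kubernetes-style leader election (illustrative only).
# The leader's "heartbeat" is a lease renewal against the API server;
# reachability of the upstream registry is deliberately never consulted,
# which is exactly why an upstream connectivity failure causes no failover.

LEASE_DURATION = 3  # ticks a lease stays valid without renewal

def run(ticks, leader_can_reach_api, leader_can_reach_registry):
    holder = "leader-pod"
    expires = LEASE_DURATION
    for t in range(ticks):
        # Heartbeat: renew the lease via the API server. Note that
        # leader_can_reach_registry plays no part in this decision.
        if holder == "leader-pod" and leader_can_reach_api:
            expires = t + LEASE_DURATION
        if t >= expires:
            # Lease expired: the follower wins the next election
            # (and, once leader, is assumed healthy and keeps the lease).
            holder = "follower-pod"
            expires = t + LEASE_DURATION
    return holder

# Registry unreachable but API reachable: no failover ever happens.
print(run(60, leader_can_reach_api=True, leader_can_reach_registry=False))   # leader-pod
# Leader cannot reach the API (a true leader failure): follower takes over.
print(run(60, leader_can_reach_api=False, leader_can_reach_registry=True))   # follower-pod
```

The second call is the only scenario the election mechanism is designed to handle, which matches what we observed in the experiment.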
Risk mitigation/solution design
Now that we knew what was happening, we needed a way to mitigate it.
The solution we thought of was to use the liveness probes to detect and recover from the upstream connectivity failure. An excerpt from Kubernetes documentation on liveness probes:
Many applications running for long periods of time eventually transition to broken states, and cannot recover except by being restarted. Kubernetes provides liveness probes to detect and remedy such situations.
Reading through the above, it felt like the liveness probe was tailor-made for problems like this, where the service is unable to recover on its own. So, it seemed like a good idea to detect the connectivity failure via a liveness check and restart the leader’s container when the check fails.
It is to be noted that the service already had a liveness probe configured. But it checked only the health of the container — whether the container was running and responding to health check queries — not whether the service was fully functional, i.e. serving its intended purpose.
Hence, even when the service endpoint (“/”) is affected by an upstream connectivity failure, because the container’s healthcheck endpoint (“/healthz”) is not affected, the liveness probe continues to succeed, and the Kubelet takes no action.
In order to make the Kubelet aware of upstream connectivity failure, we decided to change the liveness probe to check the service health instead of just the container health.
The effect of this change is that the liveness probe fails on upstream connectivity failure, which leads to the Kubelet restarting the container. While the container is being restarted, it fails to send heartbeats to renew its position as leader. So, the follower becomes the leader now and service recovers successfully.
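Concretely, the change amounts to a sketch like the following (paths and port names here are illustrative assumptions, not the exact source-controller manifest):

```yaml
# Before: probe hits the container's own health endpoint, which keeps
# answering even when upstream connectivity is broken.
livenessProbe:
  httpGet:
    path: /healthz
    port: healthz
---
# After: probe hits the service endpoint, so an upstream connectivity
# failure surfaces as a failed liveness check and triggers a restart.
livenessProbe:
  httpGet:
    path: /
    port: http
```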
Aftermath
Sadly, the job didn’t end there. The change was tested in the dev cluster — everything was A-OK — and promoted to the test clusters. This was when we were notified that some of the clusters were having trouble with this component.
My first thought was: “Why only some clusters?” If there was a problem, it had to show up in all the clusters. So, why did it show up in only some!?
The answer lay in what happens during a pod startup in Kubernetes.
I would divide the Pod startup into two major phases:
- Container Start
- Service Initialisation
Assuming there is only one container in the Pod, the “Container Start” phase is complete as soon as the container is up and running. After this, the configured in-container actions (warm-up scripts, loading config, connecting to dependencies, etc.) are performed in order to initialise the service.
The critical point to note in Figure 8 is that, with the default configuration for the Kubernetes probes, Kubelet can start the liveness and readiness checks as soon as the “Container Start” phase is complete.
Why did this matter to Source Controller?
Before we made the change, the Source Controller’s liveness probe was configured to only check the container health (Figure 6). As soon as the “Container Start” phase was complete, the container’s “/healthz” endpoint became active and the liveness checks succeeded immediately.
Note: The container will still not be “Ready” to take traffic until the “Service Initialisation” phase is complete.
But we changed the liveness probe to check the service health instead of the container health, so that the Kubelet could detect issues with the service and recover by restarting the container.
This meant that when the liveness checks were fired as soon as the “Container Start” phase was complete, the service was still being initialised. So, the liveness checks that now check the service health, failed. In response to the failed liveness check, the Kubelet restarted the container. This kept happening over and over indefinitely*.
* In reality, with the default probe configuration, this can recover if the “Service Initialisation” phase ever completes within approximately 30 seconds (defaults: failureThreshold: 3 × periodSeconds: 10 = 30 seconds). More on this in the next section.
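For reference, these are the Kubernetes probe timing defaults that produce that roughly 30-second window (the values below are Kubernetes’ documented defaults, not taken from our manifest):

```yaml
livenessProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 0   # default: checks start as soon as the container starts
  periodSeconds: 10        # default: probe every 10 seconds
  timeoutSeconds: 1        # default: each probe attempt times out after 1 second
  failureThreshold: 3      # default: 3 consecutive failures -> restart
  # worst case tolerated: failureThreshold x periodSeconds = ~30 seconds
```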
What next?
Among Kubernetes professionals and enthusiasts, the opinion on the idea of using liveness probes to detect service health is divided — possibly because of the issues described in this post so far.
My thoughts: the reason we did all of this was to solve one fundamental problem.
How can we make this service resilient to an upstream failure!?
If using a liveness probe to check service health is going to allow us to detect upstream failures faster and help the service recover automatically, we felt that it was an option definitely worth considering.
Nevertheless, we still had a problem at hand — how do we work around the problem of indefinite container restarts? There were a couple of options.
- Startup probes: Kubernetes supports the use of “Startup probes” to know when a container has started and the service initialised. When configured to check for service health, this disables liveness and readiness checks until the service is up and running. This would ensure that the liveness probes configured to check for service health, won’t interfere with the application startup. Post startup though, these liveness probes will continue to check for service health and help Kubelet restart the container when the checks fail — allowing for automatic detection and recovery on service failure
- Configure the liveness probe for a slow startup: Alternatively, you can increase the value of failureThreshold and/or periodSeconds in the liveness probe configuration. This ensures that the liveness probe does not give up and restart the container before initialisation completes (by default, it gives up after roughly 30 seconds: failureThreshold: 3 × periodSeconds: 10)
Option 2 above might help during startup, but post-startup, an increased failureThreshold and/or periodSeconds configuration will slow down the detection of failures and, correspondingly, recovery. Hence, the preferred solution is to make use of the “Startup probes”.
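A sketch of the startup-probe approach (endpoint and port names are illustrative assumptions): the startup probe suspends liveness and readiness checks until the service reports healthy, after which the liveness probe takes over failure detection.

```yaml
# Gives the service up to failureThreshold x periodSeconds = 300 seconds
# to initialise; liveness/readiness checks are held off until it succeeds.
startupProbe:
  httpGet:
    path: /
    port: http
  periodSeconds: 10
  failureThreshold: 30

# Once startup succeeds, this detects service-health degradation quickly.
livenessProbe:
  httpGet:
    path: /
    port: http
  periodSeconds: 10
  failureThreshold: 3
```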
This is what we will be doing next — We will be configuring the startup probes and using the liveness probes to react to service health degradation.
Learnings
In this blog post, I have tried to explain:
- The problem we were trying to solve
- The solution implemented
- How the solution created a new, and bigger problem of its own
- What helped us work around the issue and achieve our goals
Looking back, I have identified these as the key learnings:
- You cannot judge a book by its cover: The Source Controller had multiple replicas, in a leader-follower configuration. It would have been easy to assume that the component is resilient to failures — when the leader is in trouble, the follower takes over. But, that’s not what happened in reality.
Learning: However sensible they might seem, assumptions can never replace verification. If you have expectations for a component, do not just assume it’ll meet them. Always verify that it does.
- It is not all about your application: If your application ran in a vacuum, you would have only one thing to consider and act upon. But this is rarely the case, particularly in a large enterprise like Expedia Group. Applications are usually part of a complex system, running on and alongside a number of other components.
Learning: If you are an application owner, even though you do not need to know everything about the platform and/or dependencies, it helps to understand the ecosystem in which your application is running — Kubernetes, in my case.
- If it does not work everywhere, it works nowhere: We verified that the changes worked successfully in “dev” clusters. We verified that the changes worked even in some of the “test” clusters. We could have promoted this to production but thankfully got alerted to the failures in certain “test” clusters, avoiding a certain incident in production.
Learning: Even though all of our Kubernetes clusters were similar and built using the same automation, they are rarely the same. This is mainly due to the state they hold — the other applications running in these clusters, the load they are experiencing at a given point in time, etc. Hence, the issue I described in this post was uncovered only in some clusters. So, it is important to remember: Similar ≠ Same.
- If you can’t see it, you can’t know about it: It is near impossible to exhaustively verify the expected behaviour of all the components across all the environments. Yet, it is not OK to let changes through to production on hope. The solution is somewhere in between.
Learning: Irrespective of whether you tested something to be working or not, always have observability on your components across all environments.
Learn more about technology at Expedia Group
Credits: Nitin Mistry, Andrew Woolterton, Trevor Bongomin, Timothy Ehlers, Kiichiro Okano