EXPEDIA GROUP TECHNOLOGY — ENGINEERING

An Exercise in Software Resiliency on Kubernetes

Describing the challenges involved in building resilient software systems

Sasidhar Sekar
Expedia Group Technology


At Expedia Group™️, we strive to make resilience to failures a prime objective. As part of this, we have been building a new compute platform that consolidates the best practices learnt over the years and offers application owners a reliable, feature-rich, easy-to-use out-of-the-box platform where they can deploy and run their applications.

This platform is based on Kubernetes, and in the last few months, our Reliability Engineering team has been working on identifying and remediating reliability concerns in these Kubernetes clusters using Chaos Engineering.

In this blog post, I wanted to get into the details of the challenges in this space, how we approached them, and what we did to mitigate them.

Background

Image of a piece of paper, pen and eraser with a question mark drawn on the paper.
Photo by Mark Fletcher-Brown on Unsplash

Our Kubernetes clusters use GitOps/FluxCD to manage our infrastructure and applications. The benefits of doing so, as stated in the GitOps docs:

… so that the whole system is described declaratively and version controlled (most likely in a Git repository), and having an automated process that ensures that the deployed environment matches the state specified in a repository

One of the key components of FluxCD is the Source Controller. This is the component that pulls artefacts from source repositories (Git/Helm, etc.) for the other tools in GitOps to work with.
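As an illustration, a source definition that the Source Controller reconciles looks something like this (a generic sample rather than one of our actual definitions; the exact API version depends on the Flux release):

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 1m                                  # how often to poll the source
  url: https://github.com/stefanprodan/podinfo  # Git repository to pull from
  ref:
    branch: master

The Source Controller polls the repository at the configured interval, packages the contents as an artefact, and makes it available for the other controllers to apply to the cluster.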

Any failure in the Source Controller’s ability to pull artefacts could severely impact the deployment/upgrade workflow in these Kubernetes clusters. That means if the source controller is down in a cluster, new applications can no longer be deployed onto that cluster, and existing applications can no longer be upgraded. Hence, the focus of the chaos experiment was:

  1. to understand the source controller’s response to any connectivity failures* to the source repository
  2. to evaluate whether that response is acceptable, and
  3. to work on mitigations, if it is not

* Connectivity failures are not the only failures that can impact the retrieval of artefacts, but they are among the most likely to occur in production. So, this was chosen as the first variable in my chaos experiment.

Chaos experiment

Image of test tubes and a conical flask
Photo by Alex Kondratiev on Unsplash

Initial setup

The initial setup of the Source Controller looked like this:

Initial setup of the source-controller showing two pods running in a leader-follower mode, with only the leader connecting to the container registry
Figure 1: Initial Setup

There were 2 pods running in a leader-follower model. The container in the leader pod was in the “Ready” state and the container in the follower was marked “Not Ready”. This meant that any artefact retrieval was done only via the leader pod.

Hypothesis

One of the first steps in any chaos experiment is to hypothesise about the expected response to a given failure. For this particular experiment, we had the following hypothesis:

Table 1: Expected Behaviour

Failure simulation

There are a few ways to simulate a connectivity failure — You could:

  • block ingress/egress/all network traffic to/from the host where the leader is running
  • block ingress/egress/all network traffic to/from the leader pod

But these approaches tend to have a wider blast radius, meaning they affect connectivity not only to the container registry but also to/from other network dependencies. Hence, we chose an approach that simulates the connectivity failure via a DNS disruption.

Introducing a delay on the DNS queries, so that DNS resolution times out and the controller can no longer connect to the container registry
Figure 2: Connectivity Disruption

Tools used

We used Pumba to simulate the connectivity failure during this chaos experiment.

$ pumba_linux_amd64 --log-level=debug netem --tc-image gaiadocker/iproute2:latest --pull-image=false --duration 60m --target 172.20.0.10 delay --time 30000 "re2:^k8s_manager_source-controller"
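In the command above, the regular expression matches the source-controller’s manager container by name, and 172.20.0.10 is the cluster DNS service address in this setup. The net effect is that, for 60 minutes, DNS traffic from that container is delayed by 30 seconds, which is more than enough for name resolution to time out and for connections to the container registry to fail.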

Experiment result and findings

Table 2: Actual Behaviour

This was not expected! When you see 2 replicas running in a leader-follower model, it is reasonable to expect that when the leader fails, the follower will automatically be elected as the leader and the service will continue uninterrupted.

But, why didn’t this happen?

Result analysis

Image showing a white board with sticky notes
Photo by Jason Goodman on Unsplash

An excerpt from the Kubernetes documentation on simple leader election:

Once the election is won, the leader continually “heartbeats” to renew their position as the leader, and the other candidates periodically make new attempts to become the leader. This ensures that a new leader is identified quickly, if the current leader fails for some reason.

The important part to focus on here is “if the current leader fails”.

So, when does the leader fail?

The leader fails when it no longer sends out a “heartbeat”, to renew its position.

Does the leader fail during the connectivity failure to the upstream container registry?

No.

Here’s what happens during the chaos experiment:

The leader attempts to resolve the DNS name to connect to the container registry. DNS resolution times out, making connectivity impossible. The leader retries again, and again, and again … While this is happening, the leader replica continues to send “heartbeats” in order to retain its position as the leader
Figure 3: Root Cause of Failure to Recover

The primary reason for the service never recovering from the connectivity failure is that the connectivity failure does not impact the leader election process.

Risk mitigation/solution design

Image of a mahjong (strategy game) table
Photo by Albert Hu on Unsplash

Now that we knew what was happening, we needed to find a way to mitigate it.

Desired behaviour = Service recovers quickly from the connectivity failure. Actual behaviour = Service never recovers from the connectivity failure. What needs to be done to bring the actual behaviour in line with the desired?
Figure 4: Desired vs Actual

The solution we thought of was to use the liveness probes to detect and recover from the upstream connectivity failure. An excerpt from Kubernetes documentation on liveness probes:

Many applications running for long periods of time eventually transition to broken states, and cannot recover except by being restarted. Kubernetes provides liveness probes to detect and remedy such situations.

Reading through the above, it felt like the liveness probe was tailor-made to solve problems like this, where the service is unable to recover on its own. So, it seemed like a good idea to detect the connectivity failure via a liveness check and restart the leader’s container when the check fails.

Kubelet checks the health of the container via the configured liveness probe. If the check fails, it restarts the container
Figure 5: Liveness Probe

It should be noted that the service already had a liveness probe configured.

Snippet 1: Current Liveness Probe Configuration

But it was configured to check only the health of the container: whether the container was running and responding to health check queries. It did not check whether the service was fully functional, i.e. serving its intended purpose.
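A probe of that kind, pointed only at the container’s “/healthz” endpoint, looks roughly like this (illustrative port name and timings, not our exact configuration):

livenessProbe:
  httpGet:
    path: /healthz   # container health endpoint only
    port: healthz    # illustrative named container port
  periodSeconds: 10
  failureThreshold: 3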

Current liveness configuration checks only for the container health. It does not check for the service health. Hence, even when the service endpoint is affected by an upstream connectivity failure, because the container’s healthcheck endpoint is not affected, the liveness probe continues to succeed and hence the kubelet takes no action
Figure 6: Why the current liveness check did not help the pod recover

Hence, even when the service endpoint (“/”) is affected by an upstream connectivity failure, because the container’s healthcheck endpoint (“/healthz”) is not affected, the liveness probe continues to succeed, and the Kubelet takes no action.

In order to make the Kubelet aware of upstream connectivity failure, we decided to change the liveness probe to check the service health instead of just the container health.

Snippet 2: New Liveness Probe Configuration
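The essence of the change is simply to point the probe at the service endpoint instead of the container health endpoint, along these lines (illustrative values, not our exact configuration):

livenessProbe:
  httpGet:
    path: /       # the service endpoint, so upstream failures surface in the check
    port: http    # illustrative named container port
  periodSeconds: 10
  failureThreshold: 3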

The effect of this change is that the liveness probe fails on upstream connectivity failure, which leads to the Kubelet restarting the container. While the container is being restarted, it fails to send heartbeats to renew its position as leader. So, the follower becomes the new leader and the service recovers successfully.

liveness probe fails on upstream connectivity failure, which leads to the Kubelet restarting the container. While the container is being restarted, it fails to send heartbeats to renew its position as leader. So, the follower becomes the leader now and service recovers successfully
Figure 7: Successful Recovery

Aftermath

Image of a water balloon bursting
Photo by Pascal Meier on Unsplash

Sadly, the job didn’t end here. The change was tested in the dev cluster, all was A-OK, and it was promoted to the test clusters. This was when we were notified that some of the clusters were having trouble with this component.

My first thought was: “Why only some clusters?” If there was a problem, it had to show up in all the clusters. So, why did it show up in only some!?

The answer lay in what happens during a pod startup in Kubernetes.

There are two major phases in a pod startup — Container Start and Service Initialisation. Assuming there is only one container in the Pod, as soon as the container is up and running, the “Container Start” phase is complete. Post this, the configured in-container actions are performed in order to initialise the service.
Figure 8: Pod Startup and Kubernetes Probes

I would divide the Pod startup into two major phases:

  1. Container Start
  2. Service Initialisation

Assuming there is only one container in the Pod, as soon as the container is up and running, the “Container Start” phase is complete. Post this, the configured in-container actions (warmup scripts, loading config, connecting to dependencies, etc.) are performed in order to initialise the service.

The critical point to note in Figure 8 is that, with the default configuration for the Kubernetes probes, Kubelet can start the liveness and readiness checks as soon as the “Container Start” phase is complete.

Kubelet can start the liveness and readiness checks as soon as the “Container Start” phase is complete

Why did this matter to Source Controller?

Before we made the change, the Source Controller’s liveness probe was configured to only check the container health (Figure 6). As soon as the “Container Start” phase was complete, the container’s “/healthz” endpoint became active and the liveness checks succeeded immediately.

Note: The container will still not be “Ready” to take traffic until the “Service Initialisation” phase is complete.

But we changed the liveness probe to check the service health instead of the container health, so that the Kubelet could detect issues with the service and recover by restarting the container.

This meant that when the liveness checks were fired as soon as the “Container Start” phase was complete, the service was still being initialised. So, the liveness checks, which now checked the service health, failed. In response to the failed liveness check, the Kubelet restarted the container. This kept happening over and over indefinitely*.

Before, the liveness probe was configured to check for container health. So, even when the checks were triggered before service initialisation, the check passed and the container started up successfully. Post startup though, when the service health degraded, because the probe checked only the container health, it could not detect the degradation and hence no action was taken. After the change, when the checks were triggered before service initialisation, container got restarted repeatedly
Figure 9: Pod Startup — Before and After

* In reality, with the default probe configuration, this can recover if the “Service Initialisation” phase ever completes within approximately 30 seconds (defaults: failureThreshold: 3, periodSeconds: 10, timeoutSeconds: 1).
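The 30-second window is simply failureThreshold × periodSeconds: three consecutive failed checks, spaced ten seconds apart, before the Kubelet restarts the container. A service whose initialisation consistently takes longer than that never gets the chance to come up. More on this in the next section.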

What next?

Image of a “Road ahead” sign
Photo by Joe Parkin on Unsplash

Among Kubernetes professionals and enthusiasts, the opinion on the idea of using liveness probes to detect service health is divided — possibly because of the issues described in this post so far.

My thoughts — The reason why we did all of this is to solve one fundamental problem.

How can we make this service resilient to an upstream failure!?

If using a liveness probe to check service health is going to allow us to detect upstream failures faster and help the service recover automatically, we felt that it was an option definitely worth considering.

Nevertheless, we still had a problem at hand: how do we work around the problem of indefinite container restarts? There were a couple of options.

  1. Startup probes: Kubernetes supports the use of “Startup probes” to know when a container has started and the service has initialised. When configured to check for service health, a startup probe disables the liveness and readiness checks until the service is up and running. This ensures that liveness probes configured to check for service health won’t interfere with the application startup. Post startup, those liveness probes continue to check the service health and help the Kubelet restart the container when the checks fail, allowing for automatic detection of, and recovery from, service failure
  2. Configure the liveness probe for a slow startup: Alternatively, you can increase the value of failureThreshold and/or timeoutSeconds in the liveness probe configuration, so that the liveness probe does not give up and restart the container too quickly (with the defaults, after roughly 30 seconds of failed checks) while the service is still initialising

… the preferred solution is to make use of the “Startup probes”

Option 2 above might help during startup, but post-startup, an increased failureThreshold and/or timeoutSeconds configuration will slow down the detection of failures and, correspondingly, recovery. Hence, the preferred solution is to make use of the “Startup probes”.

The liveness check no longer starts until the service health is verified by the startup probe. This avoids liveness check interfering with the startup. Post-startup, the liveness check is able to verify the service health and allow the Kubelet to take action (restart container), when the service health is degraded
Figure 10: Solution with Startup Probes

This is what we will be doing next: we will be configuring the startup probes and using the liveness probes to react to service health degradation.
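A minimal sketch of what that combination might look like, with illustrative values rather than our final configuration, is shown below. The startup probe gates the liveness checks until the service endpoint responds; after that, the liveness probe takes over with tighter thresholds.

startupProbe:
  httpGet:
    path: /            # service endpoint; liveness/readiness checks are held back until this succeeds
    port: http         # illustrative named container port
  periodSeconds: 10
  failureThreshold: 30 # allows up to ~300 seconds for service initialisation
livenessProbe:
  httpGet:
    path: /            # same service endpoint, checked only after startup succeeds
    port: http
  periodSeconds: 10
  failureThreshold: 3  # post-startup, restart after ~30 seconds of failed checks

The same check serves both purposes: generous thresholds while the service initialises, aggressive thresholds once it is up.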

Learnings

Image of someone taking notes
Photo by Glenn Carstens-Peters on Unsplash

In this blog post, I have tried to explain:

  1. The problem we were trying to solve
  2. The solution implemented
  3. How the solution created a new and bigger problem of its own
  4. What can help us work around the issue and achieve our goals

Looking back, I have identified these as the key learnings:

  • You cannot judge a book by its cover: The Source Controller had multiple replicas, in a leader-follower configuration. It would have been easy to assume that the component is resilient to failures — when the leader is in trouble, the follower takes over. But, that’s not what happened in reality.
    Learning: However sensible they might seem, assumptions can never replace verification. If you have expectations for a component, do not just assume it’ll meet the expectations. Always verify that it meets them.
  • It is not all about your application: If your applications were running in a vacuum, you only have one thing to consider and act upon. But, this is rarely the case, particularly in a large enterprise like Expedia Group. Applications are usually a part of a complex system, running on and alongside a number of other components.
    Learning: If you are an application owner, even though you do not need to know everything about the platform and/or dependencies, it helps to understand the ecosystem in which your application is running — Kubernetes, in my case
  • If it does not work everywhere, it works nowhere: We verified that the changes worked successfully in “dev” clusters. We verified that the changes worked even in some of the “test” clusters. We could have promoted this to production, but thankfully we were alerted to the failures in certain “test” clusters, avoiding a certain incident in production.
    Learning: Even though all of our Kubernetes clusters are similar and built using the same automation, they are rarely the same. This is mainly due to the state they hold: the other applications running in these clusters, the load they are experiencing at a given point in time, and so on. Hence, the issue I described in this post was uncovered only in some clusters. So, it is important to remember: Similar ≠ Same
  • If you can’t see it, you can’t know about it: It is near impossible to exhaustively verify the expected behaviour of all the components across all the environments. Yet, it is not OK to let changes through to production and simply hope for the best. The solution is somewhere in between.
    Learning: Irrespective of whether you tested something to be working or not, always have observability on your components across all environments

Learn more about technology at Expedia Group

Credits: Nitin Mistry, Andrew Woolterton, Trevor Bongomin, Timothy Ehlers, Kiichiro Okano
