Resilience at Hotels.com (Part 1 — Kube-Monkey)

Nikos Katirtzis
The Hotels.com Technology Blog
May 13, 2019


On the right side of the picture you can see a monkey… and kube… keep reading.

When we wrote our first blog post related to resilience engineering we weren't planning to start a series of posts in that area, mainly for two reasons:

  • We didn't have much exposure to resilience testing and resilience engineering (we still don't have enough).
  • We were still at an early stage of our cloud migration from our data centres to AWS with Kubernetes (we no longer are).

Since then a lot has happened: many services are now running in production on AWS, and we are collaborating with the other Expedia Group brands on resilience engineering, since we are all tackling similar issues.

That said, from now on you’ll hear more about what we do at Hotels.com in that area!

This blog post focuses on Kube-Monkey, an implementation of Netflix's Chaos Monkey for Kubernetes clusters, which we started using a few months ago.

Why we needed Kube-Monkey

During the last year many teams have migrated services from our data centre to our Kubernetes platform in AWS. Where applications previously ran on fixed hosts for their lifetime, there are two new types of change we must be prepared for:

  1. Kubernetes dynamically manages the lifecycle of our applications.
  2. The EC2 instances that underlie our platform are ephemeral and may fail or be replaced at any time.

Development teams should make sure that their app is a good citizen and can handle these changes. This means an app should:

  1. Provide a good quality of service quickly, as new instances may be created at any time.
  2. Be resilient to changes in its dependencies that come as a result of the flexibility of the new platform.

The cost of unknown unknowns

In some cases it is possible to prepare ahead of time for our applications failing. At the simplest level we do this with exception handling and circuit breaking, which can be verified with unit tests. We can also simulate more complex failures with integration tests that inject failures at the TCP level, for example with ToxiProxy (you can read more about how we do this in our blog post).

This is great for cases that we know about already or suspect might happen. But with distributed systems there may be cascading failures, or infrastructure issues outside the scope of our testing pipeline. How can we be resilient to these unknown unknowns?

The easiest way is to assert that our services won't fail and wait for an issue to happen in production… no?

Perhaps a better approach is to bring forward these kinds of errors in a controlled way so we can fix them as soon as possible.

Exposing issues using Kube-Monkey

At Hotels.com we have deployed Kube-Monkey into our staging and production Kubernetes clusters to help expose these sorts of problems.

At the start of every weekday, Kube-Monkey creates a schedule of Kubernetes deployments whose pods it will terminate throughout the day. The daily schedule for production is sent to an internal distribution list to which our developers can subscribe.
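
Kube-Monkey itself is configured with a TOML file that controls when the daily schedule is generated and the window in which terminations happen. The sketch below shows how such a configuration might be mounted via a ConfigMap, using the keys documented in the Kube-Monkey README; the names and values are illustrative rather than our actual settings.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: kube-monkey-config            # illustrative name
      namespace: kube-monkey              # illustrative namespace
    data:
      config.toml: |
        [kubemonkey]
        dry_run = false                   # true logs the pods it would kill without killing them
        run_hour = 8                      # hour at which the daily schedule is generated
        start_hour = 10                   # start of the termination window
        end_hour = 16                     # end of the termination window
        time_zone = "Europe/London"       # illustrative time zone
        blacklisted_namespaces = ["kube-system"]   # namespaces that must never be attacked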

At the scheduled times, Kube-Monkey selects a number of pods from each scheduled deployment and terminates them. Targeting is based on labels and is currently opt-in. Kubernetes then recreates the missing pods, and as this happens we can observe how the app and its dependencies behave while it runs at reduced capacity.
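
The label-based targeting follows the convention in the Kube-Monkey README: opted-in deployments carry a kube-monkey/enabled label, an identifier that ties pods back to their deployment, a mean time between "failures" in days, and a kill mode/value pair that controls how many pods are terminated per attack. A minimal sketch, with an illustrative deployment name and values, looks like this:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-service               # illustrative name
      labels:
        kube-monkey/enabled: enabled      # opt this deployment in
        kube-monkey/identifier: example-service
        kube-monkey/mtbf: '2'             # on average, attack every 2 days
        kube-monkey/kill-mode: fixed      # terminate a fixed number of pods
        kube-monkey/kill-value: '1'       # ...one pod per attack
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: example-service
      template:
        metadata:
          labels:
            app: example-service
            # the same Kube-Monkey labels are repeated on the pod template
            kube-monkey/enabled: enabled
            kube-monkey/identifier: example-service
            kube-monkey/mtbf: '2'
            kube-monkey/kill-mode: fixed
            kube-monkey/kill-value: '1'
        spec:
          containers:
          - name: example-service
            image: example-service:latest # illustrative image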

Many apps are using this in our staging environment, and a few are trialling it in production. Running pod terminations against these services in production has exposed a few issues, both with regard to being a good citizen and with resilience to external factors.

Response times were slower for new instances even after they had been marked as able to take traffic and had been through a warm-up routine.
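
For context, "able to take traffic" is signalled in Kubernetes by a readiness probe along the lines of the sketch below (the endpoint, port and timings are illustrative rather than our actual configuration); the point is that a pod can pass this check and still be serving from cold caches and an un-warmed JVM.

    readinessProbe:
      httpGet:
        path: /health             # illustrative health-check endpoint
        port: 8080                # illustrative port
      initialDelaySeconds: 20     # give the app time to start before probing
      periodSeconds: 5            # probe every 5 seconds
      failureThreshold: 3         # mark the pod not-ready after 3 consecutive failures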

Traffic was observed to be unevenly spread across the nodes of a service's cluster due to persistent connections.

[Figures: traffic evenly spread across the nodes of the service's cluster (expected) vs. traffic unevenly spread across the nodes (reality).]

Another issue we found was that the grace period after pod shutdown, which allows for connection draining, was shorter than the default Java DNS cache TTL of 30 seconds. This meant that requests were still being sent to pods that had already been terminated.
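
Two mitigations that can help here, sketched below with illustrative values rather than our actual configuration, are making the termination grace period comfortably longer than any client-side caching and holding the pod open briefly with a preStop hook so that cached addresses have time to expire; for JVM clients, lowering the networkaddress.cache.ttl security property below its 30-second default helps for the same reason.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-service                 # illustrative name
    spec:
      selector:
        matchLabels:
          app: example-service
      template:
        metadata:
          labels:
            app: example-service
        spec:
          # Illustrative value: the grace period should exceed client-side DNS/connection caching.
          terminationGracePeriodSeconds: 60
          containers:
          - name: example-service
            image: example-service:latest   # illustrative image
            lifecycle:
              preStop:
                exec:
                  # Keep the pod serving briefly so clients with cached addresses can drain off.
                  command: ["sh", "-c", "sleep 30"]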

None of these things are easily identifiable without running in a real Kubernetes cluster, but do the tests really need to run in production? Why not staging? While we can strive to make the environments as similar as possible, even with shadow traffic it is not possible to completely replicate production, and it is almost certainly not cost-effective, as we would need to replicate the entire infrastructure. Provided we have a way to cancel them, controlled experiments in production are the best way to verify our resilience capabilities.

How do our apps opt in

Kube-Monkey is being trialled in production, with the intention of making it compulsory for all new apps. We've tried to keep the opt-in mechanism as simple as possible: apps can set a single property in their Helm charts to enable Kube-Monkey, but they also have the option to control the attack by specifying the time between attacks or the number/percentage of pods to be attacked.
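
As a rough illustration, the opt-in might look like the values below in a service's Helm chart, with the chart templating them into the Kube-Monkey labels; the property names here are hypothetical and may differ from our actual chart keys.

    # values.yaml (hypothetical property names)
    kubeMonkey:
      enabled: true          # the single opt-in switch
      mtbf: 2                # optional: average number of days between attacks
      killMode: fixed        # optional: how pods are selected for termination
      killValue: "1"         # optional: number (or percentage) of pods per attack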

Next Steps

Kube-Monkey is limited to a single sort of attack: terminating a fixed number or percentage of a deployment's pods. To help teams expose a wider range of issues, we are planning trials of other, similar tools, including Gremlin, Istio's failure injection capabilities, and an internal solution from another Expedia Group brand.
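
As a taste of what Istio's failure injection offers, a VirtualService can delay or abort a slice of requests to a service without touching the application itself; the host, percentages and status code below are illustrative.

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: example-service-faults    # illustrative name
    spec:
      hosts:
      - example-service               # illustrative host
      http:
      - fault:
          delay:
            percentage:
              value: 10               # delay 10% of requests...
            fixedDelay: 5s            # ...by five seconds
          abort:
            percentage:
              value: 5                # abort 5% of requests...
            httpStatus: 503           # ...with a 503 response
        route:
        - destination:
            host: example-service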

But the tooling is only part of the story. To maximise its value we need to compare expected behaviour against actual behaviour so that we can track how our resilience improves over time. This means defining an application's steady state in given situations, and it is closely related to service level indicators, objectives, and agreements.
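
Purely as an illustration of what that could look like (the schema below is hypothetical, not a format we use), a steady-state definition for an experiment might pin an application's indicators to its objectives so that a pod-termination run can be judged a pass or a fail.

    # Hypothetical sketch of a steady-state definition; not an actual format we use.
    steadyState:
      service: example-service              # hypothetical service name
      during: pod-termination-experiment
      indicators:
      - name: availability                  # SLI: proportion of successful responses
        objective: ">= 99.9%"               # SLO that must hold during the attack
      - name: latency_p99                   # SLI: 99th-percentile response time
        objective: "<= 500ms"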

That’s Kube-Monkey in a nutshell, and it’s only a small part of resilience engineering. Our journey has just begun!

Note: This post was a joint effort of Andy Stanton, Daniel Albuquerque, and myself.
