I don’t know about resilience testing and so can you!

At Hotels.com we’ve been busy moving our systems to AWS. Good things aside, this move highlighted fragilities in some of the components that power our website, mostly due to the more dynamic nature of the cloud compared to our old datacenter.

My team has been following this topic for some time now. We’re not subject-matter experts and most of us had little to no previous experience with this, so we attended some meetups, read some blogs, watched some videos and then tried to apply the learnings in our work.

Contrary to what many think, the goal of chaos engineering is not to randomly break things. Instead, you want to run tests in very controlled scenarios to validate different hypotheses. These hypotheses represent assumptions about how you think your application will behave in the presence of a failure.

In this post I’ll show you how we started using Toxiproxy, a TCP proxy that simulates different network conditions, to help with one of the most common failures: a degradation in the latency of an upstream service.

Designing an experiment

  1. We start by defining what good looks like. Usually this means using HTTP status codes, response times, or other metrics to measure whether the system is operating within normal parameters.
  2. We hypothesise that when exposed to a failure event (e.g. added latency on the upstream service) the service will continue to operate and serve requests.
  3. We introduce the failure variables and compare the results against the baseline.

Much like with A/B testing, it’s common to use a control group and an experiment group when running these experiments. However, since we want to run these resilience tests in our build pipelines, we decided to run them against a single instance of our application and to run the experiment in several stages instead, where we add and remove the failure events and observe the results.

This reduces the fidelity of our tests, as there is a risk of propagating state from one stage to another, but it’s a small price to pay to keep our setup simple. A skeleton of this staged structure is sketched below.
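To make the staged approach more concrete, here is a minimal skeleton of how such a test can be structured. It assumes ScalaTest, and the helpers (fireRequestsAndCollectMetrics, injectFailure, removeFailure) are just stubs standing in for the real HTTP calls, metrics scraping and Toxiproxy interactions shown later in this post.

import org.scalatest.funsuite.AnyFunSuite

class ResilienceSpec extends AnyFunSuite {

  // Hypothetical helpers standing in for the real HTTP client, metrics endpoint and Toxiproxy calls
  case class Metrics(errorRate: Double, circuitBreakerOpen: Boolean)
  def fireRequestsAndCollectMetrics(): Metrics = Metrics(0.0, circuitBreakerOpen = false) // stub
  def injectFailure(): Unit = ()                                                          // stub
  def removeFailure(): Unit = ()                                                          // stub

  test("the app keeps serving requests when the primary upstream degrades") {
    // Stage 1: baseline - validate the steady state with no failures injected
    val baseline = fireRequestsAndCollectMetrics()
    assert(baseline.errorRate == 0.0 && !baseline.circuitBreakerOpen)

    // Stage 2: inject the failure (e.g. a latency toxic) and check the hypothesis
    injectFailure()
    val degraded = fireRequestsAndCollectMetrics()
    assert(degraded.errorRate == 0.0) // still serving, e.g. via a fallback

    // Stage 3: remove the failure and confirm we return to the steady state
    removeFailure()
    val recovered = fireRequestsAndCollectMetrics()
    assert(recovered.errorRate == 0.0 && !recovered.circuitBreakerOpen)
  }
}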

Running our first experiment

Based on the rules above, let’s design and run our first chaos experiment.

For the purposes of this example we’ll use a very basic setup: one app with 2 dependencies, a “primary” service and a “fallback” service that we can use if the first one is unavailable.

Hypothesis: If our primary upstream is unhealthy, the application will react by opening the circuit breakers and will use the fallback service instead.

In practical terms this hypothesis means that when our primary service is slow or unavailable, our application should automatically open the circuit breakers to prevent cascading failures and should start using the fallback service.
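To illustrate the pattern we’re assuming here, the sketch below wires a call through a circuit breaker with a fallback. This is not our production code: it uses resilience4j purely as an example, and OffersClient, primaryClient and fallbackClient are made-up names.

import io.github.resilience4j.circuitbreaker.CircuitBreaker
import scala.util.Try

// Made-up clients standing in for the real primary and fallback services
trait OffersClient { def fetchOffers(): List[String] }
object primaryClient extends OffersClient { def fetchOffers(): List[String] = List("offer-from-primary") }
object fallbackClient extends OffersClient { def fetchOffers(): List[String] = List("offer-from-fallback") }

val breaker = CircuitBreaker.ofDefaults("primaryService")

def fetchOffers(): List[String] =
  // Route the call through the breaker; once it opens (too many failed or slow calls),
  // executeSupplier fails fast and the fallback service is used instead
  Try(breaker.executeSupplier(() => primaryClient.fetchOffers()))
    .getOrElse(fallbackClient.fetchOffers())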

We will start by performing a number of requests to our app to validate the healthy state (our baseline). Then we will simulate a degradation in the primary upstream and validate our assumptions about the circuit breakers and about the automatic switch to the fallback. Finally, we will end the simulation by validating that the app returns to its normal state once the failure is removed.

Sprinkle with Testcontainers and Toxiproxy

To make all of this possible we’re using Testcontainers, a great Java library that allows us to programmatically start Docker containers. In this example we are using it to spin up our Toxiproxy container, as well as Cassandra and Wiremock instances that we will use as our primary and fallback services.

// primary service: Cassandra, wait until the CQL listener is up
val primaryContainer = new GenericContainer("cassandra:3.11.1")
  .withExposedPorts(9042)
  .waitingFor(new LogMessageWaitStrategy().withRegEx(".*Starting listening for CQL clients.*\\s"))
primaryContainer.start()

// fallback service: Wiremock, wait until the admin endpoint responds
val fallbackContainer = new GenericContainer("rodolpheche/wiremock:2.10.1-alpine")
  .withExposedPorts(8080)
  .waitingFor(Wait.forHttp("/__admin/"))
fallbackContainer.start()

The code is easy to understand even if you’ve never used Testcontainers before: we specify the Docker image we want to use, the ports we want to expose, and a wait strategy so Testcontainers can determine when the container is up and running (notice how we’re using two different strategies above).

Now that we have our initial setup ready, we add Toxiproxy so that we can simulate failures in the communication between these services. It will look a bit like the following diagram, where Toxiproxy acts as a proxy between our app and our victim.

We then start Toxiproxy, in exactly the same way we did for the other containers:

// Toxiproxy, exposing the port we will proxy through plus its HTTP API port (8474)
val toxiproxyContainer = new GenericContainer("shopify/toxiproxy:2.1.0")
  .withExposedPorts(upstreamPort, 8474)
toxiproxyContainer.start()

Now that Toxiproxy is running, we create a proxy for the upstream where we want to simulate the network failures.

// set up a client to interact with Toxiproxy via its HTTP API
val toxiproxyClient = new ToxiproxyClient("localhost", toxiproxyContainer.getMappedPort(8474))
// create a proxy for our primary service: Toxiproxy listens on `port` and forwards
// the traffic to the Cassandra container's mapped port on the Docker host
val primaryServiceProxy = toxiproxyClient.createProxy("primaryService", "0.0.0.0:" + port, dockerhost + ":" + primaryContainer.getMappedPort(9042))
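One piece that is not shown in these snippets is pointing our application at the proxy instead of directly at Cassandra, so that the traffic actually flows through Toxiproxy. How you do this depends entirely on how your app is configured; purely as an illustration, and assuming the proxy listens on the same upstreamPort we exposed on the Toxiproxy container, it could look something like this (the property names are made up):

// Hypothetical configuration override: the app now talks to Toxiproxy,
// which forwards the traffic to the real Cassandra container
System.setProperty("primary.service.host", "localhost")
System.setProperty("primary.service.port", toxiproxyContainer.getMappedPort(upstreamPort).toString)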

PS: Testcontainers supports linked containers, which should make this mapping cleaner, with fewer random ports flying around. We couldn’t use it due to some specifics of our build pipeline, but feel free to try it.

Running our test

Now that we have our setup in place, we run our test.

We fire requests against our application and use the HTTP response status codes and response times, as well as data that we collect from a metrics endpoint, to assert that it performed as expected and that the circuit breaker did not change status during the test.
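As an illustration of this baseline stage, here is a small sketch using Java’s built-in HTTP client. The /offers route, the /internal/metrics endpoint and the metric names are invented, so substitute whatever your application exposes.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

val http = HttpClient.newHttpClient()

// helper to GET a path on the app under test (hypothetical port 8080)
def get(path: String): HttpResponse[String] =
  http.send(
    HttpRequest.newBuilder(URI.create("http://localhost:8080" + path)).build(),
    HttpResponse.BodyHandlers.ofString())

// baseline: the app answers successfully and the circuit breaker is closed
(1 to 50).foreach(_ => assert(get("/offers").statusCode() == 200))
assert(get("/internal/metrics").body().contains("circuitBreaker.primaryService.open: false"))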

Next, we introduce a network failure by simulating high latency from the upstream. This is what it looks like:

// DOWNSTREAM and SECONDS come from eu.rekawek.toxiproxy.model.ToxicDirection and java.util.concurrent.TimeUnit
val latencyToxic = primaryServiceProxy.toxics().latency("latency", DOWNSTREAM, SECONDS.toMillis(10))

This is one of the many failure types supported by Toxiproxy; it adds a 10-second delay on the given proxy, in this case to the traffic coming back from the upstream service (that’s what the DOWNSTREAM direction means).
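Latency is only one of the toxics we could use here. Reusing the same proxy (and the DOWNSTREAM and SECONDS imports mentioned above), a couple of other examples from the toxiproxy-java client look like this; the names and values are just for illustration:

// throttle the responses coming back from the upstream to 100 KB/s
primaryServiceProxy.toxics().bandwidth("slowNetwork", DOWNSTREAM, 100)
// stop all data and close the connection after 5 seconds
primaryServiceProxy.toxics().timeout("connectionTimeout", DOWNSTREAM, SECONDS.toMillis(5))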

We then rerun the same test and perform our validations:

  • the circuit breaker did open
  • the fallback service was used
  • we still get valid responses regardless
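Continuing the hypothetical assertions from the baseline sketch above (same made-up get helper, endpoints and metric names), this stage could look something like:

// with the latency toxic in place the app should still answer successfully...
(1 to 50).foreach(_ => assert(get("/offers").statusCode() == 200))
// ...but now via the fallback, with the circuit breaker open
val metrics = get("/internal/metrics").body()
assert(metrics.contains("circuitBreaker.primaryService.open: true"))
assert(metrics.contains("fallbackService.used: true"))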

Finally we remove the toxic:

latencyToxic.remove()

And rerun the test to confirm that the circuit breaker closed automatically after the failure was removed.
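Depending on how the circuit breaker is configured, this transition back to closed is not instantaneous, so it’s usually better to poll for it rather than assert immediately. A crude sketch, reusing the made-up get helper and metric name from before:

// poll for up to ~30 seconds until the breaker reports closed again
def breakerClosed(): Boolean =
  get("/internal/metrics").body().contains("circuitBreaker.primaryService.open: false")

val closedAgain = (1 to 30).exists { _ =>
  breakerClosed() || { Thread.sleep(1000); false }
}
assert(closedAgain, "circuit breaker did not close after the toxic was removed")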

Commit, push

This was a very simple example to demonstrate the capabilities of this tool and how to integrate it into a build pipeline.

Apart from the small initial learning curve, having these types of tests will help you build confidence in the overall resiliency of your services and highlight where the weak spots are.

You can also use the same tool and have developers or QAs inject failures randomly in a staging environment and observe how your components behave. You will want to automate some of these tests, but exploratory tests are a good way to introduce the chaos engineering discipline to a team and to expose them to some of its practices.

Update

Since I wrote this post, the amazing folks at Testcontainers have made a new module for Toxiproxy available that simplifies the usage of these two tools. Head over to https://www.testcontainers.org/modules/toxiproxy/ and try it out.
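To give a rough idea of what using that module looks like, here is a sketch based on my reading of its documentation (double-check the current API before copying; the latency value and names are illustrative):

import org.testcontainers.containers.{GenericContainer, Network, ToxiproxyContainer}
import eu.rekawek.toxiproxy.model.ToxicDirection.DOWNSTREAM

val network = Network.newNetwork()

val cassandra = new GenericContainer("cassandra:3.11.1")
  .withExposedPorts(9042)
  .withNetwork(network)
val toxiproxy = new ToxiproxyContainer().withNetwork(network)
cassandra.start()
toxiproxy.start()

// the module creates and wires the proxy for us
val proxy = toxiproxy.getProxy(cassandra, 9042)
proxy.toxics().latency("latency", DOWNSTREAM, 10000)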

(thank you Richard North!)