Chaos Engineering 101

Say you’re developing a new web application — the next great thing everybody has been waiting for. After all the work you’ve done, it’s time to finally launch the service to the first customers. Now the hard question:

When do you know that the application is ready for production? More specifically, how can you be sure that the (distributed) system you’ve built is resilient enough to survive use in production?

The truth is: you can never be sure. You don’t know what’s going to happen. There will always be something that can — and will — go wrong, from self-inflicted outages caused by bad configuration pushes or buggy images to events that are outside your control like denial-of-service attacks or network failures. No matter how hard you try, you can’t build perfect software (or hardware, for that matter). Nor can the companies you depend on.

We live in an imperfect world. That’s just how it is. Accept it and focus on the things you can control: creating a quality product that is resilient to failures. Build software that is able to cope with both expected and unexpected events; gracefully degrade whenever necessary. As the saying goes, “Hope for the best and prepare for the worst”.

But how? How can you make sure you’re ready for disaster?

The first thing you need to do is to identify problems that could arise in production. Only then will you be able to address systemic weaknesses and make your systems fault-tolerant.

This is where Chaos Engineering comes in.

Principles of Chaos Engineering

Rather than waiting for things to break in production at the worst time, the core idea of Chaos Engineering is to proactively inject failures in order to be prepared when disaster strikes.

Netflix, a pioneer in the field of automated failure testing (and, by the way, also a great video-streaming service), defines Chaos Engineering as follows:

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

As a Chaos Engineer, you test a system’s ability to survive failures by simulating potential errors (aka failure modes) in a series of controlled experiments. These experiments typically consist of four steps:

  1. Define the system’s normal behavior — its “steady state” — based on measurable output like overall throughput, error rates, latency, etc.
  2. Hypothesize about the steady state behavior of an experimental group, as compared to a stable control group.
  3. Expose the experimental group to simulated real-world events such as server crashes, malformed responses, or traffic spikes.
  4. Test the hypothesis by comparing the steady state of the control group and the experimental group. The smaller the differences, the more confidence we have that the system is resilient.

Or, to put it in less scientific terms: intentionally break things, compare measured with expected impact, and correct any problems uncovered this way.
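To make the four steps a bit more concrete, here is a minimal Python sketch of the comparison in step 4. Everything in it is an assumption made for illustration: the sample format (latency in milliseconds plus a success flag), the function names, and the tolerance values are not taken from Netflix or any particular tool.

```python
from statistics import mean


def steady_state(samples):
    """Summarize request samples as error rate and mean latency.

    Each sample is a (latency_ms, ok) tuple (an assumed format).
    """
    if not samples:
        raise ValueError("need at least one sample")
    error_rate = sum(1 for _, ok in samples if not ok) / len(samples)
    mean_latency_ms = mean(latency for latency, _ in samples)
    return {"error_rate": error_rate, "mean_latency_ms": mean_latency_ms}


def hypothesis_holds(control, experiment,
                     max_error_rate_delta=0.01,
                     max_latency_delta_ms=50.0):
    """Return True if the experimental group stays close to the control group.

    The tolerances are illustrative defaults; pick values that match
    what "normal" means for your own system.
    """
    c = steady_state(control)
    e = steady_state(experiment)
    return (
        abs(e["error_rate"] - c["error_rate"]) <= max_error_rate_delta
        and abs(e["mean_latency_ms"] - c["mean_latency_ms"]) <= max_latency_delta_ms
    )
```

If `hypothesis_holds` returns False, the injected failure pushed the experimental group away from steady state, which is exactly the kind of weakness you want to find before your customers do.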

As an example, let’s say you want to know what happens if, for some reason, your MySQL database isn’t available. You hypothesize that, in this case, your web application would stop serving requests, immediately returning an error instead. To simulate the event, you block access to the database server. Afterwards, however, you observe that the app seems to take forever to respond. After some investigation, you find the cause — a misconfigured timeout — and fix it in a matter of minutes.
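The fail-fast behavior from the hypothesis could look roughly like the following sketch. It is not a real MySQL client: it only checks TCP reachability with an explicit timeout using Python's standard library, and the function names, the 0.5 second timeout, and the 503 response are assumptions chosen for illustration.

```python
import socket


def database_reachable(host, port, timeout_s=2.0):
    """Return True if a TCP connection succeeds within timeout_s seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def handle_request(db_host, db_port):
    """Hypothetical request handler that fails fast when the database is down."""
    if not database_reachable(db_host, db_port, timeout_s=0.5):
        return 503, "database unavailable"
    return 200, "ok"
```

The point is the explicit, short timeout: without one, a blocked connection attempt can hang for as long as the operating system allows, which matches the "takes forever to respond" symptom.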

As this example demonstrates, Chaos Engineering makes for effective resilience testing. It’s also a ton of fun.

How to get started

Netflix went the extra mile and built several autonomous agents, so-called “monkeys”, for injecting failures and creating different kinds of outages. For example, Chaos Monkey randomly terminates virtual machines, Latency Monkey induces artificial delays in API calls to simulate service degradation, and Chaos Gorilla simulates the outage of an entire Amazon availability zone. Together they form the Simian Army.
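The idea behind random termination is simple enough to sketch in a few lines of Python. This is not Netflix's implementation: the instance names, the `terminate` callback, and the `protected` set are hypothetical, and the default is a dry run so that nothing gets killed by accident.

```python
import random


def pick_victim(instances, protected=(), rng=random):
    """Pick one random instance that is not on the protected list."""
    candidates = [i for i in instances if i not in protected]
    if not candidates:
        return None
    return rng.choice(candidates)


def unleash(instances, terminate, protected=(), dry_run=True, rng=random):
    """Choose a victim and, unless dry_run is set, terminate it.

    `terminate` is a caller-supplied function (for example a wrapper
    around your cloud provider's API); it is an assumption of this sketch.
    """
    victim = pick_victim(instances, protected, rng)
    if victim is not None and not dry_run:
        terminate(victim)
    return victim
```

Passing the termination logic in as a callback keeps the selection code independent of any particular infrastructure and makes it easy to try out safely.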

While the Simian Army might be a novel concept, you don’t need to automate experiments to run continuously when you’re just getting started. Besides, it is best to introduce a company to the concept of Chaos Engineering by starting small.

So rather than wreaking havoc on your production system from day one, start by experimenting in an isolated staging environment (if you don’t have a pre-production environment yet, now would be the perfect time to create one). While the two environments are likely to be different in more or less subtle ways, any resilience testing is better than no resilience testing. Later, when you feel more confident, conduct some of the experiments — or preferably all of them — in production. Remember: Chaos Engineering is focused on controlled failure-injection. You make the rules!

The purpose of our chaos experiments is to simulate disaster conditions. This might sound like a difficult task — and yes, it does require a lot of creativity — but in the beginning it’s easiest to focus on availability, or rather the lack of it: inject failures so that certain pieces of your infrastructure become unavailable. Intentionally terminate cluster machines, kill worker processes, delete database tables, cut off access to internal and external services, etc. This way you can learn a lot about the coupling of your system and discover subtle dependencies you would otherwise overlook.

Later on you might want to simulate other events capable of disrupting steady state, like high latency caused by slow network performance. These experiments are generally harder to pull off and often require special tooling, but the takeaways are worth the extra effort.
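Dedicated tooling usually injects latency at the network level, but for a first experiment you can approximate the effect inside the application. The following Python decorator is a minimal sketch under that assumption; the name `with_latency` and its parameters are made up for illustration.

```python
import functools
import random
import time


def with_latency(min_s, max_s, probability=1.0, sleep=time.sleep, rng=random):
    """Delay calls to the wrapped function by a random amount of time.

    probability is the fraction of calls that get delayed (0.0 to 1.0).
    sleep and rng are injectable so the behavior can be tested without waiting.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(rng.uniform(min_s, max_s))
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Usage: slow down a hypothetical downstream call by 100 to 300 ms.
@with_latency(0.1, 0.3)
def fetch_profile(user_id):
    return {"id": user_id}
```

Wrapping only the calls to one dependency at a time makes it easier to attribute any change in steady state to that dependency.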

Whatever you decide to do, you’ll be surprised how much you can learn from chaos.

Plan, execute, measure, adjust

Briefly, here are the steps involved in conducting chaos experiments. This list is based on my own experience participating in GameDay events at Jimdo.

  1. Start by planning the experiments. Compile a list of potential failure modes, how you want to simulate them, and what you think the impact will be. I recommend using a spreadsheet.
  2. Pick a date. Inform stakeholders of affected systems, especially if you anticipate any trouble for customers.
  3. Gather the whole team in front of a big screen and go through the experiments together. This is the best way to transfer knowledge and develop a shared mindset.
  4. After each experiment, write down the actual measured impact.
  5. For each discovered flaw, put together a list of countermeasures. Don’t implement them right away! Add any follow-up items to your issue tracker.

Make sure to repeat the experiments on a regular basis (at least once every quarter) to detect regressions as well as new problems. Don’t forget to bring your spreadsheet.
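If you would rather keep the experiment list in version control than in a spreadsheet application, a small Python sketch like the one below can produce a CSV file that any spreadsheet tool opens. The column names mirror the steps above but are otherwise my own assumption.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Experiment:
    failure_mode: str          # what could go wrong
    simulation: str            # how you plan to simulate it
    expected_impact: str       # your hypothesis
    actual_impact: str = ""    # filled in after the experiment
    follow_up: str = ""        # issue tracker reference, if any


def to_csv(experiments):
    """Render a list of Experiment records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(Experiment)])
    writer.writeheader()
    for experiment in experiments:
        writer.writerow(asdict(experiment))
    return buf.getvalue()
```

Keeping expected and actual impact side by side makes regressions easy to spot when you repeat the experiments next quarter.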

I hope you enjoyed this introduction to Chaos Engineering — a powerful, if somewhat radical, approach to building resilient systems.

As always, the best way to internalize new concepts is by practice. Therefore, start running your own chaos experiments! It’s well worth it.

This article first appeared on the Production Ready mailing list.