Chaos Engineering — Part 1

The art of breaking things purposefully

Adrian Hornsby

Why?

When firefighters work under live-fire conditions, they need an intuition for the fire they are fighting. Acquiring that lifesaving intuition takes hour after hour of training. As the old adage goes, practice makes perfect.


Once upon a time in Seattle

In the early 2000s, Jesse Robbins, whose official title was Master of Disaster at Amazon, created and led a program called GameDay, a program inspired by his experience training as a firefighter. GameDay was designed to test, train and prepare Amazon systems, software, and people to respond to a disaster.

“GameDay: Creating Resiliency Through Destruction” — Jesse Robbins

The rise of the monkeys

You’ve probably heard of Netflix, the online video provider. Netflix began moving out of its own datacenter and into the AWS Cloud in August 2008, a push stimulated by a major database corruption that halted DVD shipments for three days. Yes, Netflix started out by sending movies through traditional snail mail.

The migration to the cloud was driven by the need to support a much larger streaming workload and to move from a monolithic architecture to a micro-services architecture that could scale with both more customers and a bigger engineering team. The customer-facing streaming service moved to AWS between 2010 and 2011; corporate IT and everything else eventually followed, and the datacenter was closed in 2016. Netflix measures availability as successful customer requests versus failures to start streaming a movie, not as simple uptime versus downtime, and it targeted, and often achieved, four nines of availability in each region on a quarterly basis. Its global architecture spans three AWS regions, and customers can be moved between regions in the event of a problem in one of them.

“Failures are a given, and everything will eventually fail over time.” — Werner Vogels

Indeed, failures in distributed systems, especially at scale, are unavoidable, even in the cloud. However, the AWS Cloud and its redundancy primitives, in particular the multi-Availability Zone design principle on which it is built, allow anyone to build highly resilient services.

By running experiments on a regular basis that simulate a Regional outage, we were able to identify any systemic weaknesses early and fix them — Netflix blog

Chaos engineering principles have now been formalized and the following definition given:

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. http://principlesofchaos.org/

However, in his AWS re:Invent 2018 talk on chaos engineering, former Netflix cloud architect Adrian Cockcroft, who helped lead the company’s shift to an all-cloud infrastructure, offers an alternative definition that, in my opinion, is more precise and boiled down.

“Chaos Engineering is an experiment to ensure that the impact of failures is mitigated”

Indeed, we know that failures happen all the time, and that they shouldn’t impact customers if they are mitigated correctly — and chaos engineering’s main purpose is to uncover failures that are not being mitigated.


Prerequisites to chaos

Before starting your journey into chaos engineering, make sure you’ve done your homework and have built resiliency into every level of your organization. Building resilient systems isn’t all about software. It starts at the infrastructure layer, progresses to the network and data, influences application design, and extends to people and culture. I’ve written extensively about resiliency patterns and failures in the past (here, here, here and here), so I won’t expand on that here, but the following is a little reminder.

Some must-have items before introducing chaos (list not exhaustive)

The Phases of Chaos Engineering

It’s important to understand that chaos engineering is NOT about letting monkeys loose to break things randomly and without purpose. Chaos engineering is about breaking things in a controlled environment, through well-planned experiments, in order to build confidence in your application’s ability to withstand turbulent conditions.


1 — Steady State

One of the most important parts of chaos engineering is to first understand the behavior of the system under normal conditions.

Measure, Measure, and Measure Again.

It goes without saying that if you can’t properly measure your system, you can’t monitor drifts from the steady state, or even find one in the first place. Invest in measuring everything, from the network, machine, and application levels all the way to the people level. Draw graphs of these measurements, even when they aren’t drifting. You’ll be surprised by correlations you never expected.
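
To make “measure everything” concrete, here is a minimal sketch of how one steady-state metric, the rate of successful requests, could be pulled from CloudWatch. The namespace and metric names are hypothetical placeholders I’ve made up for illustration; they would need to match whatever metrics your application actually publishes.

```python
# Minimal sketch: compute a steady-state metric (success rate) from CloudWatch.
# The namespace and metric names below are hypothetical placeholders.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")


def metric_sum(metric_name, start, end):
    """Sum a CloudWatch metric over [start, end] at one-minute resolution."""
    stats = cloudwatch.get_metric_statistics(
        Namespace="MyApp",          # hypothetical namespace
        MetricName=metric_name,     # hypothetical metric name
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in stats["Datapoints"])


def request_success_rate(minutes=60):
    """Fraction of successful requests over the last `minutes` minutes."""
    end = datetime.utcnow()
    start = end - timedelta(minutes=minutes)
    ok = metric_sum("SuccessfulRequests", start, end)
    ko = metric_sum("FailedRequests", start, end)
    total = ok + ko
    return ok / total if total else 1.0


print(f"Steady-state success rate: {request_success_rate():.4%}")
```

A metric like this, graphed continuously, is also exactly what you will watch during an experiment to decide whether the system is drifting from its steady state.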

2 — Hypothesis

Once you’ve nailed your steady state, you can start making your hypothesis.

Make it everyone’s problem!

When making your hypothesis, bring the entire team around the table. YES, everyone: the product owner, the technical product manager, backend and frontend developers, designers, architects, and anyone else involved in one way or another with the product.

3 — Design and run the experiment

Docker stop database in action

How many customers are affected?
What functionality is impaired?
Which locations are impacted?
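
As a minimal sketch of the “docker stop database” experiment captioned above, the snippet below uses the Docker SDK for Python to stop a container, hold the outage for a short window, and then restart it. The container name database and the 60-second window are assumptions for illustration, not values from the article.

```python
# Sketch of a "docker stop database" experiment with the Docker SDK for Python.
# The container name "database" and the outage window are assumptions.
import time

import docker

client = docker.from_env()


def stop_database_experiment(container_name="database", outage_seconds=60):
    container = client.containers.get(container_name)

    print(f"Stopping '{container_name}' to simulate a database outage...")
    container.stop()

    # Watch your steady-state metrics and alarms during this window.
    time.sleep(outage_seconds)

    print(f"Restarting '{container_name}' to return to steady state...")
    container.start()


stop_database_experiment()
```

Keep the blast radius small: run something like this against a test stack first, and only widen the scope once the answers to the questions above are well understood.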

Try to have an emergency stop button, or at least a way to halt your experiment and get back to the normal steady state as fast as possible. I love conducting experiments using canary deployments, a technique that reduces the risk of failure when a new version of an application enters production by gradually rolling the change out to a small subset of users before making it available to everybody. I love canary deployments simply because they support the principle of immutable infrastructure and because stopping the canary experiment is fairly easy.

Example DNS-based canary deployment for chaos experiments
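
As a rough sketch of such a DNS-based canary, the snippet below uses Route 53 weighted records: a small weight sends a slice of traffic to the canary stack under experiment, and setting the weight back to zero acts as the emergency stop. The hosted zone ID, record names, and weights are hypothetical placeholders, and a matching “stable” weighted record is assumed to carry the rest of the traffic.

```python
# Sketch of a DNS-based canary for chaos experiments using Route 53 weighted
# routing. Hosted zone ID, names, and weights are hypothetical placeholders;
# a matching "stable" weighted record is assumed to carry the remaining traffic.
import boto3

route53 = boto3.client("route53")


def set_canary_weight(weight):
    """UPSERT the canary record; weight=0 is the emergency stop."""
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",  # placeholder hosted zone
        ChangeBatch={
            "Comment": "Chaos experiment canary",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com.",
                    "Type": "CNAME",
                    "SetIdentifier": "chaos-canary",
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "canary-lb.example.com"}],
                },
            }],
        },
    )


set_canary_weight(5)   # route a small share of traffic to the canary stack
# set_canary_weight(0) # ...and pull it back instantly if things go wrong
```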

4 — Learn and verify

In order to learn and verify, you need to measure. As stated previously, invest in measuring everything! Then quantify the results, and always, always start with measuring the time to detect. I’ve lived through several outages where the alerting system failed and customers, or Twitter, became the alarm … and trust me, you don’t want to end up in that situation. So use chaos experiments to test your monitoring and alerting systems as well.

Time to detect?
Time for notification? And escalation?
Time to public notification?
Time for graceful degradation to kick-in?
Time for self-healing?
Time to recovery — partial and full?
Time to all-clear and stable?
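
A low-tech way to quantify these timings is to timestamp every event as the experiment unfolds and compute the deltas afterwards. The sketch below shows one possible structure for that bookkeeping; the event names are hypothetical and should follow your own runbook.

```python
# Sketch: timestamp experiment events and compute the timings listed above.
# Event names are hypothetical; adapt them to your own runbook.
from datetime import datetime, timezone


class ExperimentTimeline:
    def __init__(self):
        self.events = {}

    def mark(self, name):
        """Record the wall-clock time of an event, e.g. 'detected'."""
        self.events[name] = datetime.now(timezone.utc)

    def seconds_between(self, start, end):
        """Elapsed seconds between two recorded events."""
        return (self.events[end] - self.events[start]).total_seconds()


timeline = ExperimentTimeline()
timeline.mark("fault_injected")
# ... run the experiment: alarms fire, people and automation react ...
timeline.mark("detected")
timeline.mark("recovered")

print("Time to detect: ", timeline.seconds_between("fault_injected", "detected"), "s")
print("Time to recover:", timeline.seconds_between("fault_injected", "recovered"), "s")
```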

Remember that there’s rarely a single, isolated cause of an accident. Large incidents almost always result from several smaller failures that add up to create a larger-scale event.

5 — Improve and fix it!

The most important lesson here is to prioritize fixing the findings of your chaos experiments over developing new features! Get upper management to enforce that process and buy into the idea that fixing current issues is more important than continuing the development of new features.

The benefits of chaos engineering

The benefits are multiple, but I’ll outline the two that I think are the most important:

