Chaos Monkey for Fun and Profit

https://github.com/Netflix/SimianArmy

This article will pick up where Chaos Engineering 101 left off and cover a slightly more advanced principle of Chaos Engineering: automating chaos experiments.

The core idea of Chaos Engineering, as we recall, is to inject failures proactively in a controlled manner in order to gain confidence in our systems. Chaos Engineering enables us to verify that things behave as we expect — and to fix them if they don’t.

In Chaos Engineering 101, I argued that you don’t need to automate chaos experiments when you’re just getting started. I still think that manual testing, which can be as simple as terminating a process with the kill command, is the easiest way to get familiar with the concept of fault injection and to gradually establish the right mindset. At the end of the article, I explained how Jimdo runs GameDay events, which are typically based on manual fault injection as well.
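As a minimal sketch of such a manual experiment, the following shell snippet starts a throwaway process, kills it, and confirms how it exited. The `sleep` command is only a stand-in for a real service, and the variable names are illustrative.

```shell
# Start a throwaway background process as a stand-in for a real service.
sleep 300 &
victim_pid=$!

# Inject the fault: ask the process to terminate (SIGTERM is kill's default).
kill "$victim_pid"

# Reap the process and capture its exit status without aborting under `set -e`.
victim_status=0
wait "$victim_pid" || victim_status=$?

# A process killed by a signal reports 128 + the signal number (143 for SIGTERM).
echo "process $victim_pid exited with status $victim_status"
```

In a real experiment you would replace `sleep` with the process under test and then watch how the rest of the system reacts to its disappearance.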

The next level of chaos

The Principles of Chaos Engineering, as formulated by Netflix, currently list four advanced principles of chaos. The document says:

The [advanced] principles describe an ideal application of Chaos Engineering […] The degree to which these principles are pursued strongly correlates to the confidence we can have in a distributed system at scale.

The one principle we’re interested in today is described as follows:

Running experiments manually is labor-intensive and ultimately unsustainable. Automate experiments and run them continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.

In other words, the Principles suggest automating experiments (that used to be manual) to run continuously in order to further increase confidence in our systems.

Fortunately, Netflix doesn’t just tell us what to do; they also gave us a powerful tool for putting theory into practice: Chaos Monkey.

Chaos Monkey

Netflix went the extra mile and built several autonomous agents, so-called “monkeys”, for injecting failures and creating different kinds of outages in an automated manner. Latency Monkey, for example, induces artificial delays in API calls to simulate service degradation, whereas Chaos Gorilla is programmed to take down an entire AWS availability zone. Together these monkeys form the Simian Army.

Chaos Monkey is the most famous member of Netflix’s Simian Army. In fact, it’s the first, and to date the only, monkey of its kind that is publicly available. Broadly speaking, Chaos Monkey randomly terminates EC2 instances in AWS. Here is a more thorough description from the Netflix blog:

[Chaos Monkey is] a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption.

The post continues:

By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice.

Next, I’ll show you how to run your very own Chaos Monkey.

The Simian Army — Docker Edition

I spent a couple of hours dockerizing the Simian Army, a Java application with dozens of settings, to make it as simple as possible to use Chaos Monkey. The result is a highly configurable Docker image which, I hope, provides a sound basis for automating chaos experiments.

As an example, this command will start a Docker container running the Simian Army and instruct Chaos Monkey to consider all auto scaling groups (ASGs) in the given AWS account for termination:

docker run -it --rm \
  -e SIMIANARMY_CLIENT_AWS_ACCOUNTKEY="$AWS_ACCESS_KEY_ID" \
  -e SIMIANARMY_CLIENT_AWS_SECRETKEY="$AWS_SECRET_ACCESS_KEY" \
  -e SIMIANARMY_CLIENT_AWS_REGION="$AWS_REGION" \
  -e SIMIANARMY_CALENDAR_ISMONKEYTIME=true \
  -e SIMIANARMY_CHAOS_ASG_ENABLED=true \
  mlafeldt/simianarmy

This example is safe to run because Chaos Monkey operates in dry-run mode by default. It’s a good way to get a feel for the application without taking any risk.
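If you run this command often, it can help to wrap it in a small script that refuses to start when the AWS credentials are missing. The sketch below is one possible way to do that; the function names `missing_aws_vars` and `run_chaos_monkey_dry` are hypothetical, while the environment variables and the image name are the ones used in the command above.

```shell
# Print the name of every required variable that is unset or empty.
missing_aws_vars() {
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_REGION; do
    # Indirect lookup in POSIX sh; the names come from the fixed list above.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Start Chaos Monkey in its default (dry-run) mode, but only if the
# required AWS variables are present.
run_chaos_monkey_dry() {
  missing=$(missing_aws_vars)
  if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are printed on a single line.
    echo "Missing required variables:" $missing >&2
    return 1
  fi
  docker run -it --rm \
    -e SIMIANARMY_CLIENT_AWS_ACCOUNTKEY="$AWS_ACCESS_KEY_ID" \
    -e SIMIANARMY_CLIENT_AWS_SECRETKEY="$AWS_SECRET_ACCESS_KEY" \
    -e SIMIANARMY_CLIENT_AWS_REGION="$AWS_REGION" \
    -e SIMIANARMY_CALENDAR_ISMONKEYTIME=true \
    -e SIMIANARMY_CHAOS_ASG_ENABLED=true \
    mlafeldt/simianarmy
}
```

Calling `run_chaos_monkey_dry` then either starts the container or tells you which variables still need to be exported.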

The second example is more realistic and could very well be your first chaos experiment to run continuously. This time, Chaos Monkey will randomly terminate instances of the auto scaling groups tagged with a specific key-value pair:

docker run -it --rm \
  -e SIMIANARMY_CLIENT_AWS_ACCOUNTKEY="$AWS_ACCESS_KEY_ID" \
  -e SIMIANARMY_CLIENT_AWS_SECRETKEY="$AWS_SECRET_ACCESS_KEY" \
  -e SIMIANARMY_CLIENT_AWS_REGION="$AWS_REGION" \
  -e SIMIANARMY_CALENDAR_ISMONKEYTIME=true \
  -e SIMIANARMY_CHAOS_ASG_ENABLED=true \
  -e SIMIANARMY_CHAOS_ASGTAG_KEY=chaos_monkey \
  -e SIMIANARMY_CHAOS_ASGTAG_VALUE=true \
  -e SIMIANARMY_CHAOS_LEASHED=false \
  mlafeldt/simianarmy

Note that this command will actually unleash the monkey. But don’t worry: you still need to tag your ASGs accordingly for any instances to be killed.
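Tagging an ASG so that it matches the key-value pair above can be done with the AWS CLI. The following is a sketch, assuming the AWS CLI is installed and configured; the helper function names and the ASG name `my-asg` in the usage note are hypothetical, and `PropagateAtLaunch=false` is a choice made for this example, since only the group itself needs to carry the tag.

```shell
# Build the tag specification that marks an ASG as a Chaos Monkey target.
# The key and value match SIMIANARMY_CHAOS_ASGTAG_KEY and _VALUE above.
chaos_tag_spec() {
  printf 'ResourceId=%s,ResourceType=auto-scaling-group,Key=chaos_monkey,Value=true,PropagateAtLaunch=false' "$1"
}

# Opt an ASG in to chaos experiments by adding the tag.
opt_in_asg() {
  aws autoscaling create-or-update-tags --tags "$(chaos_tag_spec "$1")"
}

# Opt an ASG out again by removing the tag.
opt_out_asg() {
  aws autoscaling delete-tags \
    --tags "ResourceId=$1,ResourceType=auto-scaling-group,Key=chaos_monkey,Value=true"
}
```

For example, `opt_in_asg my-asg` would make the instances of `my-asg` eligible for termination, and `opt_out_asg my-asg` would take them out of scope again.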

There are many more configuration settings you can pass to the Docker image, including ones to control frequency, probability, and type of terminations. Also, you can (and should) configure Chaos Monkey to send email notifications about terminations. I encourage you to read the documentation to learn more.

As always, it’s a good idea to start small. I strongly recommend testing Chaos Monkey in a staging environment before unleashing it in production.

This article isn’t meant to be a comprehensive guide on operating Chaos Monkey in production. However, I want to at least mention that observability, through monitoring and other means, is very important when it comes to chaos experiments, even more so when they’re automated. We want to eliminate customer impact as quickly as possible.

Manual vs. automated fault injection

Now that running Chaos Monkey is only a single command away, should we stop manual testing altogether?

The answer is hell, no!

Chaos Monkey is a useful tool to discover weaknesses in your systems caused by various kinds of instance failures, failures that you’d otherwise have to inject manually (or by developing your own tools).
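To give a rough idea of what a home-grown tool would involve, here is a deliberately naive sketch that terminates one random instance of a given ASG using the AWS CLI. It is not how Chaos Monkey is implemented; all function names are hypothetical, and it has none of Chaos Monkey’s safeguards such as schedules, probabilities, or notifications.

```shell
# Pick one of the arguments at random; fail if none are given.
# Note: awk's srand() seeds from the clock, so calls within the same
# second may return the same choice.
pick_random() {
  [ "$#" -gt 0 ] || return 1
  printf '%s\n' "$@" |
    awk 'BEGIN { srand() } { item[NR] = $0 } END { print item[int(rand() * NR) + 1] }'
}

# List the instance IDs of an ASG, separated by whitespace.
list_asg_instances() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --query 'AutoScalingGroups[0].Instances[].InstanceId' \
    --output text
}

# Terminate one randomly chosen instance of the given ASG.
terminate_random_instance() {
  asg_name=$1
  # Word splitting is intentional: the CLI returns whitespace-separated IDs.
  victim=$(pick_random $(list_asg_instances "$asg_name")) || return 1
  # Bail out if the CLI returned something that is not an instance ID.
  case "$victim" in
    i-*) ;;
    *) echo "No instances found in $asg_name" >&2; return 1 ;;
  esac
  echo "Terminating $victim from $asg_name"
  aws ec2 terminate-instances --instance-ids "$victim"
}
```

Even this toy version shows why a ready-made tool is attractive: everything beyond picking and killing an instance still has to be built and maintained.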

GameDay events, on the other hand, bring the whole team together to think about failure modes and conduct chaos experiments, which is ideal for transferring knowledge and fostering a shared mindset. They might also be the only way to test more complex scenarios that are hard or impossible to automate.

That being said, both manual and automated fault injection are valuable — and both have limitations. I will continue to apply these two approaches and share my experience with you.

