The Language of Chaos Experiments in Chaos Toolkit

Working Towards a General Purpose Declarative Language for Chaos Engineering Experiments

Chaos Engineering is emerging as a valuable discipline: it encourages technical discussion and is an endless source of real-world field stories. Who hasn’t got a story of a failed release in production or a major failure due to a small should-not-have-broken-anything change? These painful tales are the bread-and-butter of chaos engineering.

A Short History of the Chaos Toolkit

“We were trying to define a common, declarative API for chaos engineering”

Chaos Engineering was initially popularised by Netflix through the use of its open source Simian Army. Netflix showed how it was possible to build confidence in a complex, ever-changing platform in order to ultimately ensure a smooth, high-quality user experience.

Whenever Sylvain and I encounter these tales of woe from our friends, colleagues and even the larger community that attends our talks, we find ourselves easily translating these painful events into great material for Chaos Engineering experiments. This is not to say that Chaos Engineering would help these people look into the future and avoid all nasty outcomes; rather, it is a discipline that helps us all explore and learn about our systems’ weaknesses and, hopefully, find responses to those weaknesses.

As the discussion evolved we realised we wanted to crystallise ways of exploring our system. Obviously, the Chaos Engineering book from the Netflix crew is a great starting point, but we wanted to explore things in code a bit deeper.

This is where the free and open source Chaos Toolkit started its life. We thought it was critical that Chaos Engineering should go beyond Chaos Engineers. Indeed, it is a systemic question and all engineers should be able to make sense of experiments and their outcomes.

To meet this ideal we decided early on to use a declarative approach to defining experiments. That would help engineers share experiments and focus on the goal of learning about the system, and it would let us take those chaos experiment declarations and drive them against any number of chaos tooling integrations.

We were trying to define a common, declarative API for chaos engineering: no small task, but a worthy goal. Just as tools such as Terraform aid in working with multiple cloud environments, we wanted the same ease of working with multiple chaos engineering environments and tools in a consistent and human-readable way.

The trick was that we needed to decide what an experiment could actually be made up of, declaratively. What concepts were needed? Was there any general-purpose approach that could be applied at all? Were we trying to do the impossible?

From Concepts to (Declarative) Code in Chaos Experiments

The first decision we made was that the chaos experiments would be declared in something that was implementation-language agnostic. Taking inspiration from tools such as Terraform and Kubernetes, we decided to use JSON for our experiment definitions. That’s not to say that other people couldn’t take the same concepts and express them in pretty much any language they like, but to drive the Chaos Toolkit with those custom definitions they would first need to be translated into JSON.

Now that we had a format it was time to look at what concepts needed to be expressed in that format.

Version, Title, Description & Tags

Chaos Toolkit experiments start with a Version, Title, Description and some Tags

The first concept that we needed to capture was that of a Version in our chaos experiments. This is not the version of the experiment itself, rather the version of the language that Chaos Toolkit understands.

We expect to evolve our declarative approach, and while we will constantly strive to maintain backwards compatibility, there will at times be breaking changes to the language. We wanted users of the toolkit to be able to specify exactly which language version they expressed their chaos experiments in.

Next comes a human-readable Title that explains some of the reason for the experiment. Further explanation, and perhaps some references to incident reports, goes in the Description field. The whole block is closed off with some Tags that simply label the experiment for indexing purposes.
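
As a minimal sketch of this opening block, the fragment below shows the four fields side by side. The field names follow the Chaos Toolkit experiment format as publicly documented. Every value is an illustrative assumption: the version should be whichever language version your installation of the toolkit understands, and the title, description and tags describe a hypothetical cache-outage scenario. JSON has no comment syntax, so these assumptions are stated here rather than inline.

```json
{
    "version": "1.0.0",
    "title": "Does our service tolerate the loss of its cache?",
    "description": "Follow-up to a hypothetical incident in which a cache outage cascaded into failed checkouts.",
    "tags": ["cache", "checkout"]
}
```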

Defining Steady State

The next step is to define your “Steady State Hypothesis” for the experiment. This is, understandably, defined using the steady-state-hypothesis block:

Define what “healthy”, “normal”, or “expected” looks like using a Steady State Hypothesis

Using a number of probes, the steady-state-hypothesis defines what “normal and healthy” looks like in our system. The steady-state-hypothesis is a crucial concept: to make sense of any impact we have during our chaos experiment, we need a good initial take on what the system should maintain in the face of that stress.

As such it is the steady-state-hypothesis that defines how the system must look in order to execute the experiment, and also what will be analysed when the experiment’s method has been completed to see what effects have been discovered that may have deviated from that hypothesis.
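
A steady-state-hypothesis with a single probe might look something like the sketch below. The keys (`title`, `probes`, `tolerance`, `provider`) reflect the documented experiment format as best understood here, so check them against the version of the toolkit you run. The probe name, the health-check URL and the expected status code of 200 are hypothetical.

```json
{
    "steady-state-hypothesis": {
        "title": "The service is available and healthy",
        "probes": [
            {
                "type": "probe",
                "name": "service-responds-with-ok",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/health",
                    "timeout": 3
                }
            }
        ]
    }
}
```

Here the `tolerance` is what turns a plain measurement into a hypothesis: the probe’s result is compared against it both before the method runs and again afterwards.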

The Experimental Method

Next comes the experimental method itself where actions are executed to cause real-world events that will then be analysed against the steady-state-hypothesis once the method is concluded.

One of the things that we’ve noticed is that we often end up putting a probe or two in alongside the actions in our experiments. These probes are just sourcing additional data for the experiment’s report; they are not being used for anything conditional in terms of the experiment’s method.

So why have these probes? Isn’t the steady-state-hypothesis enough? What we’ve seen is that when an experiment is executed there are sometimes surprising, and insightful, impacts outside of the remit of the experiment. These are real learnings but, until we know more, we can’t just make them part of the steady-state. In those cases we often add those new findings as probes to the experiment so that we can source further data on them as we continually run the experiment. Eventually some of those probes can be characterised as contributing to the steady-state of the system and we promote them into the steady-state-hypothesis, but only if they really are representative of what the steady-state across the system should look like.
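
To make the mix of actions and data-gathering probes concrete, here is a hedged sketch of a method with one of each. The structure (`type`, `name`, `provider`, `pauses`) follows the documented format as understood here. The `docker stop cache` command, the container name and the metrics URL are invented for illustration. The exact form accepted for a process provider’s `arguments` has varied between toolkit versions, so treat it as an assumption too.

```json
{
    "method": [
        {
            "type": "action",
            "name": "stop-the-cache",
            "provider": {
                "type": "process",
                "path": "docker",
                "arguments": "stop cache"
            },
            "pauses": {
                "after": 10
            }
        },
        {
            "type": "probe",
            "name": "fetch-checkout-error-rate",
            "provider": {
                "type": "http",
                "url": "http://localhost:8080/metrics/checkout-errors"
            }
        }
    ]
}
```

The probe in this method has no `tolerance`, so it only records a value for the report. Promoting it into the steady-state-hypothesis later would mean moving it there and stating what value it is expected to return.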

Finally, Rollbacks

A well-behaved experiment should, it can be argued, put things back the way they were when the experiment is concluded. This is the job of the rollbacks section of the Chaos Toolkit’s declarative experiment format. It is an optional collection of, typically, action blocks that can be executed to rectify any impacts made during the experiment’s method, assuming you actually want to put things back the way they were. In our experience there are times when not putting the system back into its original state can elicit even further learnings over time, which is one reason why you might not want any actions in your rollbacks section.
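
A rollbacks section that undoes a hypothetical “stop the cache” action could be sketched as follows. As with the other fragments, the key names follow the documented format as understood here, while the command and container name are assumptions made for illustration.

```json
{
    "rollbacks": [
        {
            "type": "action",
            "name": "restart-the-cache",
            "provider": {
                "type": "process",
                "path": "docker",
                "arguments": "start cache"
            }
        }
    ]
}
```

One plausible way to express the “leave things as they are” case is an empty list, `"rollbacks": []`.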

And that’s it, that completes the current version of our Chaos Toolkit experiment’s declaration. A full experiment can be found in the growing Chaos Toolkit samples project and is shown below:

A Chaos Experiment Language and Toolkit for Everyone

“we want your input more than ever…”

The aim of developing the Chaos Toolkit was to make it as easy as possible for people to try out chaos engineering with some automated experiments. Once they had seen the value in the approach, they could scale up their integration through the Chaos Toolkit to even larger and more powerful mechanisms for inducing system failure and learning from it, through tools such as Gremlin and Pumba amongst a growing list of others.

Chaos Toolkit would grow with you and your Chaos Engineering capability as you go further and further to gain confidence in your distributed systems.

To make this vision a reality we wanted the toolkit to truly be a community-driven project, and so we wanted to make sure that the maximum number of people could define, read and comprehend chaos experiments. That’s what’s at the heart of this latest version of the Chaos Toolkit’s declarative language for defining chaos experiments.

But we’re not done yet. Although the current version of Chaos Toolkit’s declarative language for experiments is being used by a number of companies to build real-world chaos experiments, we still want your feedback on it, whether as new issues on the Chaos Toolkit project, by coming to chat with us on our Chaos Toolkit Slack team, or even just by asking questions as comments on this blog post!

Really delivering on our commitment to make Chaos Toolkit the open and free API to Chaos Engineering means we want your input more than ever. There has never been a better time to get involved and start a conversation with us.

Going beyond the language of chaos experiments, in a future blog post I’m going to explore the different ways you can extend the toolkit with your own experiment actions and probes, turning the Chaos Toolkit into a truly open tool for your own bespoke chaos engineering needs.