Chaos Engineering for the Business

6 Tips for how to explain Chaos Engineering to non-technical stakeholders

Russ Miles
Jan 8, 2018

“We’re going to do chaos engineering” … possibly the last phrase you’d use in a brief career as a CTO or technical leader in your company.

You’ve been reading up and listening to clever folks talk about chaos engineering and you’ve come to the conclusion that it makes some sense for your company. You’ve jumped the hurdles to get yourself as a bullet item in the monthly board meeting, and now you’re standing in front of a room full of forbidding faces and you’ve dropped the genius on them … what could possibly go wrong?

I’ve been there, in this very position. I’ve lost count of the number of banks, for example, that I’ve spoken to as part of my microservices, cloud native or chaos engineering work where I’ve used variations of “We’re going to do chaos engineering…”, including:

“We should do chaos engineering…”

“Chaos engineering is really helpful, we should consider doing it…”

And my personal favourite, because it gets the strongest response:

“Netflix and others do this, we should look at chaos engineering too…”

The problem? The responses are not good. Far from good in fact. They tend to be downright hostile!

“We don’t want to do chaos!”, “Chaos! Why would we want THAT?!”

“Nooooo way, not again, we barely got through the chaos of 2008…”

Chaos engineering is a great discipline and I’ve seen, through our clients and customers, real benefits to the business. The trick is that although I really love the fact we now have a term for it, it’s been my experience that the term often rubs the business up the wrong way. Some other terms don’t help either… If you like fun and interesting conversations with your bosses, try one of the following:

“We’re going to inject failure into production…”

“We’re going to break things…”

“Umm, I already ran Chaos Monkey in production, and guess what…” — (I wish this one was just a joke…)

If you’re lucky, the response is merely a dubious and confused “Why…?”. If you’re unlucky it’s time to get back in touch with your recruiters…

I want to help you avoid this problem entirely. Here I’m going to explain how you can talk to the business about chaos engineering and get them excited about the benefits, and knowledgeable about the reasonable limitations, of this discipline. If you are one of the non-technical business folks, then hopefully this post alone will be enough to convince you that at least learning more about chaos engineering might be of value to your company’s IT strategy.

Let’s start with the first challenge to business people when it comes to chaos engineering, the term itself…

Tip №1: Don’t be afraid to drop the term

Yes, you heard that right; don’t be afraid to drop the term itself.

The problem with the term “chaos engineering” is that it wonderfully describes the challenge we’re working with (i.e. “The facilitation of experiments to uncover systemic weaknesses” in complex and chaotic systems) but at best it misses the point on the benefits of the approach, and at worst it scares the bejesus out of non-technical stakeholders!

“If we drop the term what can we say instead?”, I hear you say… Well, I’ve tried a lot of different options, but I can honestly say that I have found one phrase that seems to work really well, and first I’ll explain why.

First, I only use this term temporarily while people get their heads around the benefits. So it’s not a replacement for chaos engineering, just a convenient way of initially presenting what can be done.

Second, it’s usually familiar to traditional businesses with business-enabling, business-critical IT systems. In fact, if it’s not familiar then they may need it even more badly!

Third, it emphasises safety, not breaking things. The term talks about limiting the scope of things, which makes it intrinsically seem less of a big deal and so safer to approach.

So what is this magical term? I give it to you now and hope it is as much use to you as it has been to me when I’m facing a board of directors at a bank and trying to convince them that a small investment will pay big dividends. The term is:

“Limited scope, continuous disaster recovery”

“Limited scope” tells the business that this is a carefully considered, and constrained, activity that we’re going to do. We’re not going to be just running around breaking things randomly. Although some experiments will do that in a limited fashion, that’s the “how” and is not the point when you’re introducing the technique. We’re professionals and we’re going to carefully perform some activities that have value.

How often are we going to perform these activities? “Continuous” is the answer. “Continuous” emphasises that this will be an ongoing, potentially background and automated, activity. That just leaves the payback…

“Disaster recovery” helps non-technical, and sometimes even technical, people understand that this is about dealing with important circumstances that will occur in their system. This is the payback and the scope of what we are addressing. Small disasters will happen, and we’re going to preemptively explore how our whole sociotechnical system deals with them (see Tip № 4) so that it can improve.

If you’re introducing chaos engineering to a company, especially, I’ve found, to a financial institution, then I recommend considering the term “Limited scope, continuous disaster recovery” to get an early, positive response.

Tip №2: It’s about Confidence, not Breaking Things

Chaos engineering has a problem, and it’s called Chaos Monkey. Chaos Monkey is a victim of having a cool name and being a very successful tool that is often very misunderstood, and that misunderstanding can easily reflect back on chaos engineering too.

The misunderstanding I’m talking about is thinking that chaos engineering is only about breaking things. While exploring how production, or the entire sociotechnical system that surrounds it, reacts to breakages is certainly one useful technique in a chaos engineer’s toolbox of potential experiments, it’s far from the only one!

The tip here is to make sure you don’t confuse the method with the payback. Chaos engineering’s payback is in confidence in your system as you explore the second order ignorance, the unknown unknowns, inherent in any complex and chaotic system that may, and often does, fail.

Chaos engineering is about discovery, learning and improving a system in terms of how it reacts to the inevitable stresses of day-to-day usage and system evolution. It’s about looking at that improvement from the perspective of system, and UX, availability. That’s the payback. Breaking things, carefully, is just one technique to build that confidence. Talking of being careful…

Tip №3: Put Blast Radius and Learning Front and Centre

In Tip №1 I coined the term that I use for chaos engineering, at least initially to newcomers, as “Limited scope, continuous disaster recovery”. There’s a reason that I put “Limited scope” first.

There’s a well-founded fear that if someone gives approval to doing chaos engineering, the next thing that will happen is that Chaos Monkey will be destroying things randomly all over production. This has happened, with understandably negative effects, so it’s quite a reasonable fear for people to have.

The tip here is that you need to put that to rest. First, you need to emphasise the concept of “blast radius”, which is a limited scope.

When planning any chaos experiment you will need to carefully consider how far you need to go to learn something useful about the system. More often than not, this impact might be small and on a non-critical part of the system. Limiting the effects to just that area is the blast radius of the experiment. Knowing and communicating this blast radius, this limited scope, is crucial to being able to convince anyone that you know what you’re doing when running an experiment in production.

There’s another important side point here. If you already know that your system will not handle a particular experiment, such as the “Chaos Monkey all over production randomly” scenario I mentioned earlier, then don’t do the experiment! Limit the scope, and the blast radius, to zero! You have already discovered and learned, so you can jump straight to the third step of the discover, learn and improve loop: improving the system (or deciding whether you need to).

You need to tell people that you’re limiting the scope of the impact of your experiment as much as possible, while still retaining the opportunity to discover, learn and improve the system. This learning loop is, after all, a big payback of chaos engineering.
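For the technical folks preparing that conversation, the checks described in this tip can be written down as a minimal sketch. Everything here is hypothetical and purely illustrative: the `ExperimentPlan` fields, the `review_plan` function and the component names are my own assumptions, not part of Chaos Monkey, the Chaos Toolkit or any other tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentPlan:
    """Hypothetical description of one chaos experiment (illustrative fields only)."""
    title: str
    hypothesis: str              # what we expect the system to do
    targets: frozenset           # components the experiment is allowed to touch
    outcome_already_known: bool  # True if we already know the system will not cope


def review_plan(plan: ExperimentPlan, critical: frozenset, max_targets: int) -> str:
    """Return 'improve', 'reject' or 'run' for a proposed experiment."""
    # A known weakness needs no experiment: blast radius zero, go and improve.
    if plan.outcome_already_known:
        return "improve"
    # Keep the blast radius small and away from critical components.
    if len(plan.targets) > max_targets or plan.targets & critical:
        return "reject"
    return "run"


# Assumed example components, chosen only for illustration.
critical = frozenset({"payments", "ledger"})
small = ExperimentPlan(
    title="Slow recommendations",
    hypothesis="The home page still loads when recommendations are slow",
    targets=frozenset({"recommendations"}),
    outcome_already_known=False,
)
print(review_plan(small, critical, max_targets=1))
```

The point of a sketch like this is less the code than the conversation it supports: the blast radius is stated up front, critical parts of the system are explicitly out of bounds, and an experiment whose outcome you already know is never run.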

Tip №4: Not just about Infrastructure, or even just the Technical

It’s easy to pigeonhole chaos engineering as being “Just something the technical people do to infrastructure”. If all you’ve seen is the Chaos Monkey, then you’d be forgiven for making this mistake.

But make no bones about it, it is a mistake to limit the potential benefits of chaos engineering to just infrastructure, or even just the technical aspects of how software gets delivered and is run.

Chaos engineering addresses all aspects of the sociotechnical system of software development.

Chaos working at multiple levels of the sociotechnical system

It can be used to improve platforms, applications, people and practices, not just the infrastructure. This is one of the reasons I think that chaos engineering is the underpinning mindset and discipline that enables antifragile software development sociotechnical systems. Game Days and chaos engineering experiments in particular can be used to discover the weaknesses at all of these levels in the sociotechnical system, to learn from those weaknesses and then to improve the system holistically.

Tip №5: It doesn’t have to be a big, up-front investment

Are you worried that chaos engineering might be a new, big investment for your company? Worried that this might be yet another “transformation” to hit the bottom line?

No need to worry. The fact is, you are probably doing some aspects of the discipline already!

If you’re doing disaster recovery, or already have a team dedicated to watching SLAs, system availability in production, or even people who occasionally put the system through its paces in terms of how it might respond, then you might already be getting some of the benefits of chaos engineering. Many organisations start there and grow the awareness and skills of those people into the more proactive, and generally more impactful, practices of chaos engineering.

With a little self-directed learning, maybe some consultancy and training budget, you can start exploring weaknesses in your sociotechnical system through low-cost Game Days and then even begin automating your own chaos experiments.

Worried that you might need a big budget for the tools to do chaos engineering? Not at all. For example, the Chaos Toolkit is free and open source and aims to make it easy to get started. There’s also a growing catalogue of interactive, online training courses that the community is actively working on to aid initial adoption of the technique. When you’re ready for further power through commercial offerings the Chaos Toolkit can be used to drive great tools such as Gremlin as well.

Tip №6: Know the benefits, know the limitations; don’t over-promise!

Finally, it’s not necessarily all a bed of roses. As pointed out in the excellent recent article from Mathias Lafeldt, it’s all too tempting when adopting a new technique or discipline such as chaos engineering to get over-excited about the potential benefits of the technique.

But there are downsides and limitations, as explained in Mathias’s article, so make sure you know those and, please, don’t over-sell chaos engineering. Chaos engineering brings a wonderfully valuable mindset and overall discipline to software development and how the whole sociotechnical system can be improved, but please let’s not turn it into yet another annoying buzzword by diluting, and losing sight of, the real benefits of the approach in vague promises of the “moon on a stick”.

Russ Miles is one of the founders of the free and open source Chaos Toolkit project and CEO of ChaosIQ, which provides consultancy and training on all aspects of adopting and getting the real benefits of chaos engineering.

For more information on the Chaos Toolkit you can get started immediately with the interactive, online tutorials, read the project blog, grab the code, or even join the community on the project’s Slack.

Russ Miles

People, Team and Organizational Developer. Writer, psychologist, speaker and humanistic Head of Engineering. https://twitter.com/russmiles