Chaos Engineering: Necessary, but Not Sufficient

Kolton Andrus
Gremlin
Nov 20, 2023 · 8 min read

I love the ideas behind Chaos Engineering. I’ve been championing them for the last fourteen years. So it pains me to write this article, but I feel like the truth needs to be out there: Chaos Engineering by itself isn’t enough.

Don’t get me wrong. Fault Injection is a valuable technique. It’s one of the many tools we need as engineers to build and operate complex distributed systems. Knowing how a system fails and where the sharp edges are helps us practice and adapt to real-world situations. By performing this testing before a failure occurs, we can anticipate and address many potential issues before they become problems.

Wait, Fault Injection? Don’t you mean Chaos Engineering? Well, that leads us to one of the limitations of Chaos Engineering.

Chaos Engineering was created for the cloud

The name “Chaos Engineering” was originally chosen because it was exciting — and it fit Netflix’s approach back in the early 2010s.

Back then, Netflix made a big push to move everything into the cloud. One aspect of the cloud (especially in the early days) was that your servers were ephemeral, i.e. they could be replaced at any time. Before that, a lot of engineers relied on having the same server: storing state locally, writing to disk, and so on. This caused a lot of surprise outages when a host disappeared and its replacement didn’t carry the state the original held.

In a brilliant move by Netflix leadership, they decided to make this the engineer’s problem by introducing Chaos Monkey. If a host might disappear at any time, let’s ensure that it happens in staging and in production! This forced engineers to adapt to the new environment, stubbing their toes while ultimately building more resilient code.

We can see why Chaos Monkey and, hence, Chaos Engineering are a good fit for this use case. Netflix’s management decided to introduce chaos to mimic the real world and force their engineers to build for the chaos in their systems.

But in doing so, they also introduced another concept. And that idea is still essential to reliability efforts.

It’s not about chaos — it’s about reliability

A lot of teams don’t want to introduce chaos into their environments. It scares them. They know things will fail — after all, they’re already failing.

To many, Chaos Engineering is like a Western movie — a lot of shooting from the hip and seeing where things lie when the dust settles. And so those of us who advocate for this practice start from a position of weakness. We’re viewed as agents of mayhem, out to cause pain and hoping to stumble upon something valuable along the way, instead of smart engineers making calculated decisions and using precision tools to verify the system.

But what Netflix leadership also provided when they introduced Chaos Monkey was permission to work on this problem. By showing engineers it was an important investment of time, they allowed engineers to think about and work on reliability problems, not just shipping features.

And Netflix wasn’t alone. When I was on the reliability team at Amazon, our status emails went straight to Jeff Bezos. The Amazon team had figured out the dollar cost of reliability outages, and made it a priority with the CEO’s attention. Was it intimidating? Hell, yes. But it also meant that when we found something wrong, we could get it fixed.

Most companies haven’t provided the signal that reliability is important to the business and to their customers. Therefore, using Chaos Engineering to cause failures without the means to correct them is just reckless, inflicting harm without any benefit.

Teams just beginning this journey don’t want chaos. They want testing. They want to systematically verify their understanding of their systems. They want things to behave as they expect, and they want to know if those expectations are invalid.

We don’t need a lot of chaos to facilitate this. Instead, we should provide these teams with a set of tools to inject failure. We should teach them the potential side effects of different experiments, how to scope each experiment to the right impact, and how to interpret the results. All of this is about mitigating risk, not introducing chaos.
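To make that concrete, here is a minimal sketch (in Python, with hypothetical names and thresholds) of what a scoped experiment can look like: a small blast radius, a clear hypothesis, and an abort condition that halts the experiment the moment the impact exceeds what was agreed up front. It illustrates the shape of the practice, not any particular tool’s API.

```python
import random

# Hypothetical scoped experiment: inject a failure into a small fraction
# of calls to one dependency, check that our fallback absorbs it, and
# abort if the user-visible error rate exceeds the agreed threshold.

BLAST_RADIUS = 0.05      # fraction of calls that receive an injected fault
ABORT_ERROR_RATE = 0.01  # halt the experiment past this error rate

def fetch_recommendations():
    """Stand-in for the real downstream call."""
    if random.random() < BLAST_RADIUS:
        raise TimeoutError("injected fault")
    return ["personalized", "results"]

def handle_request():
    """The code under test: it should degrade gracefully, not fail."""
    try:
        return fetch_recommendations()
    except TimeoutError:
        return ["popular", "defaults"]  # fallback path we want to verify

def run_experiment(requests=1000):
    errors = 0
    for i in range(1, requests + 1):
        try:
            handle_request()
        except Exception:
            errors += 1  # fallback failed: this is user-visible impact
        if errors / i > ABORT_ERROR_RATE:
            print(f"Abort after {i} requests: impact exceeded threshold")
            return False
    print(f"Hypothesis held: {errors} user-visible errors in {requests} requests")
    return True

if __name__ == "__main__":
    run_experiment()
```

The abort check is the important part: it is what turns “introducing failure” into a calculated, bounded experiment.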

Chaos Engineering is one tool in your kit

Chaos Engineering is an incredibly effective practice for finding specific risks and verifying specific behaviors. That’s great, but it can make it hard to see the big picture. It’s a wrench in the toolbox when we are discussing how to build a semi-truck.

There’s nothing better for answering specific questions like, “What happens when foo dependency fails?” or “Can we handle an extra 100ms of latency at our proxy layer?” But it doesn’t answer the question that the business really cares about: “Is our system reliable and will it work when we need it most?” Or the questions end customers care about, like, “Can I buy my groceries?” or “Can I pay this bill?” or “Will my flight be on time?”
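As a rough illustration of the latency question, an experiment can be as simple as adding a fixed delay in front of a dependency call and checking whether requests still finish inside their budget. The numbers and function names below are hypothetical, and real tooling would inject the delay at the network or proxy layer rather than in application code.

```python
import time

INJECTED_LATENCY_S = 0.100   # the "extra 100ms" we are asking about
REQUEST_BUDGET_S = 0.250     # hypothetical end-to-end latency budget

def call_proxy():
    """Stand-in for a call through the proxy layer."""
    time.sleep(0.05)  # pretend baseline latency
    return "ok"

def call_proxy_with_injected_latency():
    time.sleep(INJECTED_LATENCY_S)  # the injected delay
    return call_proxy()

def handle_request():
    start = time.monotonic()
    result = call_proxy_with_injected_latency()
    return result, time.monotonic() - start

if __name__ == "__main__":
    samples = [handle_request()[1] for _ in range(20)]
    worst = max(samples)
    print(f"worst case: {worst * 1000:.0f}ms, budget: {REQUEST_BUDGET_S * 1000:.0f}ms")
    print("PASS" if worst <= REQUEST_BUDGET_S else "FAIL: budget exceeded")
```

That answers the narrow question precisely, which is exactly the strength and the limitation being described here.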

Truth be told, the problems of reliability have grown beyond the scope of Chaos Engineering alone. And that’s a good thing! But it does mean we need to expand our approach to encompass everything reliability means to a business.

Reliability programs are missing structure

So how should we think about and evaluate the reliability of our systems? Fortunately, we don’t have to reinvent the wheel and can look at other industries for inspiration.

Security

Reliability risks and security vulnerabilities have a lot in common. A security vulnerability doesn’t mean you’ve been compromised and there’s an active breach — just the potential for one. Same with reliability risks. Sure, everything’s working now, but that risk could cause an outage at any moment.

So we can look at a couple best practices for security:

  • Vulnerability analysis — Are we scanning for known vulnerabilities? Are we validating that we remain resilient to past vulnerabilities?
  • Pen Testing — Are we checking what happens when unexpected inputs are provided to our systems? (We tend to think about this in terms of fuzz testing, but what about unexpected load, i.e. a DDoS? A sketch of the fuzzing idea follows below.)
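Here is a tiny illustration of the unexpected-input idea, assuming a hypothetical input handler: feed it random junk and treat anything other than a clean rejection as a finding.

```python
import random
import string

def parse_order_id(raw: str) -> int:
    """Hypothetical input handler: should reject bad input, never crash."""
    cleaned = raw.strip()
    if not cleaned.isdigit() or len(cleaned) > 12:
        raise ValueError("invalid order id")
    return int(cleaned)

def fuzz(iterations=10_000):
    """Throw random strings at the handler; only ValueError is acceptable."""
    alphabet = string.printable
    for _ in range(iterations):
        junk = "".join(random.choice(alphabet) for _ in range(random.randint(0, 64)))
        try:
            parse_order_id(junk)
        except ValueError:
            pass  # expected rejection of bad input
        # Any other exception propagates and fails the fuzz run.

if __name__ == "__main__":
    fuzz()
    print("no unexpected crashes")
```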

QA

When you think of testing, where do you go? Probably QA testing. Let’s extend a couple of those ideas to testing for reliability of the system, not just the code you’re about to ship:

  • Code Coverage — Have we tested all the things? If we don’t have time to test everything, have we tested the most important interactions?
  • Integration/Smoke testing — Are we testing the boundaries between our services? Are we validating our contracts? (In this case, SLAs and error conditions; a sketch follows below.)
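As a sketch of what that boundary testing can look like, the checks below hit a hypothetical dependency and verify both the happy path and the error path against the contract we rely on. The endpoint, field names, status codes, and latency budget are all made up for illustration.

```python
import requests  # third-party HTTP client, assumed available

BASE_URL = "https://payments.internal.example"  # hypothetical dependency

def test_happy_path_meets_contract():
    resp = requests.get(f"{BASE_URL}/v1/balance/42", timeout=1.0)
    assert resp.status_code == 200
    body = resp.json()
    assert "amount" in body and "currency" in body  # fields we depend on

def test_error_path_is_handleable():
    # The contract we care about under failure: a well-formed error body
    # and a status code our retry/fallback logic actually understands.
    resp = requests.get(f"{BASE_URL}/v1/balance/does-not-exist", timeout=1.0)
    assert resp.status_code in (404, 422)
    assert "error" in resp.json()

def test_dependency_answers_within_sla():
    # A crude SLA check: the dependency responds inside its agreed budget.
    resp = requests.get(f"{BASE_URL}/v1/balance/42", timeout=0.5)
    assert resp.elapsed.total_seconds() < 0.3
```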

When it comes to reliability, vulnerability analysis and code coverage are similar to verification testing (like the pre-built Reliability Management Test Suite at Gremlin where you check your systems and code for known failure modes). On the other hand, pen testing and integration/smoke testing have more in common with exploratory testing where you look for the unknown failure modes — a specialty of Chaos Engineering.

These testing capabilities then become part of a good reliability program: a set of processes meant to inform the business about current risks, past performance, and the future mitigations needed to keep providing a high-quality experience.

But for this to be successful, it needs teeth. It’s that old idea of trust, but verify. We have a lot of trust and very little verification in this part of our industry. Teams are mostly left on their own to deal with reliability issues, and the most important matter at hand is to triage and restore service, regardless of how much duct tape and baling wire is used.

SREs can’t do it all by themselves

Over the past few years, the hip solution has been to throw bodies at the reliability problem in the form of SRE/Ops teams. This approach has a few shortcomings.

First, it scales with headcount, so it isn’t an efficient software answer. Second, SREs can only be experts in so many systems, so they’re often inheriting or operating on ‘black boxes.’ As a result, they have limited control. Sometimes they’re able to influence engineering teams to change their practices (e.g. more testing in the CI/CD pipeline, better error handling, circuit breakers, backoffs, etc.), and sometimes they’re left without the leverage or influence to ask those teams to make any code changes.

This is both a misalignment of concerns and control, leaving most of these teams as a glorified rebranding of the sysadmin role of days past.

We’re actually running into a deficiency with the SRE model. Don’t get me wrong: SREs are good at their job and have done wonders with what they’ve been given. But that’s the problem. They haven’t been given the organizational tools and authority to make a difference at scale.

What’s really missing is top-down attention directly from the CTO (or the most senior technical leader). Too often leadership hires a bunch of SREs, checks off the box, then forgets about reliability until there’s an outage. Under those conditions, how can an SRE get engineering time to actually fix the problems they find?

When a CTO makes reliability a priority (and lets everyone know it by holding people accountable), it tells the business that software development cares deeply about the quality of the software that’s written. More importantly, it sets the expectation of highly reliable software. So when a reliability leader surfaces a risk, engineering knows that they should work on it.

Because the CTO is watching.

When I first joined Amazon, I remember clearly being told, “Every great engineer owns the performance, availability and efficiency of their code. That isn’t some other team’s problem. It’s yours.” This told me early on that, yes, I should account for these problems. If I wrote poor quality code and it failed, I’d be held accountable. And it’s stayed with me to this day.

So yes, you need someone like SREs to own the testing and processes, but you also need engineers and team leaders to be accountable for the reliability of their systems. And you need leadership to set the standards for reliability so time and resources are actually spent remediating reliability risks. Once the standard has been set, leadership needs tools that provide visibility into those risks, test coverage, and the overall reliability of the system.

Chaos Engineering is necessary, but not sufficient

Chaos Engineering was designed to uncover risk. And it’s still very good at that. But it isn’t a process for tracking and reporting on risk across a system.

You need both to improve reliability.

The engineer needs a tool and methodology to run experiments, verify behavior and uncover risks. At the same time, the CTO needs the metrics that show them how many risks exist in each deployed system, how many tests were run last week, and whether we’ve fixed what we previously found.

Without pairing tools with insight, we fall back to hope as a strategy. Without visibility, without accountability, without recognition for the good work done, teams won’t invest the time. And without doing the work, things will break, customers will be disappointed and ultimately, we’ll be back to discussing how we can get out of firefighting mode.

Which is the whole reason Chaos Engineering was created in the first place.
