Chaos Engineering Traps
On March 28th, 2019, at the 4th Chaos Community Day, I presented ‘Chaos Engineering Traps’, a talk about traps that are easy to fall into when practicing Chaos Engineering.
It was a great event, as always, and I had incredibly rich conversations with people who attended and spoke at the event. These community days are not recorded, but I enjoyed writing and presenting the talk I gave and I wanted to share the slides and contents of my talk in written form here.
Thank you to Johan Bergström for his excellent video on “Three traps in accident investigation”, which was a source of inspiration for this talk.
Note: The slides that are ‘traps’ are preceded by ‘Trap’; the other slides contain stories and lessons learned. This was a 21-minute talk, about topics that naturally deserve more attention than that.
Chaos Engineering borrows ideas and theories from several other fields, and we don’t talk about this enough. Fields such as: Resilience Engineering, simulation performance, learning science, Cognitive Systems Engineering, and what we refer to in Chaos Engineering as ‘gameday’ exercises.
I started off this talk by discussing one of the most famous chaos (simulation) experiments (no, not Chernobyl, though that is a great example):
Apollo 1 was the first crewed mission of the United States Apollo program, the well-known program responsible for landing the first humans on the Moon.
The launch rehearsal test for Apollo 1 was planned as the first crewed test of the Apollo command and service module, with launch set for February 21, 1967. Unfortunately, the mission never actually flew.
A fire in the Command Module during the launch rehearsal test on January 27, 1967 killed all three crew members: Command Pilot Gus Grissom, Senior Pilot Ed White, and Pilot Roger B. Chaffee. The fire also destroyed the Command Module. The name Apollo 1, chosen by the crew, was officially retired by NASA in memory of the crew three months later.
The ignition of the fire in the cabin was attributed in part to “vulnerable wiring carrying spacecraft power” and “vulnerable plumbing carrying a combustible and corrosive coolant”.
The rescue of the crew members was prevented by the plug door hatch (thought to keep them safe inside during flight), which could not be opened against the higher internal pressure of the cabin. (more on this later!)
Since the rocket was unfueled, the test was not considered hazardous, and it was thought that the ‘blast radius’ was effectively mitigated. As a result, the standard emergency preparedness for the test was deemed unnecessary and did not take place. Emergency teams such as fire, rescue, and medical were not notified of any risks in the test and were not in attendance.
This event was not only tragic, it embarrassed NASA. It was an experiment that went wrong and it’s a good reminder of the attention that should be placed on these experiments. This event ended up changing spacecraft design forever.
Not all of us in software are directly dealing with human lives, but we do all work at companies that are running businesses, ideally generating real revenue. Incidents can be detrimental and costly for a number of reasons. That fact is part of why we practice Chaos Engineering in software, and use it to build confidence in our ability to react to failure.
I’ll be referencing this case (Apollo 1) throughout the rest of these slides.
Here is the official Accident Report from the test: https://history.nasa.gov/Apollo204/appendices/AppendixD12-17.pdf
A side note: thank you to all of the folks that inspire and motivate me in Chaos Engineering, Human Factors, Resilience Engineering, and Site Reliability.
I’ve had the opportunity to practice Chaos Engineering at 2 different companies (Netflix and Jet.com) and now I’ve recently started at another company where I’ll be doing similar things with Chaos Engineering (Slack). I’m also currently getting a Master’s degree in Human Factors and Systems Safety at Lund University — which has given me a neat perspective on the human side of Chaos Engineering.
Through this unique lens of seeing Chaos Engineering implemented in different companies, looking at systems safety in other industries, and talking to colleagues throughout the software industry who have implemented Chaos Engineering at their companies, I wanted to discuss 8 specific ‘traps’ of Chaos Engineering that I commonly see.
I realize saying “8 traps of Chaos Engineering” reads like a 2011 Buzzfeed article, so I’m going to lean into it by saying “you won’t BELIEVE trap number 3”.
This is a very common and natural way of measuring the “ROI” of Chaos Engineering. Unfortunately, it really doesn’t tell you much about how you are doing.
A reference from one of my absolute favorite papers, The Error of Counting Errors, by Dr. Robert L. Wears.
In this quote, you can easily swap out the word ‘errors’ for ‘vulnerabilities found during Chaos Engineering’ and you have the same effect. Just because you’re counting the vulnerabilities your tool, your team, or your outsourced product found, doesn’t mean that Chaos Engineering is a success.
I’ll share another quote from Wears in this very same paper:
“Error counts are measures of ignorance, rather than risk”
— The Error of Counting Errors, by Dr. Robert L Wears
If you’re implementing Chaos Engineering in an organization, you may find yourself in a situation where leadership is pushing counting vulnerabilities found as an “OKR” that you have to achieve. I was once on a team that had a similar goal imposed on us.
I mentioned this goal to a colleague outside the team, as I was hesitant about it, and he responded with something along the lines of:
Counting vulnerabilities doesn’t really tell you much about the success of the chaos program. It can also lead you to optimizing for suboptimal paths and missing out on all of the other positive impacts Chaos Engineering can bring you.
This is a cheeky, illustrative book written by Darrell Huff in 1954, and it still holds up so well!
It illustrates errors when it comes to the interpretation of statistics, and how these errors may create incorrect conclusions that lead us to believe we are performing quite differently than the reality we are actually performing in. Buy it. Read it.
Resilience is the ability of a system to adapt so that it can sustain critical functioning in the face of changes and disturbances.
And when I say system, I don’t just mean the tooling. I mean the people too. Knowing how we react to situations in the face of these disturbances is part of Chaos Engineering, and it’s a part everyone should be aware of.
The humans, their attitudes, and the tradeoffs they’re making under pressure are all contributors to how the system runs and behaves. They are part of the system.
Chaos Engineering is not a means to an end for achieving resilience or “proving resilience”.
Let’s bridge Chaos Engineering and Resilience Engineering and discuss successes more as a result of the chaos experiments we run. We should focus on what we learned we are good at, not simply finding vulnerabilities and pointing out system risks. Understanding how things go right is what leads us to a deeper understanding of how things go wrong.
This is difficult to execute in practice, since getting time on everyone’s calendars is hard. However, it yields much more benefit for all team members to be in a room together, with a facilitator, discussing their various mental models of the system under experiment.
In the Apollo 1 launch experiment, rescue, medical assistance, and crew escape teams were not actively involved with the experiment preparation.
If you’re involved in facilitating chaos experiments, or analyzing their results, your goal is not to be a rule enforcer. Part of the success in Chaos Engineering is to build relationships (tools can aid this, but you can’t automate yourself out of this!) and to understand the tradeoffs that people make under pressure, how they develop expertise, how they distill it, etc.
We can easily get into this trap by being on “ops-like” teams. It’s not fun to chase vulnerabilities or bugs, and simply adding items to people’s backlogs and requiring completion doesn’t help. Instead, provide the context on the vulnerability and allow the experts (the service owners) to balance the tradeoffs between fixing it, understanding it better, or ignoring it in favor of something more beneficial to the business.
When I was at Netflix, I helped develop a tool we called “Monocle”. Monocle provided configuration optics on services and dependencies that were candidates for chaos experimentation (meaning the chaos tool supported the injection points for the dependencies listed). The goal of providing this insight to service owners was to give them key information about their services that might not be easily known or easy to determine otherwise. The tool gave users insights such as a service’s configured RPC timeouts, and whether those timeouts were too low or too high with respect to related signals: how long calls usually took, the configured Hystrix timeout, and the retry logic of the service. Monocle provided a lot more than this, but let’s focus on timeouts for this example.
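As an illustrative sketch of what a check like this might look like (this is not Monocle’s actual implementation; the field names and thresholds here are hypothetical), consider a function that flags timeout anti-patterns from a dependency’s configuration and observed latency:

```python
from dataclasses import dataclass

@dataclass
class DependencyConfig:
    """Hypothetical snapshot of one dependency's configuration and telemetry."""
    rpc_timeout_ms: float      # configured RPC timeout
    hystrix_timeout_ms: float  # configured Hystrix command timeout
    p99_latency_ms: float      # observed 99th-percentile call latency
    retries: int               # configured retry attempts

def timeout_warnings(cfg: DependencyConfig) -> list:
    """Flag timeout anti-patterns of the general flavor Monocle surfaced."""
    warnings = []
    # Timeout below observed tail latency: healthy calls get cut off.
    if cfg.rpc_timeout_ms < cfg.p99_latency_ms:
        warnings.append("RPC timeout is below observed p99 latency")
    # Retries that cannot finish before the wrapping Hystrix timeout
    # fires are wasted work that still consumes threads.
    if cfg.rpc_timeout_ms * (cfg.retries + 1) > cfg.hystrix_timeout_ms:
        warnings.append("retries cannot complete within the Hystrix timeout")
    return warnings

# A configuration that trips both checks:
risky = DependencyConfig(rpc_timeout_ms=200, hystrix_timeout_ms=300,
                         p99_latency_ms=250, retries=2)
print(timeout_warnings(risky))
```

Note that the sketch only surfaces warnings; whether to act on them is deliberately left to the service owner.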
The yellow and red attention icons in the image above indicate anti-patterns around a specific configuration for the service.
We received a lot of positive feedback on showing people anti-patterns involved in their services. However, each time I showed Monocle to a new team that owned a set of services, I tended to receive a response from service owners that had the following flavor:
“If you know these timeouts are dangerous or ‘incorrect’, why can’t you automatically create a pull request to make them a bit more ‘appropriate’?”
This is a completely reasonable response, and technically, we could have done that. But would it have been the right thing to do? Would the tool have set this timeout ‘well’? Automating timeout configuration would have certainly discounted a lot of human expertise in deciding what the timeout should be. How do you automate away expertise about the ways a service is used that you can’t simply put into an algorithm, such as whether it makes sense to do a fast retry after a search call fails, or, even more nuanced, when it makes sense for a fallback to kick in?
Just because Monocle could indicate that a configured timeout had potential issues doesn’t mean it should decide (or could do better at deciding) what is a “good” or “beneficial to the user” configuration. Ultimately, a tool can provide the context on what makes a timeout incorrect (we did this via tooltips on the attention icons); however, you get into a dangerous game when an algorithm is determining what the customer experience should be.
“…an operator will only be able to generate successful new strategies for unusual situations if he has an adequate knowledge of the process.”
“Long term knowledge develops only through use and feedback about its effectiveness.”
— Lisanne Bainbridge, The Ironies of Automation (1983)
If we automate the timeout configuration of services away, who would know how to fix it when it fails? The chaos engineer wouldn’t be an expert in the service itself; they would just be an expert in the algorithm by which the timeout was decided. The service owner wouldn’t be an expert in how the timeout was calculated for their service; they’d just know that things are not operating as expected. This would inevitably lead to uncertainty during an incident.
I think the best thing to do with vulnerabilities you find with your chaos tools is to bring context to the service owner, not automate fixes, or chase down vulnerabilities — which is exactly what Monocle did.
The Apollo 1 crew members expressed their concerns about their spacecraft’s problems by presenting this parody of their crew portrait to ASPO manager Joseph Shea on August 19, 1966. Joseph Shea was the one making the safety calls, but he wasn’t the expert. The picture had written on it:
“It isn’t that we don’t trust you, Joe, but this time we’ve decided to go over your head.”
These two items (practicing manual gamedays and doing chaos experiments in non-production environments) are not only immensely valuable, they are also completely necessary before experimenting in production or automating chaos experiments. I’ve seen a lot of good come out of both of these practices. It’s true that there is no data like production data, but please, please, practice gamedays and run sandbox experiments before moving to production or automating experiments.
When I was at Netflix, we got to a point where we wanted to automate the configuration and running of chaos experiments. However, configuring these experiments automatically was a long journey, and we had to gather a lot of data on service configurations from several different sources. None of this data could be found in the same places, despite the fact that all of the information was related. Once we gathered all of this information and aggregated it, we thought it made sense to put it up in a UI, as service owners would probably find it valuable. This was Monocle.
Remember the Monocle UI I showed earlier? That was an immensely valuable artifact of trying to configure and automate experiments in production, but producing this UI wasn’t the main goal. Building the Monocle UI was a surprise, in that we didn’t know we were going to build it when we originally decided to automate experiment configuration. Which brings me to my next point:
We learned so much more by going through this process than the results of actual automation showed us (even though those results were good too!)
The process of creating the experiment and sharing learnings are arguably the highest returns on investment of Chaos Engineering.
Once we were able to show the vulnerabilities we found in Monocle to service owners — the value we got out of creating the experiments was greater than the value we got from actually running them (some of them we didn’t even need to run; we could already see the vulnerabilities).
I didn’t have time in this ‘traps’ talk to go over all of these phases, but I plan to focus on the benefits of each of these phases in my June talk at Velocity.
In this ‘traps’ talk, I focused on some of the benefits of the ‘before’ phase of Chaos Engineering:
This is a story I made up about a piece of infrastructure I admire dearly, though have seen many incorrect assumptions about in the wild.
Incidents are an opportunity for us to sit down and see how person A’s (Josie) mental model of how the system worked is different than person B’s (Javi) mental model of how the system worked. Incidents give us this opportunity because, at some level, they disprove what we thought about how the system operated and handled risk.
For example, Josie, a service owner, thought that she could expect 100% uptime from Consul, and if not, that Consul could handle it. Whereas Javi (the architect of Consul at $COMPANY) assumed that users of Consul would never even make that assumption — especially being someone who came from the world of ops. However, Josie and Javi never explicitly had this conversation about expectations. The assumptions made by both parties were completely reasonable given the context each individually had about how Consul at $COMPANY worked.
Javi, given his lengthy experience as an operator, assumed that no one would expect 100% uptime from a system and retries would be configured by services using Consul — he never put this in his documentation, nor set up things in Consul that would mitigate the risk if someone assumed 100% uptime.
Josie had been using Consul for various things for a while now and loved it — it saved her a lot of time and energy storing important key-value pairs, and gave her the flexibility to quickly change them if she needed. It did exactly what she wanted. Until it didn’t. There was a massive outage, partially as a result of this mismatch of expectations and the lack of opportunity to discuss inherent assumptions. That conversation didn’t happen because, well, how do you discuss assumptions that people assume are obvious? You can’t, really, until you have an incident.
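To make the mismatch concrete: the retry behavior Javi assumed callers would have (but never documented) could be as small as a wrapper like this. This is a hedged sketch, with a generic fetch function standing in for a Consul KV read (the real client and its error types would differ):

```python
import random
import time

def read_with_retries(fetch, attempts=3, base_delay=0.1):
    """Retry a flaky read with exponential backoff and jitter.

    `fetch` is a stand-in for a Consul KV read; callers who assume
    100% uptime from Consul would skip this wrapper entirely.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Exponential backoff with jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Whether this wrapper should live in every caller, in shared client code, or be made unnecessary by server-side mitigations is exactly the kind of assumption that went undiscussed in the story above.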
This is where chaos experiments come in. They serve as a chance for us to find vulnerabilities in our systems, yes, but even more importantly — they serve as an opportunity for us to see how our mental model and assumptions of what happens when $X fails differ from our teammates’. You may find people are more open to discussing these differences during chaos experiment planning than they are during an incident, because, well, an incident hasn’t happened.
The designer/facilitator of the experiment should (ideally) be someone from outside the team to navigate and understand the biases that team members may carry about the system — they facilitate the conversation, ask questions, state their opinion, and aggregate hypotheses about the system. They seek feedback from folks on the team and use that feedback to derive a game-plan.
Though this process can be time intensive, it is important (at least in the design phase of the chaos experiment) to gather the mental models of different teammates and develop a hypothesis. This hypothesis should not come from the designer/facilitator. The designer/facilitator should interview teammates in a group setting (roughly 30 minutes to an hour) and give each teammate a chance to talk about their understanding of the system and what happens when it fails. You get the most value out of Chaos Engineering when multiple people, from the different teams involved in the experiment, are in a room together discussing their expectations of the system. Distilling the explicit information about what you learned during this phase is equally as important as (if not more important than) executing the experiment.
Figuring out what to experiment on is one of the biggest returns on investment of Chaos Engineering.
The design phase of a chaos experiment is an exercise in getting teammates to discuss their differing mental models of the system in question. A secondary goal is to develop strategies for how to design the experiment, prior to executing it, in a structured way. Preparing in a structured setting helps you avoid debating the hypothesis and the design of the experiment while it is running.
John Allspaw has written extensively on mental model recalibration. He recently wrote about this with regard to chaos experiments in an InfoQ e-mag on Chaos Engineering.
Back to the Apollo 1 experiment. The plug door hatch, which was meant to keep the astronauts safely inside during spaceflight, was too hard to open due to internal pressure. It’s ironic that something designed to keep you safe led to a catastrophic failure — this is not uncommon. Dr. Richard Cook refers to these types of anomalies as “vulnerable defenses”.
It’s important to understand the defenses we put in place and to test their vulnerabilities as well. The vulnerabilities of the defenses are often difficult to see. Chaos Engineering can help shed light on the vulnerabilities of defenses.
Do you know what happens when your defenses fail? How?
I frequently see people erring on the side of just making something fail and figuring it out as they go along. This is a very common trap to get into and get excited about.
Making things fail is the easy part; understanding how we react to failure is the hard part, and it’s important to plan for this ahead of time. And by ‘react’ I don’t just mean the technical ways the system reacts, I mean how people react as well. What dashboards do they look at? What logs do they immediately check? How do they know to go there?
Gus Grissom was part of the Apollo 1 crew. He was also in the cabin for the launch experiment that ended in tragedy, one month after he gave this interview.
This is not a plug and play model — any tool you have in place, whether it be purchased or home grown should be facilitated by “experts in the specific system” they are experimenting on.
You’ll be shooting in the dark without understanding things about your system such as traces, incident patterns, etc. prior to running chaos experiments.
I recently joined a new organization. Part of my mandate is to do Chaos Engineering. Am I going to start creating chaos everywhere right now? Probably not this time. I’ve learned my lesson there after having ‘implementing Chaos Engineering’ be one of the very first projects I did at another organization:
Though I gained a lot from this experience, and we did learn things about our system, I’d recommend going about this a different way.
Principlesofchaos.org states: “Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.”
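As a minimal illustration of that principle (the metric and tolerance here are invented for the example), steady state can be operationalized as a tolerance band around a baseline metric, with the experiment aborted when the observed value leaves the band:

```python
def within_steady_state(baseline, observed, tolerance=0.02):
    """Return True if an observed metric (e.g. request success rate)
    stays within a tolerance band around its baseline -- one simple
    way to make 'steady state' a measurable output of the system."""
    return abs(baseline - observed) <= tolerance

# During an experiment, poll the metric and stop injecting failure
# as soon as the steady-state hypothesis is violated.
assert within_steady_state(0.999, 0.995)       # small dip: still steady
assert not within_steady_state(0.999, 0.950)   # large dip: abort
```

Real systems would define steady state over several metrics and time windows, but the shape is the same: a measurable output, a hypothesis about it, and a halt condition.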
There is no prescriptive formula for doing Chaos Engineering. Which is why this slide is out of order. There is no exact template you should use when conducting gamedays or creating chaos experiments.
There are pieces of advice that should be followed when creating/running/dissecting chaos experiments, and a science that can be applied, but you can’t have a prescriptive formula when talking about the unpredictable.
I hope you come away from this thinking about the perspective and approach, not the prescription, and eschew the idea that there is a one size fits all ‘chaos’ you can apply.
The more we learn about Chaos Engineering, and the more we talk about it, the better we will get at it. If we can be aware of these 8 traps, we can be better about using alternative strategies instead. Remember, it’s about sharing how we cope with uncertainty, not just how we don’t respond to it ‘appropriately’.
Now back to my 2011 Buzzfeed article: