Birth of Chaos
Editor’s Note: Authors Casey Rosenthal and Nora Jones are two of the field’s most prominent figures, having pioneered the discipline of chaos engineering while working together at Netflix. Casey is CEO and co-founder of Verica; formerly the Engineering Manager of the Chaos Engineering Team at Netflix. Nora Jones is Co-founder and CEO of Jeli.
In this excerpt from their book, Chaos Engineering, the authors give a brief review of the origins of this discipline, how and why it emerged when it did, the elements of the culture of trust that made it possible, the tools essential to its success, and how a community grew up around it.
Chaos Engineering is still a relatively new discipline within software development. This introduction lays out the history, from the humble beginnings of the practice through to the current epoch of all major industries adopting the practice in some form. Over the past three years, the question has changed from “Should we do Chaos Engineering?” to “What’s the best way to get started doing Chaos Engineering?”
The history of our nascent discipline explains how we transitioned from the first to the second question just posed. We don’t want to merely tell a story of dates and motions to get the facts straight. We want to tell the story of how this emerged, so that you understand why it emerged the way that it did, and how you can learn from that path in order to get the most out of the practice.
The story begins at Netflix, where the authors of this book, Casey Rosenthal and Nora Jones, both worked when the Chaos Team defined and evangelized Chaos Engineering.¹ Netflix found real business value in the practice, and when others saw that, a community grew up around the discipline to spread it throughout tech.
Management Principles as Code
Beginning in 2008, Netflix made a very public display² of moving from the datacenter to the cloud. In August of that year, a major database corruption event in the datacenter left Netflix unable to ship DVDs for three days. This was before streaming video was ubiquitous; DVD delivery was the bulk of their business.
The thinking at the time was that the datacenter locked them into an architecture of single points of failure, like large databases and vertically scaled components. Moving to the cloud would necessitate horizontally scaled components, which would reduce those single points of failure.
Things didn’t go exactly as planned. For one thing, it took eight years to fully extract themselves from the datacenter. More relevant to our interests, the move to horizontally scaled cloud deployment practices did not coincide with the boost to uptime of the streaming service that they expected.³
To explain this, we have to recall that in 2008, Amazon Web Services (AWS) was considerably less mature than it is now. Cloud computing was not yet a commodity, and not nearly the no-brainer, default deployment option that we have today. Cloud service back then did hold a lot of promise, but it came with a caveat: instances⁴ would occasionally blink out of existence with no warning. This particular form of failure event was rare in a datacenter, where big, powerful machines were well tended and the idiosyncrasies of specific machines were often well understood. In a cloud environment, where that same amount of power was provided by many smaller machines running on commodity hardware, it was an unfortunately common occurrence.
Methods of building systems that are resilient to this form of failure event were well known. Perhaps half a dozen common practices could have been listed that help a system automatically survive one of its constituent components failing unexpectedly: redundant nodes in a cluster, limiting the fault domain by increasing the number of nodes and reducing the relative power of each, deploying redundancies in different geographies, autoscaling and automating service discovery, and so on. The specific means for making a system robust enough to handle instances disappearing was not important. It might even be different depending on the context of the system. The important thing was that it had to be done, because the streaming service was facing availability deficits due to the high frequency of instance instability events. In a way, Netflix had simply multiplied the single-point-of-failure effect.
Netflix wasn’t like other software companies. It proactively promoted cultural principles that are derived from a unique management philosophy outlined in a culture deck. This manifested in several practices that had a strong bearing on how Netflix solved the availability deficit. For example:
- Netflix only hired senior engineers who had prior experience in the role for which they were hired.
- They gave all engineers full freedom to do anything necessary to satisfy the job, concomitant with the responsibility of any consequences associated with those decisions.
- Crucially, Netflix trusted the people doing the work to decide how the work got done.
- Management didn’t tell individual contributors (ICs) what to do; instead, they made sure that ICs understood the problems that needed to be solved. ICs then told management how they planned to solve those problems, and then they worked to solve them.
- High-performance teams are highly aligned and loosely coupled: if everyone shares the same goals across teams, less effort needs to be put into process, formal communication, or task management.
This interesting dynamic is part of what contributed to Netflix’s high-performance culture, and it had an interesting consequence in the development of Chaos Engineering. Because management’s job wasn’t to tell ICs what to do, there was essentially no mechanism at Netflix for any one person or team or group to tell the rest of the engineers how to write their code. Even though a half dozen common patterns for writing services robust enough to handle vanishing instances could have been written down, there was no way to send an edict to the entire engineering organization demanding that everyone follow those instructions.
Netflix had to find another way.
Chaos Monkey Is Born
Many things were tried, but one thing worked and stuck around: Chaos Monkey. This very simple app would go through a list of clusters, pick one instance at random from each cluster, and at some point during business hours, turn it off without warning. It would do this every workday.
It sounds cruel, but the purpose wasn’t to upset anyone. Operators knew that this type of failure — vanishing instances — was going to happen to every cluster at some point anyway. Chaos Monkey gave them a way to proactively test everyone’s resilience to that failure, and to do it during business hours, so that people could respond to any potential fallout when they had the resources to do so, rather than at 3 a.m. when pagers typically go off. Running it every workday also acted somewhat like a regression test, ensuring that they would not drift back into this failure mode down the line.
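The behavior described above is simple enough to sketch in a few lines. The following is a hypothetical illustration of the core loop, not the actual Chaos Monkey implementation (which Netflix open sourced separately); the function names and the `terminate` callback are our own inventions:

```python
import random

def pick_victims(clusters):
    """For each cluster, pick one instance at random.

    `clusters` maps a cluster name to its list of instance IDs;
    clusters with no instances are skipped.
    """
    return {
        name: random.choice(instances)
        for name, instances in clusters.items()
        if instances
    }

def unleash_monkey(clusters, terminate):
    """Terminate one randomly chosen instance per cluster.

    `terminate` is a callback that actually turns off an instance
    (in a real system, a cloud provider API call).
    """
    for cluster, victim in pick_victims(clusters).items():
        terminate(cluster, victim)
```

In a real deployment this would run on a schedule during business hours, with the termination call going to the cloud provider rather than an in-process callback.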
Netflix lore says that this was not instantly popular. There was a short period of time when ICs grumbled about Chaos Monkey. But it seemed to work, so more and more teams eventually adopted it.
One way that we can think of this application is that it took the pain of the problem at hand — vanishing instances affected service availability — and brought that pain to the forefront for every engineer. Once that problem was right in front of them, engineers did what they did best: they solved the problem.
In fact, if Chaos Monkey was bringing their service down every day, then they couldn’t get any work done until they solved this problem. It didn’t matter how they solved it. Maybe they added redundancy, maybe scaling automation, maybe architectural design patterns. That didn’t matter. What did matter is that the problem got solved somehow, quickly, and with immediately appreciable results.
This reinforces the “highly aligned, loosely coupled” tenet of Netflix’s culture. Chaos Monkey forced everyone to be highly aligned toward the goal of being robust enough to handle vanishing instances, but loosely coupled as to how to solve this particular problem, since it didn’t suggest a solution.
Chaos Monkey is a management principle instantiated in running code. The concept behind it seemed unique and a bit wonky, so Netflix blogged about it. Chaos Monkey became a popular open source project, and even a recruiting tool that introduced Netflix to potential candidates as a creative software engineering culture, not just an entertainment company. In short, Chaos Monkey was designated a success. This set a precedent and helped establish this form of risk-taking/creative solutioning as a part of Netflix’s cultural identity.
Fast-forward to December 24, 2012, Christmas Eve.⁵ AWS suffered a rolling outage of elastic load balancers (ELBs). These components connect requests and route traffic to the compute instances where services are deployed. As the ELBs went down, additional requests couldn’t be served. Since Netflix’s control plane ran on AWS, customers were not able to choose videos and start streaming them.
The timing was terrible. On Christmas Eve, Netflix should have been taking center stage, as early adopters showed their extended family how easy it was to stream actual movies over the internet. Instead, families and relatives were forced to speak to each other without the comforting distraction of Netflix’s content library.
Inside Netflix, this hurt. Not only was it a hit to the public image of the company and to engineering pride, but no one enjoyed being dragged out of the Christmas Eve holiday by a paging alert in order to watch AWS stumble through the remediation process.
Chaos Monkey had been successfully deployed to solve the problem of vanishing instances. That worked on a small scale. Could something similar be built to solve the problem of vanishing regions? Would it work on a very, very large scale?
Every interaction that a customer’s device has with the Netflix streaming service is conducted through the control plane. This is the functionality deployed on AWS. Once a video starts streaming, the data for the video itself is served from Netflix’s private network, which is by far the largest content delivery network (CDN) in the world.
The Christmas Eve outage put renewed attention internally on building an active–active solution to serving traffic for the control plane. In theory, the traffic for customers in the Western hemisphere would be split between two AWS regions, one on each coast. If either region failed, infrastructure would be built to scale up the other region and move all of the requests over there.
This capability touched every aspect of the streaming service. There is a propagation delay between coasts, so some services would have to be modified to allow for eventual consistency, to adopt new state-sharing strategies, and so on. This was no easy technical task.
And again, because of Netflix’s structure, there is no mechanism to mandate that all engineers conform to some centralized, verified solution that would certifiably handle a regional failure. Instead, a team backed by support from upper management coordinated the effort among the various affected teams.
To ensure that all of these teams had their services up to the task, an activity was created to take a region offline. Well, AWS wouldn’t allow Netflix to take a region offline (something about having other customers in the region) so instead this was simulated. The activity was labeled “Chaos Kong.”
The first several times Chaos Kong was initiated, it was a white-knuckle affair with a “war room” assembled to monitor all aspects of the streaming service, and it lasted hours. For months, Chaos Kong was aborted before moving all of the traffic out of one region, because issues were identified and handed back to service owners to fix. Eventually the activity was stabilized and formalized as a responsibility of the Traffic Engineering Team. Chaos Kongs were routinely conducted to verify that Netflix had a plan of action in case a single region went down.
On many occasions, whether due to issues on Netflix’s side or to issues with AWS, a single region did in fact suffer significant downtime. In these cases the regional failover mechanism exercised in Chaos Kong was put into effect. The benefits of the investment were clear.⁶
The downside of the regional failover process was that it took about 50 minutes to complete in the best-case scenario because of the complexity of the manual interpretation and intervention involved. In part by increasing the frequency of Chaos Kong, which in turn had an impact on the internal expectations regarding regional failover within the engineering organization, the Traffic Engineering Team was able to launch a new project that ultimately brought the failover process down to just six minutes.⁷
This brings us to about 2015. Netflix had Chaos Monkey and Chaos Kong, working on the small scale of vanishing instances and the large scale of vanishing regions, respectively. Both were supported by the engineering culture and made demonstrable contributions to the availability of the service at this point.
Formalizing the Discipline
Bruce Wong created a Chaos Engineering Team at Netflix in early 2015 and left the task of developing a charter and roadmap to Casey Rosenthal. Not quite sure what he had gotten himself into (he was originally hired to manage the Traffic Engineering Team, which he continued to do simultaneously with the Chaos Engineering Team), Casey went around Netflix asking what people thought Chaos Engineering was.
The answer was usually something along the lines of, “Chaos Engineering is when we break things in production on purpose.” This sounded cool, and it might make a great addition to a LinkedIn profile summary, but it wasn’t very helpful. Anyone at Netflix with access to a terminal had the means to break things in production, and chances were good that doing so wouldn’t return any value to the company.
Casey sat down with his teams to formally define Chaos Engineering. They specifically wanted clarity on:
- What is the definition of Chaos Engineering?
- What is the point of it?
- How do I know when I’m doing it?
- How can I improve my practice of it?
After about a month of working on a manifesto of sorts, they produced the Principles of Chaos Engineering. The discipline was officially formalized.
The super-formal definition settled upon was: “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” This established that it is a form of experimentation, which sits apart from testing.
The point of doing Chaos Engineering in the first place is to build confidence. This is good to know: if you don’t need confidence, then this isn’t for you, and if you have other ways of building confidence, you can weigh which method is most effective.
The definition also mentions “turbulent conditions in production” to highlight that this isn’t about creating chaos. Chaos Engineering is about making the chaos inherent in the system visible.
The Principles goes on to describe a basic template for experimentation, which borrows heavily from Karl Popper’s principle of falsifiability. In this regard, Chaos Engineering is modeled very much as a science rather than a techne.
Finally, the Principles lists five advanced practices that set the gold standard for a Chaos Engineering practice:
- Build a hypothesis around steady-state behavior
- Vary real-world events
- Run experiments in production
- Automate experiments to run continuously
- Minimize blast radius
Each of these is discussed in turn in the following chapters.
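As a rough illustration (our own sketch, not code from the Principles), the falsifiability-style experiment template these practices imply might look like the following; every name and the `tolerance` parameter are illustrative assumptions:

```python
def run_chaos_experiment(steady_state, inject_fault, revert_fault,
                         tolerance=0.05):
    """Sketch of a chaos experiment built around a falsifiable hypothesis.

    `steady_state` returns a business metric (e.g., requests served per
    second); the hypothesis is that injecting the fault leaves it within
    `tolerance` of the baseline. Returns True if the hypothesis holds.
    """
    baseline = steady_state()            # hypothesize around steady state
    inject_fault()                       # vary a real-world event
    try:
        observed = steady_state()
        deviation = abs(observed - baseline) / baseline
        # Hypothesis holds if the system absorbed the fault.
        return deviation <= tolerance
    finally:
        revert_fault()                   # minimize blast radius: clean up
```

The remaining practices — running in production and automating experiments to run continuously — are properties of where and how often a loop like this executes, not of the loop itself.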
The team at Netflix planted a flag. They now knew what Chaos Engineering was, how to do it, and what value it provided to the larger organization.
Community Is Born
As mentioned, Netflix only hired senior engineers. This meant that if you wanted to hire Chaos Engineers, you needed a pool of experienced people in that field from which to hire. Of course, since they had just invented the discipline, this was difficult to do. There were no senior Chaos Engineers to hire, because there were no junior ones, because outside of Netflix they didn’t exist.
In order to solve this problem, Casey Rosenthal decided to evangelize the field and create a community of practice. He started by putting together an invitation-only conference called “Chaos Community Day” in Autumn 2015. It was held in Uber’s office in San Francisco, and about 40 people attended. The following companies were represented: Netflix, Google, Amazon, Microsoft, Facebook, Dropbox, WalmartLabs, Yahoo!, LinkedIn, Uber, UCSC, Visa, AT&T, New Relic, HashiCorp, PagerDuty, and Basho.
Presentations were not recorded, so that people could speak freely about difficulties they had in convincing management to adopt the practice, as well as discuss “failures” and outages in an off-the-record manner. Presenters were chosen in advance to speak about how they approached issues of resilience, failure injection, fault testing, disaster recovery testing, and other topics associated with Chaos Engineering.
One of Netflix’s explicit goals in launching Chaos Community Day was to inspire other companies to specifically hire for the role “Chaos Engineer.” It worked. The next year, Chaos Community Day was held in Seattle in Amazon’s Blackfoot office tower. A manager from Amazon announced that after the first Chaos Community Day, they had gone back and convinced management to build a team of Chaos Engineers at Amazon. Other companies were now embracing the title “Chaos Engineer” as well.
That year, 2016, attendance went up to 60 people. Companies represented at the conference included Netflix, Amazon, Google, Microsoft, Visa, Uber, Dropbox, Pivotal, GitHub, UCSC, NCSU, Sandia National Labs, Thoughtworks, DevJam, ScyllaDB, C2, HERE, SendGrid, Cake Solutions, Cars.com, New Relic, Jet.com, and O’Reilly.
At the encouragement of O’Reilly, the following year the team at Netflix published a report on the subject, Chaos Engineering, which coincided with several presentations and a workshop at the Velocity conference in San Jose.
Also in 2017, Casey Rosenthal and Nora Jones organized Chaos Community Day in San Francisco at Autodesk’s office at 1 Market Street. Casey had met Nora at the previous Chaos Community Day when she worked at Jet.com. She had since moved over to Netflix and joined the Chaos Engineering Team there. More than 150 people attended, from the usual suspects of large Silicon Valley companies operating at scale as well as various startups, universities, and everything in between. That was in September.
A couple of months later, Nora gave a keynote on Chaos Engineering at the AWS re:Invent conference in Las Vegas to 40,000 attendees in person and an additional 20,000 streaming. Chaos Engineering had hit the big time.
As you will see throughout this book, the concepts threaded throughout Chaos Engineering are evolving rapidly. That means much of the work done in this area has diverged from the original intent. Some of it might even seem to be contradictory. It’s important to remember that Chaos Engineering is a pragmatic approach pioneered in a high-performance environment facing unique problems at scale. This pragmatism continues to drive the field, even as some of its strength draws from science and academia.
¹ Casey Rosenthal built and managed the Chaos Engineering Team for three years at Netflix. Nora Jones joined the Chaos Engineering Team early on as an engineer and technical leader. She was responsible for significant architectural decisions about the tools built as well as implementation.
² Yury Izrailevsky, Stevan Vlaovic, and Ruslan Meshenberg, “Completing the Netflix Cloud Migration,” Netflix Media Center, Feb. 11, 2016, https://oreil.ly/c4YTI.
³ Throughout this book, we’ll generally refer to the availability of the system as the perceived “uptime.”
⁴ In a cloud-based deployment, an “instance” is analogous to a virtual machine or a server in prior industry lingo.
⁵ Adrian Cockcroft, “A Closer Look at the Christmas Eve Outage,” The Netflix Tech Blog, Dec. 31, 2012, https://oreil.ly/wCftX.
⁶ Ali Basiri, Lorin Hochstein, Abhijit Thosar, and Casey Rosenthal, “Chaos Engineering Upgraded,” The Netflix Technology Blog, Sept. 25, 2015, https://oreil.ly/UJ5yM.
⁷ Luke Kosewski et al., “Project Nimble: Region Evacuation Reimagined,” The Netflix Technology Blog, March 12, 2018, https://oreil.ly/7bafg.
Casey Rosenthal is CEO and cofounder of Verica, and was formerly the engineering manager of the Chaos Engineering Team at Netflix. He has experience with distributed systems, artificial intelligence, translating novel algorithms and academia into working models, and selling a vision of the possible to clients and colleagues alike. His superpower is transforming misaligned teams into high-performance teams, and his personal mission is to help people see that something different, something better, is possible. Nora Jones is the cofounder and CEO of Jeli. She is a dedicated and driven technology leader and software engineer with a passion for the intersection between how people and software work in practice in distributed systems. In November 2017 she keynoted at AWS re:Invent to share her experiences helping organizations large and small reach crucial availability with an audience of ~40,000 people, helping kick off the Chaos Engineering movement we see today. Since then she has keynoted at several other conferences around the world, highlighting her work on topics such as Resilience Engineering, Chaos Engineering, Human Factors, Site Reliability, and more from her work at Netflix, Slack, and Jet.com. Additionally, she created and founded the www.learningfromincidents.io movement to develop and open source cross-organization learnings and analysis from reliability incidents across various organizations.