Chaos Engineering — What and who is a chaos engineer?
Answering questions from my webinar
I recently did a two-hour webinar dedicated to chaos engineering and got a lot of great questions from the audience. In this mini-series of posts, I will take some time to answer them.
If you missed the webinar, you can access it on-demand from the link below. And if you have questions you would like me to address, feel free to ask me directly on Twitter :-)
These are some of the questions I was asked:
Who’s the best set of people to start looking into chaos engineering in a team?
How can performance engineers drive chaos engineering ideas?
In general, whose responsibility is chaos engineering? Would this fall to the solutions architect/engineering team, a Business Continuity team, or a ‘virtual’ team that spans all teams involved in the application?
Great set of first questions! I grouped them since they are very similar to one another.
First of all, let’s debunk a myth: the myth of the chaos engineer who goes around service teams and surprises them by breaking things randomly, without notifying them, and hoping developers will keep smiling.
It is a myth!
Chaos engineers are more likely to be advocates, helping teams understand what chaos engineering is and how to prepare for it, explaining and even demoing how to do it, and in most cases, coordinating the execution of experiments and GameDays*. But they work WITH the teams, not against them.
I like to think of chaos engineers as program managers instead, with a strong background in software engineering, a good understanding of resiliency patterns, and, more importantly, a passion for the practice of chaos engineering — a contagious passion. Driving the adoption of chaos engineering practices happens through technical presentations and workshops, writing and sharing ideas, support meetings, brainstorming sessions, running GameDays, celebrating wins, etc. The chaos engineer is an evangelist of the discipline, not necessarily the one that pulls the trigger.
Chaos engineering is a practice more than a job definition, and thus everyone in the software engineering or operations teams can use the chaos engineering methodology to improve their systems. Often the best person to inject faults into a software system is the one most intimate with that system. Yes, I am talking about the developer!
The best way to start a chaos engineering practice is thus to start a chaos engineering program** and elect a champion for the job. That champion can be a new hire or not; what matters is that the champion has a strong background in software engineering and a passion for chaos engineering. The rest, like everything else, can be learned.
If you can’t afford to hire someone dedicated to the role, you will still need a program and someone managing it. A program gives substance to an idea, something to show progress and hold onto when things get harder. The program needs some goals. Without goals, there isn’t accountability. However, setting goals requires full awareness of the possible biases associated with setting goals and capturing metrics.
Goals like “reducing the number of sev1 tickets” are not suitable: they don’t focus on learning, and they can be gamed easily by merely not raising ticket severity (which will have a negative impact).
Goals such as “conducting one GameDay a month, with each team” are better since they focus on the action, not the result. Remember, we are trying to set up a new practice and learn new ways of thinking about systems, and the outcome of that is hard to measure directly. Sure, you will see some short-term and long-term benefits, but they often differ between organizations.
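To make an action-based goal like that one concrete, here is a minimal sketch of how a program owner might check it. Everything in it is an assumption for illustration: the team names, the dates, the `gameday_log` structure, and the helper `teams_missing_gameday` are all hypothetical, and a shared spreadsheet would do the job just as well.

```python
from datetime import date


def teams_missing_gameday(gamedays, teams, year, month):
    """Return, sorted, the teams that held no GameDay in the given month."""
    held = {
        team
        for team, day in gamedays
        if day.year == year and day.month == month
    }
    return sorted(set(teams) - held)


# Hypothetical log of (team, date of GameDay) entries.
gameday_log = [
    ("payments", date(2021, 3, 4)),
    ("search", date(2021, 3, 18)),
    ("payments", date(2021, 4, 1)),
]

# Hypothetical list of teams taking part in the program.
all_teams = ["payments", "search", "checkout"]

missing_march = teams_missing_gameday(gameday_log, all_teams, 2021, 3)
print(missing_march)  # ['checkout']
```

Note that the check only asks whether the GameDay happened, not how many incidents followed, which keeps the focus on the action and the learning rather than on a number that can be gamed.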
Ask yourself this simple question: “What do we want to learn?” Then, create the program and goals around that simple idea. Have realistic goals too: chaos engineering will never remove all the risks and potential failures in your system.
* The term GameDay was coined by Jesse Robbins when he worked at Amazon. A GameDay is an exercise during which teams practice responding to an incident in a “safe” environment by purposefully injecting failures in order to increase the availability of software systems. A GameDay is like a fire drill. His talk from 2011 is still my all-time favorite talk.
** I will address the question “how to start a chaos engineering program” in a later post since it deserves its own post.
More reading about chaos engineering: