Experiment: Use Critical Incident Reviews to Build Antifragility

Published in The Liberators · 5 min read · Sep 14, 2020

In our book, the Zombie Scrum Survival Guide, we dive deep into what causes Zombie Scrum: something that looks like Scrum from a distance but lacks a beating heart. We also offer 40+ experiments to recover from Zombie Scrum. In this series, we share experiments that didn’t make it into the book but are still very helpful. Download the paper “10 Powerful Experiments to Overcome Zombie Scrum” to get more inspiration on how to fight Zombie Scrum.

Mistakes are inevitable in complex work. Instead of trying to avoid them, you can use them to grow stronger. This is at the core of “Antifragility”.

Instead of trying to resist variation and shocks, antifragile systems grow stronger when they are pressured. For example, engineering teams at Netflix created a tool called “Chaos Monkey” to randomly terminate services in their infrastructure. Every time a terminated service ends up causing disruptions to end-users, engineering teams redesign the architecture to reduce the impact. Over time, responding to random shocks helps Netflix to make its infrastructure stronger. Space Exploration Technologies (SpaceX) has a launch cadence that is purposefully higher than other launch providers. Every time a launch fails, their self-managed teams update technology, protocols, and processes to avoid similar failures in the future.

This experiment uses the Liberating Structure “What, So What, Now What” to help teams reflect on failure and use it to identify what can be done to make the system (of their team & organization) stronger against similar and other failures.

Required Skill

Critical Incident Reviews benefit from clear facilitation and from powerful questions that help the group dig deeper.

Impact on Survival

The skill to analyze and learn from mistakes is vital for self-organization and continuous improvement.

Steps

To implement this experiment, do the following:

1. Run this experiment as soon as possible after your Scrum Team experiences a mistake. This can be a huge, highly impactful mistake or a smaller one. The more serious the mistake, the more time you want to take to learn from it. Include whoever was involved in, or affected by, the mistake.
2. First, give people time to detach themselves from the emotions surrounding the mistake. In random pairs, ask people to share their experience of the mistake and its consequences (2 min). Repeat three more times in new pairs. Then ask the whole group to share two or three patterns they noticed (5 min).
3. Set the stage for the Critical Incident Review. The purpose is to learn from a mistake and to prevent similar ones in the future. Emphasize that the aim is not to assign blame, even when the mistake is attributable to a single person or subgroup, because those people are part of a larger group that could’ve helped prevent the mistake.
4. In small groups, ask everyone to retell the story of the mistake. Ask “Working backward from the moment you discovered the mistake, create a timeline of what happened. What were the actions? Who were the actors? How did information flow? Who was missing?” Encourage people to channel their inner detective and leave opinions, interpretations, and conclusions aside for now. Give everyone time to get their own thinking started (2 min), then let them work together in their small groups to create the timeline (15 min). With the whole group, share the timelines and notice similarities and differences (10 min).
5. Now that the group has a better sense of what actually happened, ask “What is important about this? What does this mean about our work as a team? What conclusions can we draw?” Have people reflect individually and in silence first (2 min), then invite them to share their ideas in their small groups (5 min). Capture the most important insights with the whole group (10 min).
6. Help the group turn their discoveries from the previous round into improvements. Ask “Now what? How can we reduce the blast radius of similar mistakes in the future or avoid them altogether?” Have people reflect individually and in silence first (2 min), then invite them to share their ideas in their small groups (5 min). Capture the most important ideas with the whole group (10 min). Formulate 15% Solutions for the most promising actions.
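
To plan the timebox, it can help to add up the timings mentioned in the steps above. The following is only a rough sketch in Python: the `Step` class, the `AGENDA` list, and the step labels are our own illustrative names, and we assume the 2 minutes for the pair conversations apply to each of the four rounds. Untimed parts, such as setting the stage and formulating 15% Solutions, are left out, so treat the total as a lower bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One timed part of the review (illustrative helper, not from the article)."""
    name: str
    minutes: int


# Timings copied from the steps above. Assumption: the 2-minute pair
# conversation is repeated three more times, so it runs four rounds in total.
AGENDA = [
    Step("Share experiences in pairs, then patterns", 4 * 2 + 5),
    Step("What: build and compare timelines", 2 + 15 + 10),
    Step("So what: draw conclusions", 2 + 5 + 10),
    Step("Now what: identify improvements", 2 + 5 + 10),
]


def total_minutes(agenda):
    """Sum the timed parts only; untimed facilitation is not included."""
    return sum(step.minutes for step in agenda)


if __name__ == "__main__":
    for step in AGENDA:
        print(f"{step.minutes:>3} min  {step.name}")
    print(f"{total_minutes(AGENDA):>3} min  total (timed parts only)")
```

For a more serious mistake, you would likely stretch the timeline and conclusion rounds rather than the opening conversations.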

Our Findings

Groups often attempt to prevent mistakes through standardization, for example by enacting policies and guidelines. The problem with policies is that they only work when people follow them, and even then they can stifle autonomy and creativity. Instead, help groups explore how policies can be automated (e.g. automated testing) or how the cost of mistakes can be limited through other means. After all, mistakes are inevitable.
Critical Incident Reviews work best when the initial emotions have settled, but the incident is still fresh in everyone’s mind; don’t wait until the end of a Sprint to review a critical incident that happened during the Sprint.

Experience from the Field

“A few years ago, most of our web-based infrastructure collapsed all of a sudden. Everyone scrambled to figure out what had happened while at the same time answering phone calls and emails from concerned customers. It turned out that a few SSL-certificates had expired at the same time, some of which were used by web-services that most of our web-applications relied on. Although finding and fixing the issue took 25 minutes, it took much longer for us to appease angry customers.

During a ‘Critical Incident Review’ with the entire team, we discovered that one member of our team had been notified of the upcoming expiration a week before, but ignored the email because he assumed it had already been taken care of. Since we relied on many different SSL-certificates, we often received notifications about expiration — some important, some not. To strengthen our system against similar failures in the future, we developed three solutions.

The first was the installation of a simple monitoring tool on a screen in the team room that turned red when an SSL-certificate was within days of expiration. The second was a switch to auto-renewed SSL-certificates wherever possible, starting with the most critical services; this technology wasn’t available before. Finally, we made updating and monitoring SSL-certificates the responsibility of the entire team, not just one person. Since that review, we haven’t had other issues with expiring SSL-certificates. In fact, we found the monitoring tool so helpful that we expanded it with checks for response time, certain security issues, and ‘chatty HTTP headers’.”
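
A monitor like the one in this story can start out very small. The sketch below is not the tool the team built; it is a minimal illustration in Python using only the standard library. The function names, the `WARN_DAYS` threshold of 14 days, and the status labels are our own assumptions (the story only says "within days of expiration").

```python
import socket
import ssl
from datetime import datetime, timezone

# Assumption: warn two weeks ahead. The story does not state a threshold.
WARN_DAYS = 14


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the certificate's 'notAfter' field, e.g. 'Jun  1 12:00:00 2030 GMT'.

    Note: with the default context, a certificate that has already expired
    fails the handshake with ssl.SSLCertVerificationError instead of
    returning a value, so callers should treat that error as a red status.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now):
    """Days between `now` (timezone-aware) and the certificate's expiry."""
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since epoch, UTC
    return (expires_at - now.timestamp()) / 86400


def status(days_left, warn_days=WARN_DAYS):
    """Map the remaining days to the colour shown on the team-room screen."""
    if days_left <= 0:
        return "EXPIRED"
    return "RED" if days_left <= warn_days else "GREEN"


def check_host(host):
    """Fetch a host's certificate over the network and return its status."""
    not_after = fetch_not_after(host)
    return status(days_until_expiry(not_after, datetime.now(timezone.utc)))
```

Calling `check_host("example.com")` on a schedule for each service, and showing the result on a shared screen, gives the whole team the same signal instead of relying on one person's inbox.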

How Did it Go?

We’d love to hear how it went when you’ve tried this experiment. With your feedback, we can empirically improve experiments, add new ones, and remove what doesn’t work. Let us know in the comments how it went and/or fill in this short feedback form.

Looking for more experiments?

Aside from a deep exploration of what causes Zombie Scrum, our book contains over 40 other experiments (like this one) to try with your Scrum Team. Each of them is geared towards a particular area where Zombie Scrum often pops up. If you’re looking for more experiments, or if these posts are helpful to you, please consider buying a copy.