Learning More When Things Go Wrong: The Root Cause Analysis
Once, while touring the HQ of Rackspace in San Antonio, Texas, we stopped at a wall of diagrams that looked like fishes, but with a ton of writing and technical diagrams on them.
When we looked closer, our friend and guide told us that this was the wall of RCAs, or Root Cause Analyses. An RCA is a process you use to determine what happened when something goes wrong, and what steps you’re taking to learn from it. Note: it’s not an attempt to make sure nothing ever goes wrong again, but rather a way to make sure you learn from each mistake.
At one of the product technology companies where I am CTO, we adopted the RCA template. Whenever we have an outage or a client on our platform is impacted, we do an RCA. We randomly assign someone to perform it, and they go off, collect the data, and do the RCA. Each section has a set of questions, so anyone can facilitate it. While it helps to be close to the problem, the process benefits from having someone else (who may not have even known about or experienced the problem) fill it out. That way, they can focus on the facts and ask the relevant questions (hint: ask “why?” several times) to get to the root cause.
We’re not trying to prevent mistakes. In fact, at high-performing companies where I’ve worked, it’s OK to make them. I’d argue you learn more from mistakes than you do when everything goes right.
Here’s our RCA template (adapted and modified from versions found online and from Rackspace’s RCA Overview, thank you!):
Step 1: Define the Problem
Provides known specific details regarding the issue
Clearly states the problem statement — “What”
Specifies the Date & Time of the problem — “When”
Identifies the location, area, equipment — “Where”
Step 2: The Fishbone Diagram (optional) / Brainstorming
The Fishbone Diagram should be considered an organized brainstorming activity. It is composed of six “bones” that branch off from the head (the problem statement) of the diagram. The diagram helps to group known and potential causes of the problem, which in turn helps to identify potential action items.
Step 3: High Level Cause Map
This section is intended to answer the “Why” at a high level, working backwards from the problem statement and using the fishbone diagram results to identify the true “Root Cause”.
Step 4: Summary of Issue & Cause(s)
This section is used to provide a summary of the issue, the cause(s), a timeline of the actions that led to the issue occurring, and any remediation steps taken to stabilize the event.
A picture is worth a thousand words. When possible, we use pictures to visually identify the cause/problem in an effort to have individuals better relate to the issue being discussed.
Step 5: Action Items Leading Towards Resolution
Using the fishbone and the cause map, this section lists specific action items, along with the owner, due date, and current status of each.
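If you’d rather track RCAs in code than in a document, here is a minimal sketch of how the five steps might map onto a data structure. It assumes Python. The class names, field names, status values, and fishbone category names are all hypothetical (the template above doesn’t prescribe any of them), and the sample incident is invented purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Dict, List, Optional


@dataclass
class ActionItem:
    """Step 5: one action item with an owner, due date, and status."""
    description: str
    owner: str
    due: date
    status: str = "open"  # hypothetical values: "open", "in progress", "done"


@dataclass
class RootCauseAnalysis:
    # Step 1: define the problem (what, when, where)
    what: str
    when: datetime
    where: str
    # Step 2 (optional): fishbone, mapping each "bone" to its brainstormed causes.
    # The bone names are up to you; the template does not fix them.
    fishbone: Dict[str, List[str]] = field(default_factory=dict)
    # Step 3: answers to repeated "why?", ordered from the problem
    # statement back towards the root cause
    whys: List[str] = field(default_factory=list)
    # Step 4: summary of the issue, cause(s), timeline, and remediation
    summary: str = ""
    # Step 5: action items leading towards resolution
    action_items: List[ActionItem] = field(default_factory=list)

    def root_cause(self) -> Optional[str]:
        """Treat the last 'why' answer as the root cause, if any were recorded."""
        return self.whys[-1] if self.whys else None

    def open_items(self) -> List[ActionItem]:
        """Action items that are not yet marked done."""
        return [item for item in self.action_items if item.status != "done"]


def example_rca() -> RootCauseAnalysis:
    """Build an invented sample RCA; none of this data is from a real incident."""
    return RootCauseAnalysis(
        what="Client uploads failed",
        when=datetime(2024, 1, 15, 9, 30),
        where="Upload service",
        fishbone={"Process": ["No capacity review"], "Tools": ["No disk alert"]},
        whys=[
            "Uploads failed because the disk was full",
            "The disk filled because old files were never cleaned up",
            "No alert on disk usage",
        ],
        summary="Uploads failed for about an hour until disk space was freed.",
        action_items=[
            ActionItem("Add disk usage alert", "Alex", date(2024, 1, 22)),
            ActionItem("Free disk space", "Sam", date(2024, 1, 15), status="done"),
        ],
    )
```

Calling example_rca().root_cause() returns the last “why” in the chain, and open_items() gives you the list to chase in your next review. Keeping the person filling this out separate from the people closest to the incident still matters; the structure only makes sure every section gets answered.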