Stop Doing Root Cause Analysis

It’s not benefiting you in ANY measurable way

j:hand
Jul 24, 2017 · 4 min read

If you are in IT … please … for your own good .. Stop!

https://upload.wikimedia.org/wikipedia/commons/2/20/Root_Cause_Analysis_Tree_Diagram.jpg

A few years ago I came to two critical realizations about working in IT.

The first realization is that there is NEVER a root cause to IT problems.

Obviously some “thing” went wrong and of course we want to understand “what” and “how” .. but within the world of complex IT systems there are too many moving parts to triangulate a single “cause”.

Many factors contribute to failure (and success). Period.

This is the reality of managing modern IT systems. Unfortunately, the reluctance of IT professionals (especially leadership) to accept this reality is hamstringing organizations in ways they can’t “feel” until it’s too late. Sadly, until they feel the pain, they don’t seek out improvements. In IT and the businesses it supports, that typically happens way too late.

If you aren’t focusing your efforts on identifying, learning, and improving what you know about your “system” in a holistic way .. you haven’t quite grasped what you’re actually dealing with in today’s IT.

The satisfaction of (incorrectly) identifying “cause” during retrospective examination brings the entire exercise and learning opportunity to a complete stop and wastes key business resources such as time, effort, and money!

Correlation of factors involved in the problem is what I actually discovered. I didn’t realize I was drawing unconscious and incorrect conclusions about what actually took place, especially from a broader, holistic, “systems thinking” viewpoint. I oversimplified complexity and it led me and my company to a very bad place.

https://imgs.xkcd.com/comics/correlation.png

The second realization was that the methods I previously used to retrospectively understand what happened when things went wrong (i.e. Root Cause Analysis) weren’t making anything better. Literally nothing.

RCA was the process put in place and I adhered to it. No questions asked. Despite the fact that the same type of things kept happening, but in slightly different ways, I kept repeating the retrospective analysis the same exact way.

I became very good at identifying something that stopped working, came unplugged, or crashed spectacularly that “caused” the problem… but if I’m being completely honest .. identification of a “cause” didn’t prepare me for the inevitable recurrence of something very similar in the future.

Maybe I found out a network switch overheated and stopped working but it didn’t help to prevent something similar from happening again.

Understanding the cause of a destructive wildfire may help seek out methods of observability and detection but it doesn’t prevent a repeat occurrence. More importantly, understanding cause does little to educate and improve a fire response crew on what to do when (NOT if) something happens again.

What I CAN do .. is understand more about how systems (including the “people” part) behave in response to problems.

My Personal Epiphany

THE MOST IMPORTANT THING I realized about performing RCAs following an IT problem was that I wasn’t spending ANY time evaluating how well (or poorly) my processes (including the “people” part) helped me to know about a problem sooner .. and subsequently recover from a problem sooner.

I was fixated on cause …
Because someone told me I should be.

I spent all of my time and attention searching for (but actually manifesting) the root cause of the problem. Once I had concluded the cause, I established a fix, and I convinced myself and others that the problem was solved and would likely never happen again.

I was wrong. That company no longer exists. My approach to examining failure with IT systems was a contributing factor to the demise of the business.

Current and future IT professionals must come to the following conclusion:

Our best hope for increasing uptime of IT systems is consistently evaluating the methods we use to “know about” and “respond to” problems. Reducing the time it takes to “know about” AND “recover from” failures is mandatory for survival. Analyzing the response to problems, in addition to the factors that contributed to the problem in the first place, means individuals, teams, orgs, and the business as a whole have a deeper understanding of their system and are poised to continuously improve all aspects related to maintaining maximum uptime.
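To make “know about” and “recover from” a little more concrete, here is a minimal Python sketch of how a team might track both times across incidents. Everything in it is an assumption for illustration: the `Incident` fields, the `summarize` helper, and the sample timestamps are made up, and recovery time is measured from detection to restoration, which is only one of several reasonable definitions.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical field names chosen for this sketch.
    started: datetime    # when the failure actually began
    detected: datetime   # when a person or an alert first noticed it
    recovered: datetime  # when service was restored


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents):
    """Return mean time to detect and mean time to recover, in minutes."""
    return {
        "mean_time_to_detect": mean_minutes(
            [i.detected - i.started for i in incidents]
        ),
        # Assumption: recovery is measured from detection, not from start.
        "mean_time_to_recover": mean_minutes(
            [i.recovered - i.detected for i in incidents]
        ),
    }


# Made-up example data, not taken from any real incident.
incidents = [
    Incident(
        started=datetime(2017, 7, 1, 9, 0),
        detected=datetime(2017, 7, 1, 9, 20),
        recovered=datetime(2017, 7, 1, 10, 20),
    ),
    Incident(
        started=datetime(2017, 7, 8, 14, 0),
        detected=datetime(2017, 7, 8, 14, 10),
        recovered=datetime(2017, 7, 8, 14, 40),
    ),
]

summary = summarize(incidents)
print(summary)
```

Watching these two numbers from one review to the next says nothing about “cause”, but it does show whether detection and response are actually getting faster, which is the thing a review can improve.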

I recently released a 90+ page book that explains all of this in great detail .. and I’d love for all IT professionals to take a look and share your thoughts with me!

Have you adopted more modern practices to match the increasing importance of reliability and availability for critical IT services? Why not? What’s preventing you from leveling up? Is it the way you approach learning from failure? Is it the lack of an approach at all? Are existing processes tethering you to “the old way”? What stories do you have to share?

FREE “Post-Incident Reviews” book from O’Reilly Media.

http://jhand.co/PIR_Medium

Written by

j:hand

Author, speaker, advocate for building resilient systems and people
