Effective RCA

Varun
3 min read · Mar 9, 2022

Root cause analysis (RCA) is a process for understanding the real causes behind a problem, so that one learns why the problem arose in the first place. By digging deeper and collecting data with analysis techniques, one can identify the contributing factors and form an action plan that prevents the problem from occurring again. One must also derive learnings from the analysis and incorporate them when building similar things.

To perform a root cause analysis, one must go through these steps:

  • Define problem
  • Gather data
  • Deep dive
  • Solutions & Learnings

One can add more granular steps, but at a high level the analysis falls into these four.

Define Problem

To form a problem statement, one needs to identify the symptoms precisely and analyze what is going wrong in the system. This helps produce a crisp problem statement (or issue statement) for the RCA.

Gather Data

Before diving into the analysis of the problem, it’s important to gather some key information, namely:

  • Time to detect (TTD)
  • Time to resolve (TTR)
  • Impact (business & technical, as applicable)

One may also capture other details, such as the incident timeline, but those are optional.
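
As a rough illustration of how the TTD and TTR listed above might be derived from an incident timeline, here is a minimal Python sketch. The `Incident` class, its field names, and the sample timestamps are all hypothetical. It also assumes TTR is measured from detection to resolution; some teams measure it from the start of the incident instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of the three timestamps an RCA usually needs."""
    started_at: datetime   # when the problem actually began
    detected_at: datetime  # when an alert or a person noticed it
    resolved_at: datetime  # when the system was back to normal

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_resolve(self) -> timedelta:
        # Assumption: TTR counts from detection, not from the start.
        return self.resolved_at - self.detected_at


# Made-up timestamps, for illustration only.
incident = Incident(
    started_at=datetime(2022, 3, 9, 10, 0),
    detected_at=datetime(2022, 3, 9, 10, 25),
    resolved_at=datetime(2022, 3, 9, 11, 40),
)

print("TTD:", incident.time_to_detect)
print("TTR:", incident.time_to_resolve)
```

Comparing these durations against whatever thresholds the team has agreed on tells one whether TTD or TTR deserves its own line of questioning later.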

Deep Dive

This is one of the most crucial steps, and the one where things tend to get missed. A deep dive is done by asking the right questions about the problem statement. The typical method is five whys analysis: ask “Why did this problem happen?” and then follow the answer with a series of additional “But why?” questions until the root cause of the problem is identified.

Another important point is that these questions should not be limited to the problem statement. The same line of questioning should also be applied to the three data points captured in the second step (gathering data) whenever they breach the threshold or desired number. To illustrate, the line of questioning should be:

  • Why did the <issue> happen? (mandatory)
  • Why was the TTD so high? (if applicable)
  • Why was the TTR so high? (if applicable)
  • Why was the <impact> number so high? (if applicable)

Each answer to the above questions should be followed by a series of “But why?” questions until the root cause is identified.
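
The questioning described above can be sketched in Python as follows. This is only one possible way to record it: `opening_questions`, `WhyChain`, the issue text, and the sample answers are all hypothetical. The chain simply treats the last recorded answer as the root cause, which assumes the team stopped asking only once they reached it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def opening_questions(issue: str, ttd_breached: bool,
                      ttr_breached: bool, impact_breached: bool) -> List[str]:
    """Build the first-level questions; only the issue question is mandatory."""
    questions = [f"Why did {issue} happen?"]
    if ttd_breached:
        questions.append("Why was the TTD so high?")
    if ttr_breached:
        questions.append("Why was the TTR so high?")
    if impact_breached:
        questions.append("Why was the impact so high?")
    return questions


@dataclass
class WhyChain:
    """One opening question followed by successive 'But why?' answers."""
    question: str
    answers: List[str] = field(default_factory=list)

    def but_why(self, answer: str) -> "WhyChain":
        self.answers.append(answer)
        return self  # returning self allows chained calls

    @property
    def root_cause(self) -> Optional[str]:
        # Assumption: the last answer recorded is the root cause.
        return self.answers[-1] if self.answers else None


# Made-up example, for illustration only.
questions = opening_questions("the checkout outage", ttd_breached=True,
                              ttr_breached=False, impact_breached=False)
chain = (WhyChain(questions[0])
         .but_why("The payment service ran out of connections.")
         .but_why("A config change lowered the connection pool size.")
         .but_why("Config changes are not reviewed before rollout."))

print(questions)
print("Root cause:", chain.root_cause)
```

One chain per opening question keeps the reasoning for the issue, the TTD, the TTR, and the impact separate, so each can produce its own action items.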

Solutions & Learnings

This is the most important step of the RCA, as it ensures that the same issue doesn’t recur. The mitigation of the issue can happen at different levels:

  • Short term is the set of immediate fixes needed to resolve the issue and bring the system(s) back to normal behavior. This mostly covers config changes, hot fixes, data fixes, etc.
  • Medium term is the set of fixes done after the users are unblocked and the important features of the product are live again. This mostly involves cleaning up corrupted data, deploying code fixes, etc.
  • Long term is the set of tasks picked up after the issue is resolved, mainly to ensure that the problem is fixed permanently. If the time to detect was high, ensure proper alerting and monitoring are in place. If the time to resolve was high, ensure the logs, capabilities, etc. required to debug the issue faster are in place. One has to make sure these action items (AIs) are tracked somewhere and plan to take them to closure as soon as possible.
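
Tracking the action items from the three levels above can be as simple as a list with a term, an owner, and a status. The sketch below is a minimal, hypothetical example of that idea; the `Term` and `ActionItem` names and the sample items are made up, and in practice a team would more likely use its existing ticketing tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Term(Enum):
    SHORT = "short term"
    MEDIUM = "medium term"
    LONG = "long term"


@dataclass
class ActionItem:
    description: str
    term: Term
    owner: str
    done: bool = False


def open_items(items: List[ActionItem]) -> List[ActionItem]:
    """Return the action items that still need to be taken to closure."""
    return [item for item in items if not item.done]


# Made-up action items, for illustration only.
action_items = [
    ActionItem("Restore connection pool size", Term.SHORT, "alice", done=True),
    ActionItem("Clean up orders corrupted during the outage", Term.MEDIUM, "bob"),
    ActionItem("Add an alert on connection pool saturation", Term.LONG, "carol"),
]

for item in open_items(action_items):
    print(f"[{item.term.value}] {item.description} (owner: {item.owner})")
```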

Along with AIs, learnings should also come out of every series of “Why” questions, and these learnings should be broadcast to the team so that they are taken into account when similar things are built in the future. A learning can be as small as tweaking a configuration, but the change and the reason for it should be captured and communicated.
