Root cause analysis (RCA)
Root cause analysis (RCA) is a collective term for the wide range of approaches, tools, and techniques used to uncover the causes of problems.
The main benefit of RCA is that it exposes the fundamental errors in the development process, so teams can take the right measures to fix the problems and keep them from recurring.
Guideline
Once a problem occurs, its root cause should be identified promptly and handled properly. Clear communication is key here.
Steps
A root cause analysis works best when everyone who can contribute takes part, so involve as many of those people as possible. Plan the sequence of activities and work through the steps below in the shortest time possible.
Step 1: Define the Issue
- What kind of issue do we see?
- What are the specific symptoms?
- Can we locate the causes?
Step 2: Collect Data
- Do we have as much data and input as possible?
- How long has the issue existed?
- What is the impact and severity of the issue?
Step 3: Identify Possible Causal Factors
- What sequence of events leads to the issue?
- What are the conditions that allow the issue to occur?
Step 4: Identify the Root Cause
- Why does the causal factor exist?
- What is the real reason the issue occurred?
Step 5: Plan and Implement Solutions
- Can a solution be found and applied quickly?
- Should we apply a temporary fix or implement the permanent fix right away?
- Is there a plan to monitor the stability of that fix?
- What can we do to prevent the issue from happening again?
- What are the risks of implementing the solution?
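The five steps above can be tracked as a simple checklist, so the team can see at a glance which questions are still open. The following is a minimal sketch, not a prescribed tool: the step names follow the list above, the questions are abridged, and `RCA_STEPS`, `RcaChecklist`, and its methods are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass, field

# Step names follow the list above; the questions are abridged.
RCA_STEPS = {
    "Define the Issue": ["What kind of issue?", "Which symptoms?"],
    "Collect Data": ["How long has it existed?", "Impact and severity?"],
    "Identify Possible Causal Factors": ["Sequence of events?", "Enabling conditions?"],
    "Identify the Root Cause": ["Why does the causal factor exist?"],
    "Plan and Implement Solutions": ["Temporary or permanent fix?", "Monitoring plan?", "Risks?"],
}


@dataclass
class RcaChecklist:
    # Maps (step, question) to the recorded answer.
    answers: dict = field(default_factory=dict)

    def answer(self, step: str, question: str, text: str) -> None:
        """Record an answer; reject questions that are not part of the step."""
        if question not in RCA_STEPS.get(step, []):
            raise KeyError(f"unknown question for step {step!r}: {question!r}")
        self.answers[(step, question)] = text

    def open_questions(self) -> dict:
        """Return {step: [unanswered questions]} for incomplete steps only."""
        result = {}
        for step, questions in RCA_STEPS.items():
            pending = [q for q in questions if (step, q) not in self.answers]
            if pending:
                result[step] = pending
        return result

    def next_step(self):
        """Return the first step that still has open questions, or None."""
        return next(iter(self.open_questions()), None)


checklist = RcaChecklist()
checklist.answer("Define the Issue", "What kind of issue?", "Timeouts on one instance")
# The symptoms question is still open, so the first step is not finished yet.
print(checklist.next_step())
```

Keeping the steps in order and reporting the first incomplete one discourages jumping to a solution before the issue is defined and the data is collected.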
Clients must be kept informed throughout the process until the issue is resolved. See the email chain samples below.
1. Subject: Emergency Maintenance / Stability issues
Yesterday/Today, we had some latency/timeout issues on one of our instances that caused a <COMPONENT> outage. Although we were able to mitigate some of these issues through <TEMP_SOLUTION>, we have determined that we need to <PERMANENT_SOLUTION>.
An additional notification will follow with details regarding the related production changes. That notification will only be sent to clients that are impacted. The production changes will be performed during non-business hours.
If you have any questions or concerns, please feel free to reach out to our support team.
Thank you
2. Subject: Maintenance (UPDATE)
While the team continues to work on the issue, we have observed that <COMPONENT> is beginning to return to normal. Additionally, we can confirm that our <SYSTEM/COMPONENT> is back online. We will provide you with updates as we progress through the rest of the process.
3. Subject: UPDATE: Intermittent errors
We are still actively investigating this outage as we continue to see issues with <COMPONENT> errors. Please hold off on using the <FEATURE_NAME> feature/tool.
An update will be provided within the next 48 hours.
Thank you
4. Subject: RCA of <DATE> Outage (#1)
Dear <NAME>,
The attached file contains the root cause analysis (RCA) for the production outage that occurred on <DATE>. Technical information can also be found in that report.
Please let us know if you need any further assistance.
Thank you
The description in the emails to clients must contain the following information:
- general overview
- technical review
- current status
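The sample emails above use placeholders such as <COMPONENT> and <DATE>. If these templates are filled in by a script, it is worth guarding against sending a message with a placeholder left unfilled. The sketch below shows one way to do that; `fill_template` and the pattern `PLACEHOLDER` are hypothetical helpers written for this example, and "the login service" is a made-up value.

```python
import re

# Matches placeholders written like <COMPONENT>, <FEATURE_NAME> or <SYSTEM/COMPONENT>.
PLACEHOLDER = re.compile(r"<([A-Z_/]+)>")


def fill_template(template: str, values: dict) -> str:
    """Replace every <NAME> placeholder; fail loudly if any value is missing."""
    missing = sorted({name for name in PLACEHOLDER.findall(template) if name not in values})
    if missing:
        raise ValueError(f"missing values for: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


update = (
    "While the team continues to work on the issue, we have observed that "
    "<COMPONENT> is beginning to return to normal."
)
print(fill_template(update, {"COMPONENT": "the login service"}))
```

Raising an error instead of silently leaving the placeholder in place means an incomplete email is caught before it reaches a client.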
Template
To track the status of findings, one may use the following template. It is recommended to keep track of such cases to perform effective retrospective analysis. Having a unified format helps to log the details in a structured manner.
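One possible way to keep such a log in a unified format is a small structured record. The sketch below is an assumption made for illustration only: the class `RcaRecord`, its field names, and the status values are hypothetical, derived from the questions in the five steps above, and the sample values are made up.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class RcaRecord:
    issue: str                 # Step 1: what kind of issue was seen
    symptoms: list             # Step 1: specific symptoms
    first_seen: date           # Step 2: how long the issue has existed
    impact: str                # Step 2: who or what is affected
    severity: str              # Step 2: severity of the issue
    causal_factors: list = field(default_factory=list)  # Step 3
    root_cause: str = ""       # Step 4
    temporary_fix: str = ""    # Step 5
    permanent_fix: str = ""    # Step 5
    status: str = "open"       # hypothetical values: "open", "mitigated", "resolved"

    def to_json(self) -> str:
        """Serialize the record; dates are written as ISO strings via str()."""
        return json.dumps(asdict(self), default=str, indent=2)


record = RcaRecord(
    issue="Timeouts on one instance",
    symptoms=["high latency", "request timeouts"],
    first_seen=date(2024, 1, 15),
    impact="One component unavailable",
    severity="high",
)
record.status = "mitigated"
print(record.to_json())
```

Storing each case in the same shape makes it straightforward to filter or compare past incidents during a retrospective.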