How to Integrate Root Cause Analysis Process Into the Deployment Pipeline?

Published in Byborg Engineering · 10 min read · Apr 21, 2022

What is Root Cause Analysis?

Have you ever missed your bus or train while commuting to the office? Did the delay seriously affect your schedule for the entire day?

Of course, it’s not a big deal if this happens once in a while, but imagine missing it on a frequent basis.

I have been in the same situation multiple times, and each night before going to sleep, I used to brainstorm how to avoid it the following day. Unfortunately, I never had the chance to find a solution. On one particular occasion, it seriously impacted my whole day, resulting in major scheduling setbacks.

Then one day, I decided to dig deeper into what exactly was going wrong. After some reflection, I realized that I always snoozed my alarm twice before getting up (i.e. instead of waking up at 8:00 AM, I would get up at 8:10 AM). Although it’s a small difference, this 10-minute delay set off a chain of events that affected the rest of the day.

I immediately decided to change my alarm to 7:50 AM instead of 8:00 AM. I put this into practice and, after a few days, I could easily follow my new wake-up routine. Eventually, the issue was fixed for good.

Looking back at this example, the root cause was the 10-minute delay from snoozing (cause of the problem), the fix was changing the alarm time (solution applied), and I kept adjusting the alarm as needed for the following days (any further steps needed).

This is a generic example to explain the process flow of “What went wrong”, “How to permanently fix it”, and most importantly “How to avoid it happening again in the future”.

In software development, this process is referred to as “Root Cause Analysis” (RCA), and it helps to perform the above steps in a systematic and methodical way.

Why do we need to perform RCA?

Before diving into the topic, let’s understand and justify why we need to perform RCA.

Naturally, it’s not a big deal when there are no bugs or relatively few bugs during and after a software release. However, it’s not realistic to expect software releases to never have bugs.

That’s why it’s very important that a software development team delivers a product with a minimal chance of producing a red bug after the release. To achieve this, the team should constantly strive to improve by understanding “What went wrong and how to fix/improve the situation”.

Every team should implement best practices in order to constantly challenge their deployment pipeline against evolving business needs.

One of the most important parts of this process is to identify the root cause of a bug and to make sure the issue does not occur again or, when it cannot be completely avoided, that it occurs less often.

What are the drawbacks of Traditional RCA Methods?

In today’s fast-paced, Agile world, it’s essential to have the end product as soon as possible to assess the market needs. During the development process, teams need to continuously challenge the way they are delivering software and apply the necessary changes to reach an end product early. To achieve this, Root Cause Analysis plays a crucial role.

There are many traditional methodologies to perform RCA depending on the type of project. For Waterfall projects, RCA might be done at the end of the project as a separate entity. In this instance, different approaches may be applied, such as fishbone diagrams (Cause and Effect) and/or Pareto analysis (80/20 rule). For other projects, teams might use the 5 Whys approach, failure mode and effects analysis (FMEA), or a scatter diagram.

However, in today’s Agile world, do these traditional methods really help? This is an especially important question to ask with the new generation of Agile projects (“Fail Fast to Success”), where things change very quickly. We have to admit that, to some extent, the core principles of these approaches could be adapted and integrated into the RCA process flow.

However, the main problems with some of these approaches are:

Managed as a separate process: They are handled separately from the team’s deployment process. In such cases, the project manager or quality manager performs the analysis with the respective lead developer/team following the release of the project.
Dedicated Stakeholder: There should be a specific dedicated stakeholder to manage the process. Even when there is a dedicated stakeholder to handle the process (someone who is responsible for managing and organizing meetings, follow-ups, etc.) the eventual goal of “fail fast to understand the root cause” may still not be effectively achieved.
Transparency on lessons learned: Eventual results might not help the team to fully understand where to improve in order to avoid finding a bug on production.

Of course, these are not applicable to all types of projects and there might be some exceptions.

The main goal of integrating RCA into the deployment pipeline is to help teams where “agility and speed” dominate over the “slow and steady” approach.

How to handle RCA without impacting individual/team spirit?

The main purpose of RCA is to find the relevant root cause of a blocker incident and to improve or modify how the software is being developed.

While the RCA is being performed, some stakeholders might become emotional depending on how the situation is being handled (e.g. blaming a developer for the bug).

Therefore, it’s important to keep the same spirit across the team. During this time, it is a good practice to highlight the importance of team effort when releasing the software. If the release is successful, everyone in the team will receive praise and appreciation. In the same way, if there is an issue with the software, everyone should be held accountable and nobody should point fingers at an individual, department, or team.

In order to avoid the situation where the RCA process is impacting individual/team spirit, it is very important that the common agenda is shared across the team and that the process flow is included in the pipeline rather than kept as a separate entity.

Figure: Root causes of hotlines at various development phases

What are the Core Components of the RCA process flow?

Before digging further into integrating the RCA into the deployment pipeline, let’s first understand how to approach a bug with regard to the RCA.

Let’s imagine that we have a Blocker incident on production. Which high-level classifications can be made based on the type of bug?

Bugs that are caused by a release: the main feature being released is broken.
Bugs that are caused by a regression release: a feature that is not in the scope of the release is broken.
Bugs that have been in the system for a long time: a bug that lives in the system unnoticed until someone raises it.
Bugs that cannot be controlled by the development team (non-coding issues): e.g. a database outage, performance degradation, or a security issue due to an upgrade.

For any type of bug, below are the main RCA core components that help to understand and investigate the hotline further:

Problem Category
Cause of the Problem
Solution Applied
Further Steps
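The four components above can be modeled as a small record attached to each bug ticket. The following is a minimal Python sketch, not the authors' actual tooling: the class name `RcaRecord`, its field names, and the sample category value are assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class RcaRecord:
    """The four RCA core components for one bug ticket (hypothetical model)."""

    problem_category: Optional[str] = None
    cause_of_problem: Optional[str] = None
    solution_applied: Optional[str] = None
    further_steps: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Return the names of the components that are still empty."""
        return [
            f.name
            for f in fields(self)
            if not (getattr(self, f.name) or "").strip()
        ]

    def is_complete(self) -> bool:
        """True once all four components have been filled in."""
        return not self.missing_fields()


# "Missed test case" is an invented category, used only as sample data.
record = RcaRecord(problem_category="Missed test case")
print(record.missing_fields())
```

Keeping the components in one structure makes it easy to ask a single question per ticket: which parts of the RCA are still open?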

Now, let’s take a closer look at each of the components.

1. Problem Category:

Immediately after deploying a hotfix, the team should identify the “Problem Category” of the bug in order to start the RCA process.

There might be numerous categories for a bug, but it’s very important to classify the main Problem Category in order to proceed accordingly. Based on various real-world examples, we’ve created the list below of “Problem Categories” that help when classifying a bug.

2/3. Cause of the Problem/Solution Applied:

After classifying the issue, we then need to pinpoint the cause of the problem and apply a relevant solution.

Any bug can be classified into one of the aforementioned categories. Developer colleagues performing the RCA should be transparent and provide all the relevant information during this phase.

Bugs that were caused by a release/regression release:

Cause of the Problem: The most obvious reason would be a deployment ticket, which is causing the issue.
In this case, developers should link the respective faulty package ticket to the “Cause of the Problem” section.
Solution Applied: The first obvious solution might be to revert the package from production as a hotfix.
However, the team should investigate further on why and how the blocker incident was missed during the development/QA phases.
Retros: Based on the investigation, necessary steps should be performed to avoid the issue happening again in the future.
If the use case is not documented, business should add the use case to the requirement specifications.
QA colleagues should add it to the test suite and developer colleagues should include it in the unit/integration testing.

Bugs that are in the system for a long period:

Cause of the Problem: There might be two reasons for having the blocker incident hidden for a long time:

1. Not being aware of the use case when the feature was developed and released.

2. The use case may not be a normal flow and could be based on an edge case.

Solution Applied: Based on the investigation, necessary steps should be taken from the BA/development/QA side.

Bugs that are not in the control of the development team (non-coding issues):

Cause of the Problem: Bugs that cannot be controlled by the development team, such as infrastructure changes, database outages, performance issues, etc., should be monitored on a regular basis.
Solution Applied: The respective teams might help to implement a hotfix. However, depending on the number of issues, the impacted stakeholders (sysadmin team, DBA team, etc.) should look for technology upgrades that might help to reduce the infrastructure dependencies.

4. Further Steps:

Based on the investigation performed, developer colleagues should take the relevant further steps to avoid the issue happening again in the future.

The key parameters that should be included in further steps:

Analysis and Investigation needed.
Setup Action Items.
No Action Needed.

How to integrate RCA process into the deployment pipeline?

It’s challenging to set up a process that includes RCA into the deployment pipeline, where the team automatically completes it without any follow-up meetings or a separate process flow.

The aim of the process should be maintaining “transparency” and “documenting the information at ticket level”.

Depending on the tools being used, the best practice is to automate this process as much as possible. One of the ideal processes is to include the core components (discussed in the above section) at the ticket level.

For every bug ticket, hotline core components can be configured and added at the ticket level (Problem Category, Cause of the Problem, Solution Applied, Further steps).

Appropriate default values could also be configured to streamline the process, as we discussed in the aforementioned process flow.
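As a rough illustration of such defaults, the sketch below pre-fills empty RCA fields according to the type of bug, using the causes and solutions described earlier in this article. The names `BugOrigin`, `DEFAULTS`, and `prefill`, the dictionary-based ticket, and the ticket key `PKG-1` are all hypothetical; a real ticketing tool would expose its own configuration for this.

```python
from enum import Enum
from typing import Dict


class BugOrigin(Enum):
    """High-level bug classifications (hypothetical identifiers)."""

    RELEASE = "release"
    REGRESSION = "regression release"
    LONG_STANDING = "long-standing"
    NON_CODING = "non-coding"


_RELEASE_DEFAULTS = {
    "cause_of_problem": "Faulty deployment ticket (link the package ticket)",
    "solution_applied": "Package reverted from production as a hotfix",
}

# Default texts paraphrase the process flow; adjust them to your own team.
DEFAULTS: Dict[BugOrigin, Dict[str, str]] = {
    BugOrigin.RELEASE: _RELEASE_DEFAULTS,
    BugOrigin.REGRESSION: _RELEASE_DEFAULTS,
    BugOrigin.LONG_STANDING: {
        "cause_of_problem": "Use case unknown at release time, or an edge case",
        "solution_applied": "Follow-up steps from BA/development/QA",
    },
    BugOrigin.NON_CODING: {
        "cause_of_problem": "Outside the development team's control",
        "solution_applied": "Hotfix by the responsible team",
    },
}


def prefill(origin: BugOrigin, ticket: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the ticket with empty RCA fields set to defaults."""
    filled = dict(ticket)
    for field, value in DEFAULTS[origin].items():
        if not filled.get(field):
            filled[field] = value
    return filled


ticket = prefill(BugOrigin.RELEASE, {"cause_of_problem": "PKG-1 broke checkout"})
print(ticket["solution_applied"])
```

Values the developer has already entered are left untouched, so the defaults only act as a starting point.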

The biggest challenge is: how do we get the respective development team to include it automatically in their day-to-day process?

While there are many best practices to achieve this end result, the ideal solution is to link the core components to the ticket closure.

To explain further: for any bug ticket with “Priority / Severity” set to Blocker and “Environment” set to Production, the core components should be configured as mandatory fields that must be filled in before the ticket can be closed. Since ticket closure plays a crucial role in reflecting the team’s efforts, the development team would complete the process without even thinking of it as a separate process.
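The closure rule can be sketched as a simple validation step. This is only an illustration under assumptions: tickets are plain dictionaries, and the keys (`type`, `severity`, `environment`, `status`) and function names are invented here. In practice, the same check would usually be configured in the ticketing tool's own workflow settings rather than written by hand.

```python
from typing import Dict, List

# The four RCA core components, as hypothetical ticket field names.
RCA_FIELDS = (
    "problem_category",
    "cause_of_problem",
    "solution_applied",
    "further_steps",
)


def rca_required(ticket: Dict[str, str]) -> bool:
    """RCA is mandatory for Blocker bugs found on Production."""
    return (
        ticket.get("type") == "bug"
        and ticket.get("severity") == "Blocker"
        and ticket.get("environment") == "Production"
    )


def closure_errors(ticket: Dict[str, str]) -> List[str]:
    """List the RCA fields that still block closing this ticket."""
    if not rca_required(ticket):
        return []
    return [f for f in RCA_FIELDS if not (ticket.get(f) or "").strip()]


def close_ticket(ticket: Dict[str, str]) -> Dict[str, str]:
    """Return a closed copy of the ticket, or refuse if RCA is incomplete."""
    missing = closure_errors(ticket)
    if missing:
        raise ValueError("Missing RCA fields: " + ", ".join(missing))
    closed = dict(ticket)
    closed["status"] = "Closed"
    return closed


blocker = {"type": "bug", "severity": "Blocker", "environment": "Production"}
print(closure_errors(blocker))
```

Because the check sits on the close action itself, there is no separate RCA meeting or follow-up to schedule: a ticket with open RCA fields simply cannot be closed.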

Should RCA also be performed for non-blocker incident tickets during the release pipeline?

It would be an ideal approach to perform RCA for any type of bug ticket in the first place because if we streamline the process starting from unit tests, there are fewer chances of seeing a bug on production.

However, it can be overwhelming for the team to perform the RCA for all types of bugs, depending on factors such as the number of tickets, time constraints, the project release, etc.

The best practice is to configure the core components of the RCA process for all types of tickets, mainly bug tickets. Depending on the team’s approach, stakeholders could perform RCA based on the priority/severity of the tickets.

Some teams might prefer to perform RCA only for high, very high, and blocker bugs, since moderate bugs might not impact the business.
Some teams might prefer to perform RCA for even trivial issues just to strictly avoid any type of bug being released onto production.
Some teams might perform RCA only for blocker bugs on production.
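The three team preferences above can be expressed as interchangeable policies. The sketch below is one possible way to model that choice; `RcaPolicy`, the constant names, and the "Staging" environment are assumptions for illustration, while the severity labels follow the ones used in this article.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class RcaPolicy:
    """Which tickets a team wants to run RCA on (hypothetical model)."""

    severities: FrozenSet[str]
    # None means the policy applies in any environment.
    environments: Optional[FrozenSet[str]] = None

    def applies_to(self, severity: str, environment: str) -> bool:
        """True if a bug with this severity and environment needs RCA."""
        if severity not in self.severities:
            return False
        return self.environments is None or environment in self.environments


# RCA only for high, very high, and blocker bugs.
HIGH_AND_ABOVE = RcaPolicy(frozenset({"High", "Very High", "Blocker"}))

# RCA even for trivial issues.
EVERY_BUG = RcaPolicy(
    frozenset({"Trivial", "Moderate", "High", "Very High", "Blocker"})
)

# RCA only for blocker bugs on production.
BLOCKER_ON_PRODUCTION = RcaPolicy(
    frozenset({"Blocker"}), frozenset({"Production"})
)

print(BLOCKER_ON_PRODUCTION.applies_to("Blocker", "Staging"))
```

With the policy kept as data, a team can switch between these approaches without changing the fields configured on the tickets themselves.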

It’s completely up to the team, engineering manager, and lead devs to choose the best practice based on the team’s interests and real-time dependencies.

Conclusion:

RCA Retros are always a good way to understand “how and what things went wrong” during the deployment pipeline.

Although it’s almost impossible to release software without a bug, it’s very important to put all possible best practices into the deployment pipeline. This helps to keep bugs, and especially blocker bugs, from being released to production.

We are happy to share that these RCA best practices, configured and enforced, helped us to constantly challenge and monitor the deployment pipeline and to take necessary actions based on the feedback from the system.

Written by:

Hari Janapareddy (Product Quality Manager)
Iurii Golikov (Engineering Manager)

Legend:

RCA: Root Cause Analysis
Red Bug: Any bug with Priority / Severity as Blocker, Very High, or High
Blocker Incident: Any bug/defect that makes the system or feature unusable by the end client
QA: Quality Assurance
ATDD: Acceptance Test Driven Development
IT Sec Team: IT Security team
Sys Admins: System Administrators
DBA Team: Database Administrators