Root Cause Analysis in Enterprise Networks

Root Cause Analysis (RCA) is a method of problem-solving used for identifying the root causes of faults or problems.

vidya sivaraju
Brillio Data Science
May 25, 2022


Networks today have evolved quickly to include business-critical applications and services that users across the organization rely on heavily. In this environment, network technicians are required to troubleshoot increasingly complex issues, and a significant amount of their time is spent troubleshooting network problems, finding the root cause, and then fixing it.

We developed a solution strategy that addresses network faults using predictive modeling techniques to perform root-cause analysis in real time. It remediates various root causes through a sequence of steps that can be fully automated, accelerating issue diagnostics for the network operator.

Challenges in identifying the root cause in a network:

1. Establishing the root cause is challenging because the rules tend to be unique to each network.

2. Vendors, topologies, interface specifications, etc. tend to differ considerably, so the rules need to be customized for each network.

This can be addressed with hierarchical root-cause analysis:

The hierarchical root-cause structure is derived from the process that network operators follow to resolve network faults. Hierarchical models are developed so that a fault can be traversed along the shortest path to the correct root-cause identification model.
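As an illustration of the idea, the routing through the hierarchy can be thought of as two stages: a top-level model narrows a syslog down to a fault family, and a family-specific model then identifies the precise root cause. The sketch below is a minimal, hypothetical version of that flow; the model objects and family names are assumptions, not the production models.

# Minimal sketch of hierarchical routing. family_model and rc_models are
# hypothetical, already-trained scikit-learn-style classifiers; the family
# names ("auth", "interface", ...) are placeholders.
def predict_root_cause(syslog_message, family_model, rc_models):
    """Route one syslog through the hierarchy: fault family first, then root cause."""
    family = family_model.predict([syslog_message])[0]           # level 1: fault family
    root_cause = rc_models[family].predict([syslog_message])[0]  # level 2: specific root cause
    return family, root_cause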

A Stepwise Solution Approach:

ML models can be trained to detect the cause of network faults with a high degree of accuracy. Typically, historical log messages are used for training in such cases. Guided remediation steps can then be provided to either the end user or the network operator to resolve the issue.

Workflow:

Model Workflow

1. Unstructured Data Generated at Source: Raw, unstructured syslogs are captured on network devices.

Unstructured syslog data view:

Unstructured Data

2. Converting Unstructured to Structured Data: The unstructured data is preprocessed into structured data by passing it through a parser function that is capable of collecting the relevant attributes: message, auth method, and error codes.

Structured syslog data view:

Structured Data

Code Snippet
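The original parsing code is shown only as an image, so the snippet below is a rough sketch of what such a parser could look like. The regular expressions, the example syslog line, and the field names (message, auth_method, error_code) are illustrative assumptions, not the exact production parser.

import re
import pandas as pd

# Hypothetical pattern; real parsing depends on the vendor's syslog format.
SYSLOG_PATTERN = re.compile(
    r"%(?P<facility>[\w-]+)-(?P<severity>\d)-(?P<mnemonic>\w+):\s*(?P<message>.*)"
)

def parse_syslog(line):
    """Extract structured attributes (message, auth method, error code) from one raw line."""
    match = SYSLOG_PATTERN.search(line)
    if not match:
        return None
    record = match.groupdict()
    # Auth-method and error-code extraction are illustrative guesses.
    auth = re.search(r"\b(dot1x|mab|radius|tacacs)\b", line, re.I)
    code = re.search(r"\berror[- ]?code[:= ]?\s*(\w+)", line, re.I)
    record["auth_method"] = auth.group(1).lower() if auth else None
    record["error_code"] = code.group(1) if code else None
    return record

raw_lines = [
    # Hypothetical example line, for illustration only.
    "%DOT1X-5-FAIL: Authentication failed for client on Gi1/0/1, error-code: 23",
]
structured = pd.DataFrame(filter(None, (parse_syslog(l) for l in raw_lines)))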

Text Data cleaning:

Code for Text Data cleaning
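The cleaning code is likewise only available as an image; below is a minimal sketch of typical steps (lowercasing, stripping punctuation and digits, removing stopwords). The stopword list and column names are assumptions for illustration.

import re

# Small illustrative stopword list; the production cleaning step may differ.
STOPWORDS = {"the", "a", "an", "is", "on", "for", "to", "of", "and"}

def clean_text(message):
    """Lowercase, keep letters only, and drop stopwords from a syslog message."""
    message = message.lower()
    message = re.sub(r"[^a-z\s]", " ", message)
    tokens = [t for t in message.split() if t not in STOPWORDS]
    return " ".join(tokens)

# Assuming the structured DataFrame from the parsing sketch above.
structured["clean_message"] = structured["message"].apply(clean_text)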

3. Data Pre-Processing

a) The imbalanced data is converted into balanced data using resampling techniques such as SMOTE (synthetic oversampling) and Tomek Links (TLink, undersampling), as sketched after this list.

b) Vectorization: Categorical and textual data are converted into vectors using CountVectorizer, FastText, and OneHotEncoder. A pipeline is developed to vectorize the text messages and error codes into vectors as input to the classifier.
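One plausible way to wire these two steps together, assuming scikit-learn and imbalanced-learn, is shown below. The DataFrame and column names (df, clean_message, error_code, root_cause) and the SMOTE + Tomek Links combination are assumptions based on the description above; FastText embeddings could be substituted for the CountVectorizer.

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
from imblearn.combine import SMOTETomek

# Vectorize the cleaned message text and one-hot encode the error codes.
vectorize = ColumnTransformer([
    ("msg", CountVectorizer(), "clean_message"),
    ("err", OneHotEncoder(handle_unknown="ignore"), ["error_code"]),
])

# df is the structured, cleaned syslog DataFrame; "root_cause" is the label column (placeholder names).
X_vec = vectorize.fit_transform(df)
# SMOTE oversamples minority root causes; Tomek links remove ambiguous majority samples.
X_train, y_train = SMOTETomek(random_state=42).fit_resample(X_vec, df["root_cause"])

In practice the resampling is applied to the training split only, so the test data stays untouched.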

4. Model Building, Tuning, and Evaluation

After getting the preprocessed data from the pipeline, the steps below are performed to build and tune the models:

Model training is done with a Random Forest classifier, hyperparameter tuning is executed with GridSearchCV, and evaluation uses a per-model (local) confusion matrix and ROC/AUC curves. MLflow is integrated with the AWS SageMaker notebook to track and monitor model artifacts.
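A condensed sketch of this step, assuming an MLflow tracking server is reachable from the SageMaker notebook, might look like the following. The parameter grid and the held-out variables (X_test, y_test) are placeholders, not the actual configuration; X_train and y_train are the balanced features from the preprocessing sketch above.

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV

# Hypothetical search space; the real grid is not shown in the article.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 20]}

with mlflow.start_run(run_name="rca-random-forest"):
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, cv=5, scoring="f1_macro")
    search.fit(X_train, y_train)          # vectorized, balanced training data

    y_pred = search.predict(X_test)
    mlflow.log_params(search.best_params_)
    mlflow.log_metric("cv_f1_macro", search.best_score_)
    # One-vs-rest AUC for the multi-class case.
    mlflow.log_metric("roc_auc_ovr",
                      roc_auc_score(y_test, search.predict_proba(X_test), multi_class="ovr"))
    mlflow.log_text(classification_report(y_test, y_pred), "classification_report.txt")
    mlflow.sklearn.log_model(search.best_estimator_, "model")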

MLflow Experiments

5. Model Validation

For model validation, a customized confusion matrix, the Global Confusion Matrix, is utilized; it helps in comparing the predicted class to the actual class. Model development proceeds in a sequence, on a use-case-by-use-case basis, so the number of root causes changes over time, and all developed models are compared with each other using a one-vs-all comparison.

Constant checks are needed to identify false positives (a falsely predicted positive class), as these can cause unrelated remediation actions to be executed on the network. After every model development exercise, the Global Confusion Matrix is updated by running all existing and newly developed models on an exhaustive test dataset. It helps in analyzing the accuracy, precision, and recall metrics. If the false positives for a certain model change, the model is considered for retraining to resolve the issue.

For the resolution of false positives, the following steps were followed:

a) Check for data leaks: If two models were trained with the same negative class, in the current case successful-connection syslogs, this might create a data leak, i.e., false-positive predictions for similar syslogs. The remediation is to tweak the negative class and add irrelevant syslogs.

b) Existence of similar syslogs for different RCs: Since syslogs are standard-format messages, they may be almost identical for two root causes, differing by only a few words. To remediate this, retrain the model after adding the similar syslogs from the other RCs to the negative class. This helps the model learn the difference between the two root causes.

c) Addition of an irrelevant class in a multi-class model: Multi-class models comprise multiple positive classes, and when given a syslog from another root cause, they will classify it into one of their own root-cause classes. This leads to unnecessary false-positive predictions. The remediation for such a scenario is to place all root-cause syslogs except the model's own into a new irrelevant-syslog class, which acts as a negative class for the multi-class model (a sketch of this relabeling follows the list).
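A sketch of this relabeling with pandas, assuming a DataFrame of labeled syslogs (the column names are illustrative), is shown below.

import pandas as pd

def add_irrelevant_class(all_syslogs, model_root_causes):
    """Keep the model's own root-cause syslogs as positive classes and relabel
    every other root cause's syslogs as a single 'irrelevant' negative class."""
    own = all_syslogs[all_syslogs["root_cause"].isin(model_root_causes)].copy()
    other = all_syslogs[~all_syslogs["root_cause"].isin(model_root_causes)].copy()
    other["root_cause"] = "irrelevant"
    return pd.concat([own, other], ignore_index=True)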

All the model validation steps were executed in an AWS SageMaker notebook, and the Global Confusion Matrix was saved to an S3 bucket.

Global Confusion Matrix
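The exact layout of the Global Confusion Matrix is not reproduced here; as a simplified stand-in, the sketch below runs every existing model on a shared test set, records macro precision and recall per model, and writes the table to S3. The models dictionary, test-set variables, and bucket path are placeholders; writing directly to an s3:// path assumes s3fs is installed.

import pandas as pd
from sklearn.metrics import precision_score, recall_score

rows = []
for name, model in models.items():          # dict of all deployed RCA models (placeholder)
    y_pred = model.predict(X_global_test)   # shared, exhaustive test set (placeholder)
    rows.append({
        "model": name,
        "precision": precision_score(y_global_test, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_global_test, y_pred, average="macro", zero_division=0),
    })

global_cm = pd.DataFrame(rows)
global_cm.to_csv("s3://<bucket>/rca/global_confusion_matrix.csv", index=False)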

Core principles that guide the methodology of root cause analysis:

1. Focusing on correcting and remedying root causes rather than just symptoms.

2. Providing enough information to inform a corrective course of action.

3. Considering how a root cause can be prevented in the future.

4. The realization that there could be more than one root cause.

Thank you, Asit and Hitendri Bomble, for your contributions.
