Confusion Matrix, Precision, Recall, and F1 Score: Performance Metrics for Log Anomaly Detection Models in IBM Cloud Pak for Watson AIOps

Authors: Fabiola Uwashema, Julius Wahidin

Jan 31, 2022

These performance-metric terms may be familiar to a data scientist; however, if you are an engineer who wants to adopt AI Operations (AIOps), they might confuse you. You might be more familiar with F1 as in Formula 1 racing.

As you use tools such as IBM Cloud Pak for Watson AIOps (CP4WAIOps), it is helpful to understand these terms. Most AIOps tools allow you to model the IT environment you are monitoring. Usually, the first step is to model your applications, infrastructure, or anything else important. To build the model, you use a set of data produced by the system during normal operation: the "normal" data, which should be a good representation of the typical distribution of your log messages.

Next, you need to keep the model current by retraining it. Machine learning models grow old; even when nothing drastic happens, small changes accumulate and cause the data to drift. The model should re-learn the patterns from the most recent data, which better reflects reality, to stay up to date. Data drift is a major reason why a model's performance decreases over time, so it is crucial to assess how well your algorithm is performing. This is where these terms become essential: they measure the performance of your models. One of the easiest measures of performance is accuracy.

Beyond Accuracy

Imagine trying to detect a very rare brain tumor that occurs in only 1 in 100,000 patients. By default, you could predict "no brain tumor" for every person and be correct 99.999% of the time. However, your model would not be useful. Given imbalanced data, assessing performance based on accuracy alone is not enough. This is known as the "accuracy paradox," and choosing more intelligent metrics to evaluate your models is critical.
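To make the paradox concrete, here is a minimal Python sketch (with made-up numbers matching the example above, not data from any real model) showing that always predicting "no tumor" scores a near-perfect accuracy while its recall is zero:

```python
# Hypothetical population: 100,000 people, only 1 actually has the rare tumor,
# and a lazy "model" that always predicts "no tumor".
population = 100_000
actual_positives = 1
actual_negatives = population - actual_positives

true_positives = 0                   # the model never predicts a positive case
true_negatives = actual_negatives    # every healthy person is "correctly" classified
false_negatives = actual_positives   # the one sick person is missed

accuracy = (true_positives + true_negatives) / population
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.5f}")  # 0.99999 -- looks great
print(f"recall   = {recall:.5f}")    # 0.00000 -- the model never finds the tumor
```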

Confusion Matrix

Let us explore some of the key terms relevant to this discussion. We wrote this blog after the Covid-19 pandemic, when many people started to go back to work at the office; for this scenario, that includes you. You learned that your office provides a Rapid Antigen Test facility, so you visited it. You saw four people queuing in front of the facility. Based on your observation, you made an educated guess that two of the four people have Covid. So here is your prediction: the two on the left are positive, and the two on the right are negative.

However, after the test, the results are, from left to right: positive, negative, positive, and negative, as shown in the following diagram:

So some of your predictions are correct, and some are not. Based on the type of your prediction (positive or negative) and the outcome of the test, we can use machine learning nomenclature to label them as follows:

- Positive prediction + Positive outcome = True Positive (TP)

- Positive prediction + Negative outcome = False Positive (FP)

- Negative prediction + Positive outcome = False Negative (FN)

- Negative prediction + Negative outcome = True Negative (TN)

Note that the outcome here means the actual value, and the prediction is what you (or your model) forecasted.
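As a small illustration (a sketch in Python, not CP4WAIOps code), here is how the four people in the queue map onto these labels:

```python
# The four people in the queue, from left to right.
# Your prediction: the two on the left are positive, the two on the right negative.
predictions = ["positive", "positive", "negative", "negative"]
# The actual test results, from left to right.
outcomes    = ["positive", "negative", "positive", "negative"]

labels = []
for predicted, actual in zip(predictions, outcomes):
    if predicted == "positive" and actual == "positive":
        labels.append("TP")
    elif predicted == "positive" and actual == "negative":
        labels.append("FP")
    elif predicted == "negative" and actual == "positive":
        labels.append("FN")
    else:
        labels.append("TN")

print(labels)  # ['TP', 'FP', 'FN', 'TN'] -- one of each in this example
```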

For the above Covid use case, we used binary classification. We can arrange these four labels into a 2 × 2 matrix like this:

This matrix is the basis of the data scientist's "confusion matrix". In our Covid case, we paid particular attention to the positive result, as that person then needed to be isolated and given medical attention. The diagram above highlights this focus on the positive: the blue row shows the positive predictions, and the green column shows the positive results. We will come back to this later when we talk about the F1 score.

So here is the generalized confusion matrix:

We mentioned earlier that a data science process builds a model of your environment using normal data; you then monitor the environment by ingesting current data and predicting an output. By comparing the predictions with the actual results, you can create a confusion matrix to assess your model's performance. A perfect model produces only True Positives and True Negatives (the green area on the matrix above) and minimizes, ideally down to zero, the False Positives and False Negatives (the red area).
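If you record the predictions and the actual results as labels, a library such as scikit-learn can build this matrix for you. The snippet below is a sketch that assumes scikit-learn is installed and simply reuses the four-person Covid example:

```python
from sklearn.metrics import confusion_matrix

# 1 = positive, 0 = negative, reusing the four-person example.
y_pred = [1, 1, 0, 0]  # your guesses
y_true = [1, 0, 1, 0]  # the actual test results

# With labels=[0, 1], the matrix is laid out as [[TN, FP], [FN, TP]].
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()
print(matrix)
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=1, FP=1, FN=1, TN=1
```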

Log Anomaly Detection

CP4WAIOps has a feature called Log Anomaly Detection. The tool analyzes your normal logs and builds a model. You can then use the model to analyze current data, looking for anomalies. An anomaly does not always mean something bad, but it alerts you to the beginning of a deviation.

In some of our implementations, this anomaly detection led the team to locate potential issues and fix them before they became a problem. It enables incident avoidance, the holy grail of operations. However, to be helpful, a model must be accurate. We can assess the model's accuracy by creating a confusion matrix, ensuring that even the AI does not confuse itself.

There are two log anomaly detection AI algorithms in CP4WAIOps, each of which can run independently. The two algorithms are Natural language and Statistical baseline. The Natural language log anomaly detection uses natural language techniques on a subset of your log data to discover abnormal behavior. The Statistical baseline log anomaly detection uses a statistical moving average on all of your log data to discover abnormal behavior. This algorithm automatically detects unusual patterns in logs and notifies you when they occur. Data that is used for analysis is updated every 30 minutes so this algorithm provides value quickly.

If both algorithms are enabled, then any log anomalies discovered by both will be reconciled so that only one alert is generated. In this case, the severity of the combined alert will be equal to the highest severity of the two alerts.

Site reliability engineers (SREs) and other users responsible for application and service availability are able to display log anomalies as alerts within the context of a story.

In a previous Watson AIOps customer engagement, I used a confusion matrix to validate the performance of log anomaly detection, and the customer loved it.

F1 score

Earlier, in our Covid confusion matrix, we drew a green column and a blue row. This row and column are important. Before the Covid test, we try to avoid the people in the blue row, as we suspect (predict) that they have Covid. After the test, the green column becomes important, and now we try to stay away from the people with positive results. It turns out that Van Rijsbergen describes this reliance on positive predictions and positive results in his book Information Retrieval (2nd ed.). He introduced the E score; how it became the F score, nobody knows.

As you make a guess, you will be interested in knowing how good your guess is. Since this is about Covid, you are interested in the positives. You want to know how precise your prediction is. In other words, of all your positive guesses, what percentage were correct? In machine learning terms, this is called precision. The focus is on your guess.

Precision
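In terms of the confusion matrix counts, precision is the fraction of all positive predictions that turned out to be correct:

Precision = TP / (TP + FP)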

As precision is a ratio with all positive predictions as the denominator, its maximum value is 1. So a good precision is 1 and a bad one is 0, but normally you will get something in between.

Another focus is the positive test outcome.

In the earlier Covid case, we might also be interested in the percentage of the actual positive test results that we guessed correctly. In machine learning terms, this is called recall.
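In terms of the confusion matrix counts, recall is the fraction of all actual positives that you predicted correctly:

Recall = TP / (TP + FN)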

Why is it termed recall? There are many explanations, but like the term F1, which is counterintuitive, I guess the inventors ran out of sensible English words :D.

Now we can define the F1 score. The F1 score is the harmonic mean of precision and recall.

What is a harmonic mean, you might ask? Without going into the mathematics, the harmonic mean, compared to other means such as the arithmetic mean and the geometric mean, provides a more accurate average of rates or ratios. For example, if you walk from point A to B at a speed of 6 km/h, then walk back to A at 2 km/h, what is your average speed? The harmonic mean of 3 km/h gives the correct average speed, rather than the arithmetic mean of 4 km/h.
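For the curious, the calculation behind the walking example is: harmonic mean = 2 / (1/6 + 1/2) = 2 / (2/3) = 3 km/h, whereas the arithmetic mean is (6 + 2) / 2 = 4 km/h.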

As precision and recall are ratios, the harmonic mean gives a more accurate average (remember the walking example), and the F1 score can be represented by a formula like this:
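F1 = 2 × (Precision × Recall) / (Precision + Recall)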

As precision and recall range between 0 and 1, the F1 score also ranges from 0 to 1. So, if you have perfect precision and recall, your F1 will be 1. But even Nostradamus cannot predict with a score of 1.

As a note, F1 gives equal weight to recall and precision. We say that F1 has a beta value of one. If you want to weight precision and recall differently, you can calculate other F values, such as F0.5 and F2. A smaller beta value, such as 0.5, gives more weight to precision and less to recall, whereas a larger beta value, such as 2.0 (F2), gives less weight to precision and more to recall when computing the score. However, we will not go into more detail here.
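To tie the numbers together, here is a small Python sketch, using made-up precision and recall values purely for illustration, that computes F1 and the more general F-beta score:

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """General F-beta score; beta=1 gives the familiar F1 (harmonic mean)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up example values for illustration.
precision, recall = 0.8, 0.5

print(f"F1   = {f_beta(precision, recall, beta=1.0):.3f}")  # equal weight
print(f"F0.5 = {f_beta(precision, recall, beta=0.5):.3f}")  # favors precision
print(f"F2   = {f_beta(precision, recall, beta=2.0):.3f}")  # favors recall
```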

As the F1 score is a single number, we can easily compare different models' performance. For example, one of our customers developed an in-house log anomaly model. During our demonstration of CP4WAIOps, the F1 score was instrumental in comparing the models' performance. And, of course, Watson AIOps performed better! :D

Summary

We have introduced the confusion matrix, precision, recall, and the F1 score. We have also briefly described how these concepts were useful in our customer engagements for demonstrating the performance of CP4WAIOps log anomaly detection. However, we have not covered the detailed steps of calculating the False Positives, False Negatives, True Negatives, and True Positives, which use "ground truth data." That might be the topic of my next blog!
