Performance Metrics for Anomaly Detection Algorithms in Security Analytics


In our previous post we discussed the concept of evaluating anomaly detection algorithms that operate in the security domain. As we have seen, this task usually involves performance assessment without labels, yet we also showed a couple of ways of artificially creating labels.

In this post we are assuming that we already have:

  • a bunch of data points as our test set, each data point representing an activity performed by a user in an IT system,
  • their labels marking whether a data point is “usual” or “unusual”,
  • their anomaly scores that were given by the anomaly detector algorithm we aim to evaluate. The scores range from 0 to 100. The higher a score of a data point is, the more anomalous it is from the detector’s point of view.
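
To make the later snippets concrete, here is a minimal synthetic stand-in for such a test set. The class sizes and score distributions below are invented purely for illustration; they are not taken from our product.

    import numpy as np

    rng = np.random.default_rng(42)

    # 950 "usual" and 50 "unusual" activities; anomalies are rare
    labels = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])

    # "usual" points tend to get low scores, "unusual" ones higher scores,
    # but the two distributions overlap, as they do for real detectors
    scores = np.concatenate([
        np.clip(rng.normal(loc=30, scale=10, size=950), 0, 100),
        np.clip(rng.normal(loc=65, scale=15, size=50), 0, 100),
    ])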

Performance evaluation is about taking all test cases and comparing each one's label with its anomaly score. The output is a statement about the quality of the anomaly detector that produced the scores.

The question we intend to answer is what metrics are best suited for the evaluation of anomaly detectors in our field. Along our journey through potential solutions, we will also build the list of requirements for what makes a good anomaly detector.

Eyeball evaluation

People are naturally skilled at interpreting visual signals quickly, so the easiest way to make a judgement of the performance of a detector is by taking a look at the anomaly scores and the labels of the data points in the test set.

For this purpose, we create histograms to examine the distribution of the scores. We can actually put two overlaid histograms in the same plot, one describing the “usual” points and another one showing the “unusual” ones.
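
Using the synthetic labels and scores from the sketch above, one possible way to draw such overlaid histograms with matplotlib is the following:

    import matplotlib.pyplot as plt
    import numpy as np

    bins = np.arange(0, 105, 5)  # 5-point-wide bins covering the [0, 100] score range

    # two overlaid histograms: one for "usual" points, one for "unusual" ones
    plt.hist(scores[labels == 0], bins=bins, alpha=0.5, label='"usual"')
    plt.hist(scores[labels == 1], bins=bins, alpha=0.5, label='"unusual"')
    plt.xlabel("anomaly score")
    plt.ylabel("number of data points")
    plt.legend()
    plt.show()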

First consider Figure 1.

Figure 1. Score distribution plot indicating false negatives

This plot shows that most of the points labeled “usual” get an anomaly score between 20 and 40. Most of these scores are lower than those of the “unusual” ones, which is an ideal property. Having said that, there are also some “unusual” points whose score is also in the 20–40 range or even below that. In our security context, those are the false negatives, that is, the user activities that are abnormal and potentially malicious but left undetected by the scorer (i.e., the anomaly detector).

Figure 2. Score distribution plot indicating false positives

The histograms in Figure 2 show different characteristics. A lot of data points tagged “usual” have a score below 20, and most of the “unusual” data points have a score between 60 and 80. Again, the “usual” and “unusual” points are more or less well separated by their scores. However, we see another type of mistake occurring. Some “usual” activities have anomaly scores above 80. These are false positives: activities that are seen as anomalous by the anomaly detector despite being normal, ordinary activities.
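
As a rough illustration only (the post evaluates raw scores without fixing an alerting threshold), if we did pick a hypothetical threshold on the synthetic data from above, the false positives and false negatives could be counted like this:

    import numpy as np

    threshold = 50  # hypothetical alerting threshold, purely for illustration
    predicted_anomalous = scores >= threshold

    # "usual" points flagged as anomalous, and "unusual" points left undetected
    false_positives = int(np.sum((labels == 0) & predicted_anomalous))
    false_negatives = int(np.sum((labels == 1) & ~predicted_anomalous))

    print(f"false positives: {false_positives}, false negatives: {false_negatives}")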

Figure 3. Score distribution plot indicating the inability of the anomaly detection algorithm to identify “usual” and “unusual” data points correctly

Finally, Figure 3 shows evidence of unsatisfactory performance by an anomaly detection algorithm. A large number of both false positives and false negatives are visible, and it is clear that the algorithm is unable to do its job of giving higher scores to “unusual” activities than to “usual” ones.

A lot of performance attributes can be derived by looking at these histograms, including where the average anomaly score falls for each class (“usual” and “unusual”), the ability of the anomaly scores to clearly separate usual activities from unusual ones, and the level of false positives and false negatives.

In this respect, histograms are fantastic because they are simple to create and interpret, and contain all of this information. But of course we also need to capture the above-mentioned attributes using quantitative metrics, to gain the ability to compare algorithms automatically, or the benefit of being alerted if the algorithm’s performance goes below an acceptable level. In the rest of this post, we will describe metrics that are capable of this.

A Usual Companion: AUC

Let us start with a popular metric that is designed to represent the ranking capability of scorers, that is, to enable us to answer the question: “to what extent can we say that “unusual” data points get a higher score than “usual” ones?”. It is the well-known Area under the ROC Curve, or ROC-AUC — simply AUC from now on. (There are other metrics capturing similar qualities as AUC, such as the closely related Gini coefficient, or the Kolmogorov-Smirnov (KS) statistic, but we are not dealing with them here.)

We will not go into the details of AUC and how the famous ROC curve is constructed as there are many sources discussing it, such as Fawcett’s paper. Within the limits of this post, it should be enough to recall the one-line interpretation of AUC (quoted from the referenced work):

[the] AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.

For us, it practically means that if we randomly pick a “usual” and an “unusual” user activity from our test set, the “unusual” one will have a higher score with a probability that equals the AUC. This is essential in our product, since we want to make sure that abnormal events get highlighted with an anomaly score larger than that of normal events.
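
This ranking interpretation can be checked empirically on the synthetic data from above: the fraction of (“usual”, “unusual”) pairs in which the “unusual” point receives the higher score should match scikit-learn's roc_auc_score.

    from sklearn.metrics import roc_auc_score

    usual_scores = scores[labels == 0]
    unusual_scores = scores[labels == 1]

    # fraction of pairs where the "unusual" point outscores the "usual" one;
    # ties count as half, as is conventional
    wins = (unusual_scores[:, None] > usual_scores[None, :]).sum()
    ties = (unusual_scores[:, None] == usual_scores[None, :]).sum()
    pairwise_auc = (wins + 0.5 * ties) / (len(unusual_scores) * len(usual_scores))

    print(f"pairwise estimate: {pairwise_auc:.3f}")
    print(f"roc_auc_score:     {roc_auc_score(labels, scores):.3f}")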

The AUC value of an anomaly scorer’s performance ranges from 0 to 1. An AUC of 1 indicates a flawless anomaly scorer that perfectly separates the two classes (“usual” and “unusual” events in our case). If the AUC is below 1, some “usual” events have larger scores than some “unusual” ones do. Although not perfect, an AUC of 0.7–0.9 is generally considered acceptable, like the one in Figure 4.

Figure 4. The very same histogram as in Figure 1 but now tagged with the corresponding AUC value

On the other hand, if the AUC is 0.5, the algorithm is only as effective a classifier as random guessing, which is obviously not of much use. Below 0.5, it is even worse.

To aid the understanding of the connection between the score distribution plot and the AUC value, we have developed Animation 1. Several histograms are presented there; from the score distribution of a flawless scorer to that of the worst possible one, including many transitions between these two extremes.

Animation 1. Connection between score distribution plots and AUC values

Splendid! It is now apparent what kind of performance results in a high or low AUC value.

It seems now that with AUC at our hands we are able to tell whether the anomaly scores provided by an algorithm can distinguish between “usual” and “unusual” events. However, is that enough by itself? You can form an answer by studying Animation 2.

Animation 2. Perfect separation always yields an AUC of 1 no matter how big the distance is between the two classes

There are a couple of performance results presented here, all of them with an AUC equal to 1. Suppose that in each case, the anomaly scorer tried its best to give anomalies scores that are as high as possible and usual events scores that are as low as possible. Unfortunately, we cannot say that all of them have the same degree of utility for us. We are happier when the majority of the “unusual” points have scores around 90 than around 60. Similarly, it feels better to see the scores of the “usual” points at around 10 than around 40.

It all comes down to what we want to provide to our users, the security experts. It is not just the ranking of the events happening in their IT system but also scores associated with the anomalousness of the events. It is perceptually different for the user to see the same anomalous event with a score of 60 as opposed to a score of 90. Naturally, the latter would draw more attention and perhaps would be dealt with more seriously. It should also be kept in mind that these anomaly scores can be fed to other systems, and alerts can be defined based on them, so if we think of the scores as potential data sources, they should be even more reliable.

Thus, it is a sensible expectation to make the scores as expressive as possible, e.g., as high as possible if the anomaly scorer is sure there is something fishy going on. In machine learning terminology, this is referred to as the scores being calibrated.

Apparently, AUC fails to encapsulate this aspect of the performance of an anomaly scorer, so we are in need of further metrics. Let us explore them!

An extension to AUC: pAUC

We have recently come across a paper by Ferri et al. titled “Modifying ROC Curves to Incorporate Predicted Probabilities” that aims to address the above issue. The authors propose a metric they call pAUC (probabilistic AUC) as a solution.

Let us think of score 100 as the optimal (i.e., best possible) anomaly score that an “unusual” event can obtain and of score 0 as the optimal anomaly score for every “usual” event. From another viewpoint, a score of 0 is the pessimal (i.e., worst possible) score an “unusual” event can ever get, and a score of 100 is the pessimal score for a “usual” event.

After having given scores for the test cases, we can easily calculate the average distance between the pessimal and the actual anomaly score of every test case labeled “unusual”. Likewise, we compute the same average distance for the “usual” test cases. pAUC is the average of those two average distances. (Using this definition, pAUC will fall between 0 and 100, so if we want to convert it to compare with an AUC value between 0 and 1, we should divide it by 100.)
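
Here is a small sketch of this computation on the synthetic data from above, following the description in this post (which may differ in detail from the exact formulation in Ferri et al.):

    import numpy as np

    def pauc(labels, scores):
        # average distance of "unusual" scores from their pessimal value, 0
        dist_unusual = np.mean(scores[labels == 1] - 0)
        # average distance of "usual" scores from their pessimal value, 100
        dist_usual = np.mean(100 - scores[labels == 0])
        # mean of the two averages, on the [0, 100] scale
        return (dist_unusual + dist_usual) / 2

    # divide by 100 to put it on the same [0, 1] scale as AUC
    print(f"pAUC: {pauc(labels, scores) / 100:.3f}")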

Figure 5 illustrates the idea. In this setting, the average distance between the scores of the “unusual” data points and score 0 is about 61.14. The average distance between the scores of the “usual” data points and score 100 is around 69.81. The mean of these two numbers, i.e., the pAUC value, is roughly 65.5, or 0.655 on the [0, 1] scale, which is significantly lower than the AUC of 0.861. This is because pAUC is much stricter: it penalizes test cases whose scores are far from their ideal values, such as false negatives with very low scores and false positives with very high scores.

Figure 5. The very same histogram as in Figure 1 but now tagged with the corresponding pAUC value as well

Again, to have a better understanding of the workings of this metric, look at Animation 3. You can also observe the connection between AUC and pAUC.

Animation 3. The correlation between AUC and pAUC

It seems that these two metrics correlate highly: as one increases, so does the other. They both give random guessing a value of 0.5 (50). When measuring scoring performances better than that, pAUC is more rigorous: even if AUC is 1.0, pAUC is not perfectly happy because the scoring could have been better. It is also interesting to see what happens to pAUC when performance drops below 0.5 (50).

Now let us get back to the experiment that showed the limitation of AUC. Here is the animation of it, now annotated with the pAUC values (Animation 4).

Animation 4. Even though the separation is perfect, pAUC values vary as the distance between the two classes changes

It is clearly visible that while the score distributions shift, the AUC value stands still and the pAUC value changes. This points out that pAUC is a metric that is able to differentiate between the performances of anomaly scorers not only by measuring the ranking power of the scores but also by taking into account the distance of each score from its ideal position.

Conclusion

The ongoing evaluation of the performance of our anomaly detectors is crucial to us. It is even more essential that the metrics used for the performance assessment of these detectors are capable of indicating those qualities of the detectors that are important in the product we are embedding our anomaly hunters into.

This implies three requirements for a performance metric, which are all crucial in our opinion for the metric to perform well in the context of security analytics:

  1. Explainable. The more people who can interpret the performance results, or at least get a sense of them, the better. The results should be accessible not only to data scientists, but also to software engineers, managers, pre-sales teams and, naturally, our customers.
  2. Reveals ranking ability. Since our product wants to compile a list of events ordered by their anomalousness, we expect that the scorers assign higher anomaly scores to the events that are anomalous. The selected metrics should measure this ability.
  3. Reveals score calibration. We want to draw attention to anomalies and the magnitude of the anomaly scores plays a great role in this. If a detector catches an anomaly, it should output a score that is closer to the upper end of the [0, 100] scoring interval. Similarly, a totally normal event should have a score close to 0. A good performance metric considers this aspect. (This is just a loose definition of score calibration — this paper should be a better reference.)

ROC-AUC is perhaps the most well-known of all performance metrics but it has its shortcomings when dealing with the third requirement. Thus, relying solely on it is to be avoided in our context. However, there are other methods that are better at revealing score calibration; one of them is the presented metric called pAUC.

We are still not at the end of our journey through performance evaluation. In an upcoming post, we will introduce two concepts developed by us that we use when we need to evaluate our anomaly detection algorithms.

Originally published at www.balabit.com on May 4, 2017 by Árpád Fülöp.
