Novel Performance Metrics for Anomaly Detection Algorithms

Unsupervised Blog
Balabit Unsupervised
11 min read · Jun 30, 2017

Today’s post is a direct sequel to our previous discussion about potential performance metrics that can be used for evaluating anomaly detection algorithms. Several concepts explained in the prequel return here, so make sure to check it if you want to fully understand this one.

As before, our assumption is that we have the following at hand:

  • a bunch of data points as our test set, each data point representing an activity performed by a user in an IT system,
  • their labels marking each data point as either “usual” or “unusual”,
  • their anomaly scores that were given by the anomaly detector we aim to evaluate. The scores range from 0 to 100. The higher a score of a data point is, the more anomalous it is from the detector’s point of view.

We are going to introduce performance metrics and related methods that can be used for evaluating anomaly detectors, or scoring algorithms in general. As we have not come across these ideas before, we kind of believe they are novel. Nevertheless, we are curious to see if they resemble previous work (more on that later). So if our ideas are in any way related to other works you know about, please let us know in the comments section.

Plain old percentiles

To build our way up to the new ideas, first we think about how we can apply classic descriptive statistics, specifically percentiles, to describe the distribution of the scores of data points labeled “usual” and “unusual”.

Suppose that we have score distribution data such as those shown in Figure 1 below. We calculate the 25th, 50th, 75th percentiles of the scores of both “usual” and “unusual” data points. The values of the percentiles specified above are 24, 30 and 37 among the data points labeled “usual”; and 43, 70 and 82 among the data points labeled “unusual”, respectively. In the figure, these values are shown in the chart as marks on top of the horizontal axis. All information about the percentiles is encoded in the colors and positions of those marks; e.g., the 50th percentile of the data points labeled “usual” is around 30, as shown by the position of the blue mark tagged “p50”.

Figure 1. Different percentiles of anomaly scores
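
To make this concrete, here is a minimal sketch of how such percentiles could be computed with NumPy. The score arrays below are hypothetical stand-ins we generated for illustration; they are not the data behind Figure 1.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical anomaly scores in the 0-100 range, for illustration only
    usual = rng.normal(30, 9, 1000).clip(0, 100)    # scores of "usual" activities
    unusual = rng.normal(65, 18, 100).clip(0, 100)  # scores of "unusual" activities

    for label, scores in [("usual", usual), ("unusual", unusual)]:
        p25, p50, p75 = np.percentile(scores, [25, 50, 75])
        print(f"{label:8s} p25={p25:.0f}  p50={p50:.0f}  p75={p75:.0f}")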

What does it mean that the 75th percentile of the scores given to “usual” activities is 37? It simply implies that 75% of usual activities obtain a score from the evaluated anomaly scorer that is lower than 37. Similarly, 75% of unusual activities obtain a score below 82 as indicated by the position of the red “p75” mark. Is this good news? Yes, because this means that every fourth unusual activity obtains a score that is above 82, which seems like an ideal property of the scores (i.e., high true positive rate).

Also, by seeing that the 25th percentile of the scores assigned to “unusual” activities is 43, we learn that 25% of potentially malicious events obtain a score that is below 43 — which might not be ideal if we want to avoid false negatives.

In Figure 2, the 10th and 90th percentiles are plotted. The 10th percentile of the scores of the “unusual” cases is above 60 (see the red “p10” mark). Observing this makes us happy because we learn that only roughly 10% of the anomalous events are given a score below 60. This indicates that only a small portion of anomalous events are missed by the anomaly detector. However, its false positive rate might be a bit high since 10% of normal events are given a score higher than 75, as seen from the position of the blue “p90” mark.

Figure 2. The high value of the 90th percentile of the scores of “usual” data points implies more false positives

Using the above examples, we wanted to help you become comfortable with the idea of applying percentile calculations to anomaly scores. Moreover, we aimed to point out that percentiles are easily comprehensible metrics, with the advantage of being able to indicate high false positive or false negative rates. There are indeed interesting conclusions to draw if percentile calculations are applied to the scores of data points from both the “usual” and “unusual” classes. However, wouldn’t it be nice to somehow squeeze the information from both classes into one metric?

RP distance: Percentiles in performance evaluation

The simplicity and expressive power of percentiles inspired our performance metric that we call RP distance, or reverse percentile distance.

You might wonder why this metric was born in the first place. We were in need of a metric that was able to measure not only the ranking ability of an anomaly scorer (i.e., are scores assigned to anomalous examples higher than those assigned to normal examples), but also how calibrated the anomaly scores are, i.e. loosely, how close the scores obtained by the test examples are to their best possible score. (The best possible score obtained by an example labeled “unusual” is 100, and it is 0 for an example labeled “usual”.)

Our concept is based on a rephrasing of this quality as follows: how far the scores obtained by the “usual” and “unusual” examples are from each other. With the new metric we attempt to capture this distance numerically. Our main assumption is that the greater this distance is, the closer the score of each example is to its best possible score (on average).

The RP distance at a given percentile p, or RP@p in an abbreviated format, is defined as:

RP@p = perc(dᵤₙᵤₛᵤₐₗ, 100 − p) − perc(dᵤₛᵤₐₗ, p)

Where:

  • dᵤₙᵤₛᵤₐₗ refers to the score distribution of the data points labeled “unusual”
  • dᵤₛᵤₐₗ refers to the score distribution of the data points labeled “usual”, and
  • perc(d, p) refers to the pth percentile of a score distribution d.
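
In code, the definition is essentially a one-liner on top of NumPy’s percentile function. The function name and arguments below are our own sketch, not an official implementation:

    import numpy as np

    def rp_distance(scores_unusual, scores_usual, p):
        """RP@p: the (100 - p)th percentile of the "unusual" scores
        minus the pth percentile of the "usual" scores."""
        return (np.percentile(scores_unusual, 100 - p)
                - np.percentile(scores_usual, p))

    # e.g. rp_distance(unusual, usual, 60), with score arrays such as those
    # generated in the earlier sketch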

Let’s illustrate how this works right away with a few examples. First we calculate RP@60.

Figure 3. The RP distance at percentile 60 is 21 (p40 ≈ 53, p60 ≈ 32).

The value of RP@60 associated with the score distribution plot in Figure 3 is approximately 21. This comes from the difference between the 40th percentile of the scores of “unusual” data points, which is about 53, and the 60th percentile of the scores of “usual” data points, which is about 32. By subtracting 32 from 53, we compressed two numbers into one: 21.

Probably the easiest way to explain what 21 means in this case is the following: using the evaluated scoring algorithm, every score in the top 60% of scores obtained by anomalous activities is at least 21 higher than every score in the bottom 60% of scores obtained by normal activities. As always, whether such performance is acceptable or not depends on the exact use case.

For comparison, we show another example in which the value of RP@60 is 59 — almost three times as big as in the former example. This value in itself strongly suggests that the distributions of the scores of “usual” and “unusual” data points are separated by a larger margin than previously. Consequently, the score of each test example should be, on average, closer to its ideal value than before. The corresponding diagram, Figure 4, visually proves this.

Figure 4. The RP distance at percentile 60 is 59 (p40 ≈ 69, p60 ≈ 10)

RP distance has one parameter, p. RP distances calculated at different percentiles all have their own meaning, so it is a good idea to compute several of them. That is actually what we do while evaluating our anomaly detection algorithms to learn more about performance. In the following example, we are curious about RP@50, RP@70 and RP@90 as well. The result is illustrated in Figure 5.

Figure 5. RP@p decreases as p increases

We can learn a couple of things by observing this example. Apparently, the values of differently parameterized RP distances can vary greatly even if the score distribution data is the same. While RP@50 seems reasonably large, RP@70 is much smaller, and so is RP@90. You might catch a trend here: if we increase p, RP@p decreases. This actually follows from the definition: as p grows, the (100 − p)th percentile of the “unusual” scores cannot increase, while the pth percentile of the “usual” scores cannot decrease, so their difference cannot increase.

If p is high enough, RP@p can even drop below 0, as it happens in the case of RP@90. What can we make of this? If RP@90 is below 0, it means that we cannot select 90% of anomalous activities and also 90% of normal activities in such a way that all of the selected anomalous activities obtain a score higher than those of all of the selected normal ones.

RP curve

Suppose that RP@p is calculated for each p in [0, 1, …, 100]. In order to capture all the information these differently parameterized metrics contain, we could put all of them on top of the score distribution plot just like in Figure 5. Unfortunately, the plot would then be too crowded to make sense. Instead, how about visualizing these RP distances in a different chart designed specifically for this purpose?

This is what we call an RP curve. In the RP curve of a score distribution, RP@p is shown as a function of p for each p in [0, 1, …, 100].
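
Sticking with the hypothetical score arrays from the first sketch, the whole curve can be computed and plotted in a few lines (NumPy’s percentile function accepts a vector of percentile ranks, so no explicit loop is needed):

    import numpy as np
    import matplotlib.pyplot as plt

    # usual / unusual: the hypothetical score arrays from the first sketch
    ps = np.arange(101)
    rp = np.percentile(unusual, 100 - ps) - np.percentile(usual, ps)  # RP@p for p = 0..100

    plt.plot(ps, rp)
    plt.axhline(0, color="red")  # visual aid marking where RP@p turns negative
    plt.xlabel("p")
    plt.ylabel("RP@p")
    plt.show()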

You can follow how the RP curve is constructed in Animation 1. Given a score distribution, we calculate RP@p for each p. At each step, the RP distance calculation is shown on the left, while on the right you can see the RP curve plotted from the RP@p values computed so far.

Animation 1. Construction of the RP curve

The decreasing trend of RP@p values as p increases has become visible. At a certain point, RP@p becomes negative. This point p (which in this example is 83) is marked by the intersection of the RP curve and the red horizontal line, which is just a visual aid for this purpose. At the end of the animation, the whole RP curve is visible, and it becomes easy to get an overview of the values that the differently parameterized RP distances take.

The score distribution shown in Animation 1 is pretty much ideal. Apart from a couple of false negatives, the scores of the data points labeled “usual” are, on the whole, far from the scores of the data points labeled “unusual”. You might wonder, however, about how the RP curve changes as the underlying score distribution changes for the better or for the worse. We developed Animation 2 to be able to examine this.

Animation 2. Concurrent changes in score distribution and RP curve

In Animation 2, you can see several score distribution plots and RP curves. They were produced by various anomaly scorers ranging from very good to very bad quality, including many transitions.

During this experiment, the curve is only shifted and not bent, because only the means of the distributions were varied, not their variances. Nonetheless, this is not what we want to focus on right now. Instead, we should see that the more ideal the score distribution is (i.e., the higher the scores of the data points labeled “unusual” are compared to those of the data points labeled “usual”), the greater the different RP@p values are, as visible in the plot.

An implication of greater RP@p values is a larger area under the RP curve. Simply put, the area under this curve is a kind of summary of all the possible RP@p values, so we have finally arrived at one performance metric based on percentiles, the area under the RP curve or RP-AUC. (We think of this area as relative, that is, showing what percentage of the whole area in the box is under the curve.) What is there to know about this metric?
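
The post above does not spell out the exact normalization, so the helper below is our reading of “what percentage of the whole area in the box is under the curve”: since scores live in [0, 100], the box spans p from 0 to 100 and RP values from −100 to 100. With this normalization, a flat curve at 0 covers exactly half the box.

    import numpy as np

    def rp_auc(scores_unusual, scores_usual):
        """Relative area under the RP curve: the fraction of the box
        (p in [0, 100], RP in [-100, 100]) that lies below the curve."""
        ps = np.arange(101)
        rp = np.percentile(scores_unusual, 100 - ps) - np.percentile(scores_usual, ps)
        return np.trapz(rp + 100, ps) / (200 * 100)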

RP-AUC as a performance metric

Some basic features of RP-AUC are as follows:

  • Its value is between 0 and 1; the greater, the better.
  • A scorer that outputs constant scores achieves 0.5, and so does a scorer that outputs random scores (a quick numerical check of this follows below).
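
Here is a quick numerical check of these two properties, reusing the rp_auc helper sketched above (the simulated scores are ours, for illustration only):

    import numpy as np

    rng = np.random.default_rng(1)
    const = np.full(1_000, 42.0)                        # constant scorer
    rand_u, rand_n = rng.uniform(0, 100, (2, 100_000))  # random scorer

    # rp_auc: the helper defined in the earlier sketch
    print(rp_auc(const, const))    # exactly 0.5
    print(rp_auc(rand_u, rand_n))  # approximately 0.5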

These features apply to ROC-AUC as well. But RP-AUC, as opposed to ROC-AUC, is a metric that also takes the magnitudes of scores into account, not only their ranking. In other words, this metric captures the difference between the actual and best possible value of each score. Animation 3, which should be familiar from the prequel of this blog post, supports this claim. This feature, as stated there, is an important one in our security analytics use case.

Animation 3. If the separation is perfect, ROC-AUC is always 1, but RP-AUC may vary
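
A toy illustration of this difference, again reusing the rp_auc helper sketched above together with scikit-learn’s roc_auc_score: both scorers below separate the two classes perfectly, so their ROC-AUC is 1 in both cases, yet the better-calibrated one gets a much higher RP-AUC.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    # Two scorers with perfect separation but very different margins (usual, unusual)
    scorers = {
        "barely separated": (np.full(100, 49.0), np.full(100, 51.0)),
        "well calibrated":  (np.full(100, 5.0),  np.full(100, 95.0)),
    }

    y_true = np.r_[np.zeros(100), np.ones(100)]  # 0 = usual, 1 = unusual
    for name, (usual_s, unusual_s) in scorers.items():
        y_score = np.r_[usual_s, unusual_s]
        # rp_auc: the helper defined in the earlier sketch
        print(name, roc_auc_score(y_true, y_score), rp_auc(unusual_s, usual_s))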

Speaking of the previous blog post…

You might also recall that the pAUC method discussed there is another performance metric that reflects this aspect of an anomaly scorer. Based on this, a couple of questions naturally follow. How do RP-AUC and pAUC differ from each other? And what was the gain from developing RP-AUC instead of simply using pAUC?

The latter is easier to answer. Quite simply, RP-AUC was defined well before we came across the paper by Ferri et al. that defines pAUC. That is why we constructed our own performance metric to meet our requirements. RP distance and RP-AUC were the outcomes of these efforts, and we were satisfied with their capabilities.

What do we know about the difference between pAUC and RP-AUC, then? Actually, not much effort has been made in this regard yet, so for now this remains a mystery. What seems certain so far, after a couple of simulations and experiments, is that the values produced by the two methods are equivalent for practical purposes. We still lack proof as to whether RP-AUC is identical to pAUC or not.

Summary

In this blog post we have introduced a couple of new ideas and performance metrics that help us in the evaluation of anomaly detection algorithms. They are, to the best of our knowledge, fresh approaches that can be easily used outside of the security analytics domain.

RP distance, a metric based on percentile calculation, expresses how far certain percentiles of anomaly score distributions (or probability distributions) are from each other. With its help, we can make claims like “Using the evaluated scoring algorithm, every score in the top 20% of scores obtained by anomalous activities is at least 75 higher than every score in the bottom 20% of scores obtained by normal activities.”

Differently parameterized RP distances signify different aspects of the underlying distributions. Each of these performance metrics holds value in itself. To get an overview of all of them, we visualized these differently parameterized metrics in one plot, which we call the RP curve.

The area under the RP curve, or RP-AUC, which compresses all RP distances into one number, can serve as another performance metric for evaluating scoring algorithms. RP-AUC is advantageous over ROC-AUC in the sense that it takes the magnitudes of anomaly scores (or, loosely speaking, score calibration) into account.

RP-AUC produces results similar to those of the performance metric called pAUC by Ferri et al. Further connections between the two works remain to be explored.

Originally published at www.balabit.com on June 30, 2017 by Árpád Fülöp.
