On 2017–11–03, my eight month journey into the world of open source software reached its first major milestone when I became the new maintainer of the Metrics package in R. The path I took to get there is full of fascinating statistics and the excitement of contributing to something bigger than the collection of files on my local machine. I want to tell you the story.
This article — the first in a two part series — covers the statistics of the problem at the heart of this story. I’ll provide a motivation for the problem and an intuitive explanation for understanding its solution. In the second part of this series, I’ll explain how these statistical ideas took me into the world of open source software.
Peeking Into the Rabbit Hole
In the spring of 2017, my teammate at work submitted a PR that contained a fascinating function:
When I first read over this function, I was a bit confused. After reading the reference link my teammate provided in her PR, my jaw dropped. If your jaw hasn’t dropped yet, let me try to give a quick summary of the beauty contained within this seven-line function.
Background on AUC
You can skip this section if you are familiar with AUC and the ROC curve.
Machine learning models that deal with classifying observations into one of two categories are commonly evaluated with something called AUC, which stands for the area under the curve. The curve being referenced here is the receiver operating characteristic (ROC) curve. The ROC curve plots the sensitivity of a classifier against
1 — specificity, as the decision threshold between the positive and the negative class changes. You can get the definitions of these terms from the most useful wikipedia page on the internet. Below, we can see an example of what an ROC curve looks like.
If we want to compare the performance of two different models, it can be difficult to directly compare their ROC curves. However, we can obtain a single number from the ROC curve by integrating the area beneath it. Then, we can choose whichever model has the larger AUC.
What is the function doing?
The function posted above calculates the exact AUC without calculating the ROC curve. When I saw this, I asked myself, “How does it integrate a curve without having the curve in the first place?”. At face value, it seems like some statistical voodoo is happening that shouldn’t be trusted.
I’m here to tell you that you can trust this function. Later in this blog post, I’ll talk in depth about the statistical formula it uses, but first I need to talk about why this function is useful at all.
Benchmarking Popular AUC Packages
Even though computing the AUC is a very common way to evaluate machine learning models, some of the popular packages in R that calculate ROC curves are quite slow on large datasets. For example, here is a comparison of the time it takes to calculate AUC with the ROCR package and with the function posted above.
And the results from running that code snippet.
With a data set of two million observations, the
fast_auc function is about ten times faster than the ROCR function. While 2–3 seconds may not seem like a long time, repeated uses of this function can become costly. In the PR where this function was used, my teammate was building tools that searched over different feature engineering and model building strategies. She used the AUC from each successive model to intelligently guide the search, which required computing the AUC for hundreds of models. By using this faster function, she saved a significant amount of computation time.
It’s worth saying that the biggest cause for the speedup in
fast_auc is the C++ implementation of
frank from the
data.table package. This function is significantly faster than
rank in base R. Indeed, using the
precrec package, which is also implemented in C++, to calculate the full ROC curve is actually faster than using the algorithm above with
rank instead of
frank. Thanks to Matt Dowle, Arun Srinivasan, and the
data.table team for all of their amazing work.
Looking back on this problem in optimization, an even faster strategy would have been to compute the AUC on a random sample of observations. Luckily, that solution wasn’t used in the PR. If it had been, I would have never gone down this journey.
Falling Down the Rabbit Hole
Reading the Documentation
My investigation into the statistics behind
fast_auc began with the hint left in the reference link to read the documentation of
colAUC in the
caTools package. In the description of one of its arguments I read:
Algorithm to use: “ROC” integrates ROC curves, while “Wilcoxon” uses Wilcoxon Rank Sum Test to get the same results. Default “Wilcoxon” is faster.
I knew about the Wilcoxon Rank Sum Test as a non-parametric alternative to the t-test, but didn’t recall learning about any connection with the ROC curve. Finding no relevant information within the documentation, I turned to the wikipedia page for the Mann Whitney test, which is another name for the two-sample Wilcoxon test.
Understanding the Fast AUC Function
Upon navigating to the wikipedia page, I was delighted to find a section titled
Relation to other tests with a subsection called
Area-under-curve (AUC) statistic for ROC curves.
In the first paragraph, it states:
The Mann-Whitney Test is a nonparametric test of the null hypothesis that it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.
In the rest of this section, we will unpack the definition presented above to see how we can derive the implementation for the
Let the first sample be the predictions of observations from the positive class and the second sample be the predictions from the negative class. Calculate all of the pairwise combinations between one observation from the positive class and one from the negative class. In order to perform the Mann-Whitney test, we count the number of pairs where the prediction from the positive class is greater than that from the negative class. If the two predictions are equal, our count increases by
Since calculating all of the pairwise combinations can be slow for large data sets, the Mann-Whitney test uses rankings to quickly count the pairs. When we give a ranking to an observation,
x, it is equivalent to one plus the number of other observations less than
x, where ties count as
1/2. We add one at the beginning because of the comparison of the observation with itself.
For example, let’s say that our predictions from the positive class are
0.4, 0.7 and our predictions from the negative class are
0.1, 0.4. Then, our combined predictions are
0.4, 0.7, 0.1, 0.4. Here is how we calculate the rankings for each observation.
0.4is greater than one observation (i.e.
0.1), there is one tie with another observation, and we add one for the comparison with itself, the rank is equal to
1 + 1/2 + 1 = 2.5.
0.7is greater than three observations. Therefore, the rank is equal to
3 + 1 = 4.
0.1is greater than zero observations. The rank is equal to
0 + 1 = 1.
- The rank for the second
0.4is identical to the first
If we add up the rankings from the observations of the positive class, we get the number of pairwise combinations of an observation from the positive class with another observation from either class where the first observation is larger. The crucial insight into efficiently calculating the Mann-Whitney test is recognizing that we simply need to take this number and subtract the counts from the pairs where both observations are from the positive class.
Fortunately, we have a formula for this. Imagine having N observations in the positive class. The smallest observation is greater than zero other observations plus one for the comparison to itself. The second smallest observation greater
1 + 1 = 2 observations. And the last observation is greater than
(N — 1) + 1 = N observations. In order to sum the numbers from
N we can use the formula
N * (N + 1) / 2.
Therefore, if we sum the rankings from the positive class and subtract
N * (N + 1) / 2, we can efficiently calculate the Mann-Whitney statistic. And, if you look back at the definition of the
fast_auc function, this is exactly what it is doing!
Rethinking our Interpretation of AUC
To be more precise,
fast_auc takes the count of the number of pairs where the observation from the positive class is greater than the observation from the negative class and divides it by the total number of possible pairs. Since the probability of an event happening is equal to the total number of times it happens divided by the total number of times it could happen, we can interpret the output of
fast_auc as providing the probability that a randomly chosen observation from the positive class is greater than a randomly chosen observation from the negative class.
This is worth restating. AUC, which is usually thought of as the area under a curve created by plotting the sensitivity and specificity of a classifier at different thresholds, has this interpretation:
AUC is the probability that a randomly chosen observation from the positive class is greater than a randomly chosen observation from the negative class.
Empowered with this new algorithm for computing the area under the ROC curve and the explanation for why the algorithm works, I was motivated to see how I could share this knowledge with the open source community at large. If you are interested in reading more about that part of the journey, stay tuned for Part 2 of this series.
However, for those readers who are mathematically inclined, there is still more to this section of the story. As soon as I understood how the code within
fast_auc provides its probability, I needed to understand how this relates to the typical interpretation of
auc. So, I clicked on a link that took me to the wikipedia page for the receiver operating characteristic curve. Under the section titled
Area under the Curve, it provides a derivation, which I recreate below.
Derivation of AUC
In this derivation, we are going to use the fact that the sensitivity of a classifier is equal to the true positive rate and that
1 — specificity is equal to the false positive rate. Furthermore, we are going to assume that we classify an observation as a positive instance if the score for the observation is greater than some threshold
T is a value between
f(T) be the probability density function of observations from the negative class at threshold
g(T) similarly for the positive class. Then, we can write the cumulative density functions for the corresponding probability density function as:
We have that the
1 — F(T) is the false positive rate at threshold
T, since it represents the proportion of negative instances that are greater than
T. Similarly, we have that
1 — G(T) is the true positive rate of a classifier at threshold
T, since it represents the proportion of positive instances that are greater than
If we let
1 — F(T) = v be the false positive rate, we can say that
T = H(1 — v), where
H is the inverse cumulative distribution function that maps a false positive rate to a given threshold.
Now, we can define AUC as an integral from
1 as follows:
Next, if we use
1 − F(t) = v to perform a change of variables, we have that
dv = −f(t) dt. So we get:
Since small values a false positive rate of
0 corresponds to a threshold of
Inf, I used the negative sign from
dv = −f(t) dt to swap the order of the integral. Next, we can use the definition of a cumulative density function to get:
I(s > t) is the indicator function for the event that a randomly chosen observation from the positive class is greater than a randomly chosen observation from the negative class.