Understanding ROC AUC (part 1/2)

matt johnson · Published in Building Ibotta · Jan 31, 2019

Introduction

ROC AUC is a popular choice when evaluating the performance of a binary classifier (i.e. a machine learning algorithm that predicts an outcome of Yes/No). In this two-part article, I will derive a different interpretation of ROC AUC that, in my experience, doesn’t seem to be commonly known. This alternative interpretation is not just theoretical; it actually yields a faster way to compute ROC AUC than what’s typically used (e.g. the scikit-learn implementation). The outline of this two-part article is as follows:

  1. Introduce ROC and ROC AUC
  2. Develop an algebraic expression of ROC AUC
  3. Interpret this expression and benchmark its performance

What is ROC?

ROC is an acronym for Receiver Operating Characteristic. During World War II, Radar was used to identify enemies, but it wasn’t perfect! It took operators (humans) to interpret the output of these receivers (the physical instruments). Some operators were better than others; each operator had their own characteristic. And no matter how skilled the operator, mistakes were hard to eliminate. In fact, four cases can occur:

The receiver identifies an enemy and the operator interprets it as:

  • an enemy (True Positive or “TP” for short)
  • not an enemy (False Negative or “FN” for short)

The receiver identifies an object — that isn’t an enemy — and the operator interprets it as:

  • an enemy (False Positive or “FP” for short)
  • not an enemy (True Negative or “TN” for short)

Of course, it need not be Radar. In the abstract, suppose we are given N labeled examples where each label is binary:

0 = negative (e.g., not an enemy)

1 = positive (e.g., an enemy)

Moving from 0/1 to Scores

Suppose we construct a test to validate how good an operator is. However, instead of the operator saying enemy or not enemy, we instead request the answer be a confidence number (say on a scale of 1–10, with 10 as maximal confidence the signal is an enemy). We give the operator 7 signals and the operator reports the following respective confidences:

[6, 7, 4, 3, 4, 1, 5]

Now, suppose the hidden ground truth is respectively:

[0, 0, 1, 0, 1, 0, 1]

We can pair each ground truth label with the operator’s confidence:

[{6, 0}, {7, 0}, {4, 1}, {3, 0}, {4, 1}, {1, 0}, {5, 1}]

And use these confidences to rank the ground truth (i.e. sort by confidences):

[{1, 0}, {3, 0}, {4, 1}, {4, 1}, {5, 1}, {6, 0}, {7, 0}]

Dropping the confidence values we get our new ranked list:

[0, 0, 1, 1, 1, 0, 0]
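
Here is a minimal Python sketch of this pairing-and-sorting step (the variable names are mine, not from the original example):

```python
scores = [6, 7, 4, 3, 4, 1, 5]   # operator confidences
labels = [0, 0, 1, 0, 1, 0, 1]   # hidden ground truth

# Pair each score with its label, then sort by score (ascending).
ranked = sorted(zip(scores, labels))
ranked_labels = [label for _, label in ranked]
print(ranked_labels)  # [0, 0, 1, 1, 1, 0, 0]
```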

This is alright... but the best ordering would have been:

[0, 0, 0, 0, 1, 1, 1]

Why is this ordering better? Because it implies that there exists a confidence (i.e. threshold) that entirely separates the 0s and 1s. And this threshold would minimize both false positives and false negatives.

This might be a bit confusing, so let’s explore this thresholding a bit. Given our new ranked list of:

[0, 0, 1, 1, 1, 0, 0]

Let’s arbitrarily divide the list into two parts: the first part we’ll claim to be all 0s and the second part all 1s. For example, suppose we divide this list as:

[0, 0, 1, 1, 1, 0, 0] → [0, 0, 1, 1] [1, 0, 0]

From this, we can compute our four numbers (try it yourself):

TP: 1

FP: 2

TN: 2

FN: 2
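
Here is a small Python sketch of this counting, assuming the convention above (everything before the split is predicted negative, everything from the split onward is predicted positive); the helper name confusion_counts is mine:

```python
ranked_labels = [0, 0, 1, 1, 1, 0, 0]

def confusion_counts(ranked, split):
    """Everything before `split` is predicted 0; everything from `split` on is predicted 1."""
    pred_neg, pred_pos = ranked[:split], ranked[split:]
    tp = pred_pos.count(1)   # predicted positive, actually positive
    fp = pred_pos.count(0)   # predicted positive, actually negative
    tn = pred_neg.count(0)   # predicted negative, actually negative
    fn = pred_neg.count(1)   # predicted negative, actually positive
    return tp, fp, tn, fn

print(confusion_counts(ranked_labels, 4))  # (1, 2, 2, 2) -> TP, FP, TN, FN
```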

We could stop here. But there are a few problems with reporting these four numbers:

(1) People often prefer reporting a single number (not several)

(2) This method is sensitive to where we picked a threshold (i.e. a different threshold would yield a different set of four numbers)

(3) We prefer to report rates/percentages over raw counts, as they’re easier to compare

Let’s address these issues.

True Positive Rate

Let’s first address problem (3): we take the number of TPs and divide by the number of observations that are actually positive:

TPR = TP / (TP + FN)

This estimates the conditional probability:

P(Ŷ = 1 | Y = 1, T)

A conditional probability means that the variables on the right-hand side of the ‘|’ symbol are known. So, it answers: What’s the probability under these conditions?

where:

  • Ŷ is the operator’s prediction
  • T is the threshold (in our current example, its possible values are the index positions we can split on)
  • Y is the ground truth

Therefore, the true positive rate (or TPR for short) calculates:

Given a threshold and given that we know the observation is positive, the probability we’d label this observation positive.

We are free to choose this threshold, and we’ll have more to say on this in a moment. TPR is also known as recall or sensitivity.
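
As a quick sketch, here is TPR computed for the split at index 4 used in the example above (the variable names are mine):

```python
ranked_labels = [0, 0, 1, 1, 1, 0, 0]
split = 4                              # same split as the example above

tp = ranked_labels[split:].count(1)    # true positives:  1
fn = ranked_labels[:split].count(1)    # false negatives: 2

tpr = tp / (tp + fn)
print(tpr)  # 1 / (1 + 2) = 0.333...
```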

False Positive Rate

If you think deeply, you’ll realize that TPR cannot tell a complete story. Why? Well, consider this case. Suppose in our previous example we had set the threshold so that all observations are marked as positive:

[0, 0, 1, 1, 1, 0, 0] → [] [0, 0, 1, 1, 1, 0, 0]

From this, we can compute our four numbers:

TP: 3

FP: 4

TN: 0

FN: 0

Now we can compute TPR using our formula:

TPR = TP / (TP + FN) = 3 / (3 + 0) = 1.0

The TPR is 1.0! But all we did was classify everything as positive, and we have a lot of false positives (in fact, the maximum possible!). So maybe we should also consider the false positives. Again, instead of using the raw count, we use a rate. Remember that FPs are negative examples, so we divide the number of false positives by the total number of negative examples. We’ll call this the false positive rate (FPR). Thus, in terms of probability, FPR estimates:

P(Ŷ = 1 | Y = 0, T)

Explicitly, at a given threshold, it is calculated as:

FPR = FP / (FP + TN)

Thus, FPR calculates:

Given a threshold and given that we know the observation is negative, the probability we’d label this observation positive.
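
Here is a short sketch of both rates at the everything-is-positive split (index 0), continuing the toy example:

```python
ranked_labels = [0, 0, 1, 1, 1, 0, 0]
split = 0                              # everything predicted positive

tp = ranked_labels[split:].count(1)    # 3
fp = ranked_labels[split:].count(0)    # 4
tn = ranked_labels[:split].count(0)    # 0
fn = ranked_labels[:split].count(1)    # 0

print(tp / (tp + fn))  # TPR = 1.0
print(fp / (fp + tn))  # FPR = 1.0 -- a perfect TPR, but the worst possible FPR
```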

Thresholds

So far, we’ve skirted around the issue of choosing a threshold. We realize that different thresholds yield different pairs of TPR and FPR numbers. Which threshold do we choose? Well, what if we tried all of them?

ROC Curve

For each threshold, compute the pair (FPR, TPR). Plot these points (FPR on the x-axis, TPR on the y-axis) and we have a ROC curve!
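
A minimal sketch of that sweep on our toy ranking, treating each split index as a candidate threshold (names are mine):

```python
ranked_labels = [0, 0, 1, 1, 1, 0, 0]
n_pos = ranked_labels.count(1)
n_neg = ranked_labels.count(0)

points = []
for split in range(len(ranked_labels) + 1):
    pred_pos = ranked_labels[split:]          # everything from `split` on is predicted positive
    tp = pred_pos.count(1)
    fp = pred_pos.count(0)
    points.append((fp / n_neg, tp / n_pos))   # (FPR, TPR)

for x, y in sorted(points):
    print(f"FPR={x:.2f}  TPR={y:.2f}")
```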

From this we can ask a few questions. First, what does best look like? Recall the best we can do is rank observations as follows:

[0, 0, 0, 0, 1, 1, 1]

If you calculate the pair (FPR, TPR) for each threshold, you’ll find that every point has either FPR = 0 (while TPR climbs from 0 to 1) or TPR = 1.0 (while FPR varies between 0 and 1). Thus, a perfect classifier’s curve shoots straight up the left edge and then goes straight across the top.
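
A quick check of that claim, in the same style as the sweep above (a hypothetical verification, not from the original article):

```python
# For the perfect ordering, every swept point lies on the left or top edge of the ROC square.
perfect = [0, 0, 0, 0, 1, 1, 1]
n_pos, n_neg = perfect.count(1), perfect.count(0)

for split in range(len(perfect) + 1):
    pred_pos = perfect[split:]
    tpr = pred_pos.count(1) / n_pos
    fpr = pred_pos.count(0) / n_neg
    assert fpr == 0.0 or tpr == 1.0
```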

And what if we give random scores to each observation? Well, in that case, we expect TPR = FPR on average, and so we’d expect to observe roughly a 45-degree diagonal line from (0, 0) to (1, 1).

And what’s the worst case? You might think it’s just [1, 1, 1, 0, 0, 0, 0], but if an algorithm always ranked like that, we could simply sort in the opposite order and recover the perfect ranking. So, in practice, performance is bounded between random and perfect.

ROC AUC

But while these curves are interesting and insightful, there is a limitation: they require us to “look” at the ROC curve. It would be nice to have a single number that captures the overall performance of our classifier. This is what ROC AUC provides: it computes the area under the curve. For a perfect classifier, that area would be 1.0; for a random one, 0.5. So ROC AUC is effectively bounded between 0.5 and 1.0, and the closer to 1.0, the better.

But how do we calculate ROC AUC? The direct approach is to compute the area as an integral over the curve. But it turns out there’s a much better way to calculate it, which the follow-up article will discuss.
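
For reference, here is the standard library route with scikit-learn on our toy data (the usual approach, not the faster method the follow-up covers):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 1, 0, 1, 0, 1]   # ground truth from the example
y_score = [6, 7, 4, 3, 4, 1, 5]   # operator confidences

# Built-in computation of ROC AUC.
print(roc_auc_score(y_true, y_score))   # 0.5 for this toy ranking

# The same value, viewed as the area under the (FPR, TPR) curve.
fpr, tpr, _ = roc_curve(y_true, y_score)
print(np.trapz(tpr, fpr))
```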

Thanks for reading!

Interested in working at Ibotta? Check out https://ibotta.com/careers

Also, my technical blog can be found here: http://mattshomepage.com/
