Practical Insights: ROC Curves and Imbalanced Datasets

The use of widely accepted statistical measures can lead to questionable results if they are not applied mindfully. In this respect, the use (or misuse) of p-values in particular has gained attention in recent years. Another widely used metric is the area under the ROC curve (AUC). Although its limitations are well known and have been discussed, especially in relation to machine learning applications, in practice the metric is not always applied meaningfully. In this blog post we use a simple example to demonstrate the shortcomings of the AUC when working with imbalanced datasets, for example when modelling the probability of default in loan or mortgage portfolios.

Published in The Startup · 4 min read · Sep 8, 2020

The ROC curve is simply a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) as the discrimination threshold of a classifier is varied.

FPR is the probability of classifying a data point as positive when it is actually negative. This is essentially the probability that your classifier gives you a false alarm, and it is defined as

FPR = FP / (FP + TN) = FP / N,

where N is the total number of negatives, which is equal to the sum of false positives (FP) and true negatives (TN).

Similarly, TPR is the probability of classifying a point correctly as positive:

TPR = TP / (TP + FN) = TP / P,

where P is the total number of positives, which is equal to the sum of true positives (TP) and false negatives (FN).
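
To make these definitions concrete, here is a minimal sketch (in Python with NumPy) that traces out an ROC curve by sweeping a threshold over a handful of made-up scores. The labels and scores are assumptions chosen purely for illustration.

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])     # 1 = positive, 0 = negative
scores = np.array([0.1, 0.2, 0.25, 0.3, 0.4, 0.6, 0.55, 0.7, 0.8, 0.9])

P = y_true.sum()              # total positives = TP + FN
N = len(y_true) - P           # total negatives = FP + TN

for threshold in np.linspace(0.0, 1.0, 11):
    y_pred = (scores >= threshold).astype(int)
    TP = np.sum((y_pred == 1) & (y_true == 1))
    FP = np.sum((y_pred == 1) & (y_true == 0))
    print(f"threshold={threshold:.1f}  TPR={TP / P:.2f}  FPR={FP / N:.2f}")
```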

The area under the ROC curve (AUC) can be interpreted as the probability that the classification model ranks a randomly chosen positive example higher than a randomly chosen negative example. An AUC close to 1 is therefore often taken as confirmation that the model is good. However, as we will see, that is not necessarily the case.
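
This ranking interpretation can be checked directly. The sketch below estimates the AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score and compares it with scikit-learn's roc_auc_score; the synthetic labels and scores are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                    # random 0/1 labels
scores = y_true * 0.5 + rng.normal(0.0, 0.5, size=200)   # noisy but informative scores

pos, neg = scores[y_true == 1], scores[y_true == 0]
# Fraction of (positive, negative) pairs ranked correctly; ties count as 1/2.
pairwise_auc = np.mean((pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :]))

print(pairwise_auc, roc_auc_score(y_true, scores))       # the two numbers agree
```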

Looking at the curve below, which has an AUC of 0.93, one might naively conclude that this model does an excellent job of classifying the underlying dataset.

But let’s take a look at an example of a dataset that could give rise to this excellent ROC curve even though the underlying classifier is of poor quality.

In the image below the red dots represent the positive class and blue dots the negative class. We can assume that we are solving a binary classification problem using logistic regression.
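
A minimal sketch of such a setup is given below; the class sizes, locations and the use of scikit-learn's LogisticRegression are assumptions chosen only to mimic the figure: a small positive class sitting in a region that also contains some negatives, plus a large mass of negatives far to the left.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
x_pos = rng.normal(3.0, 1.0, size=50)                        # 50 positives (red dots)
x_neg = np.concatenate([rng.normal(2.0, 1.0, size=100),      # negatives mixed in with them
                        rng.normal(-4.0, 1.5, size=2_000)])  # many easy negatives to the left

X = np.concatenate([x_pos, x_neg]).reshape(-1, 1)
y = np.concatenate([np.ones(len(x_pos)), np.zeros(len(x_neg))])

clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(X)[:, 1]   # thresholding these scores moves the decision boundary
```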

The threshold in our classifier can be set to different values resulting in, for example, the two different boundaries shown in the following figure:

The boundary on the right yields a TPR of about 0.5 and a very low FPR (<0.05). The boundary on the left already achieves a TPR of 1.0 while still keeping the FPR low (~0.1). As we move the decision boundary from right to left, we classify more data points as positive, increasing the numbers of both true and false positives and tracing out the ROC curve shown above.

It’s important to note that for both of these boundaries we can drive the FPR towards 0 while keeping the TPR constant simply by adding more negative (blue) points to the left part of the figure. In other words, by making the dataset more imbalanced we can push the AUC as close to 1 as we like.
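
The following sketch illustrates this effect under the same assumptions: the scores of the positives and of the "hard" negatives overlapping them are held fixed, while ever more easy negatives with low scores are added, and the AUC climbs towards 1.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
pos_scores = rng.normal(0.6, 0.15, size=50)           # scores of the 50 positives
hard_neg = rng.normal(0.5, 0.15, size=100)            # negatives overlapping the positives

for n_easy in [0, 1_000, 10_000, 100_000]:
    easy_neg = rng.normal(0.05, 0.02, size=n_easy)    # negatives far from any useful threshold
    scores = np.concatenate([pos_scores, hard_neg, easy_neg])
    labels = np.concatenate([np.ones(50), np.zeros(100 + n_easy)])
    print(n_easy, round(roc_auc_score(labels, scores), 3))
# The AUC approaches 1 even though the overlap between the positives and the
# hard negatives (and hence the precision problem) is completely unchanged.
```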

We see that, judged by the AUC, varying the threshold gives great results. However, this metric completely misses the fact that the classifier is very poor when it comes to precision, i.e. the fraction of predicted positives that are actually positive: precision = TP / (TP + FP).

Indeed, both of the thresholds in the figure above result in a precision of less than 0.4, but this shortcoming is of course not reflected in the ROC curve or its AUC. One should therefore not blindly trust the AUC metric, but also investigate other statistical measures that allow a better judgement of the outcome of the analysis. In this example, a better way to capture the performance is the precision-recall curve, which shows the trade-off between precision and recall (i.e. TPR) at different thresholds.
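
As a sketch of how the two views diverge, the snippet below computes both the ROC AUC and the average precision (the area under the precision-recall curve) on the same kind of imbalanced score distribution as above; the numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_auc_score

rng = np.random.default_rng(0)
pos_scores = rng.normal(0.6, 0.15, size=50)                        # 50 positives
neg_scores = np.concatenate([rng.normal(0.5, 0.15, size=100),      # overlapping negatives
                             rng.normal(0.05, 0.02, size=10_000)]) # many easy negatives

scores = np.concatenate([pos_scores, neg_scores])
labels = np.concatenate([np.ones(50), np.zeros(len(neg_scores))])

print("ROC AUC           :", round(roc_auc_score(labels, scores), 3))            # close to 1
print("Average precision :", round(average_precision_score(labels, scores), 3))  # much lower

precision, recall, thresholds = precision_recall_curve(labels, scores)
# Plotting precision against recall exposes the trade-off that the ROC curve hides.
```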

All in all, we see that it is critical to keep in mind the meaning behind the metrics used to evaluate model performance, especially when dealing with imbalanced datasets.


Independent specialist Lab focused on topics in Quantitative Analysis/Risk Management and Data Science. www.sigmaqanalytics.com