A Beginner’s Guide to ROC Curves and AUC Metrics

Michael Scognamiglio
The Startup · Sep 21, 2020

Blog Structure

As the title suggests, the purpose of this blog is to build a basic but solid understanding of ROC and AUC and how they relate to binary classification models. The blog is organized as follows.

  1. Define what an ROC curve is and what it means
  2. Define the parameters of an ROC curve and build intuition for them
  3. Define what AUC means and how it relates to ROC
  4. Apply the lessons learned to a simple Kaggle heart disease dataset to reinforce our understanding

So what are ROC and AUC anyway?

To put it simply, ROC (receiver operating characteristic) curves and AUC (area under the curve) are measures used to evaluate the performance of classification models. An ROC curve is a graph that shows the performance of a classification model at all unique classification thresholds. The graph uses the following parameters on its axes:

  1. True Positive Rate
  2. False Positive Rate
ROC curve
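
If you want to see those TPR/FPR pairs concretely, here is a minimal sketch that traces an ROC curve with scikit-learn. The labels and scores below are made up purely for illustration and are not tied to any particular dataset.

```python
# A minimal sketch: tracing an ROC curve from predicted probabilities.
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative data: true 0/1 labels and the model's predicted
# probabilities for the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

# roc_curve sweeps every unique threshold and returns one (FPR, TPR)
# pair per threshold; plotting tpr against fpr gives the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(thresholds, fpr, tpr)))
```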

As you can see in the graphs above, the errors of our classification model depend on our threshold selection. Ideally, if there were no overlap between the red curve (positive class) and the green curve (negative class), a suitable threshold would let our model perfectly distinguish between the two classes. Thus, we would eliminate both Type I errors (false positives) and Type II errors (false negatives).

However, in real-world examples this is very unlikely; there is usually a tradeoff between Type I and Type II errors. As you can see in the first graph, we can increase our threshold to decrease our false positive count, reducing Type I errors, but at the same time we increase our count of false negatives, i.e., more Type II errors. The opposite happens if we instead decrease the threshold.
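
To make the tradeoff concrete, here is a small illustrative sketch (with made-up labels and scores) that counts false positives and false negatives at a few different thresholds.

```python
# A small sketch of the threshold tradeoff, using made-up scores.
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)
    false_positives = np.sum((y_pred == 1) & (y_true == 0))  # Type I errors
    false_negatives = np.sum((y_pred == 0) & (y_true == 1))  # Type II errors
    print(threshold, false_positives, false_negatives)

# Raising the threshold lowers the false-positive count but raises
# the false-negative count, and vice versa.
```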

ROC Parameters Definitions

To understand ROC further, let’s define its parameters. True Positive Rate (TPR) is a synonym for recall: the ratio of true positives to all actual positives. It ranges from 0 to 1, so TPR can be thought of as a measure of how well the model identifies the positive class.

True Positive Rate (aka Recall): TPR = TP / (TP + FN)

The FPR, or false positive rate, measures how often the model incorrectly predicts the positive class for observations that are actually negative. In other words, when the actual class is negative, the FPR tells you how often the model misclassifies it as positive.

FPR (False Positive Rate): FPR = FP / (FP + TN)
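
For reference, both rates can be read straight off a confusion matrix. The sketch below uses scikit-learn with made-up labels and predictions, purely for illustration.

```python
# A minimal sketch of computing TPR and FPR from a confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 0, 1, 1, 0]  # actual labels (illustrative)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # model predictions (illustrative)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # True Positive Rate (recall): share of actual positives caught
fpr = fp / (fp + tn)  # False Positive Rate: share of actual negatives flagged as positive
print(tpr, fpr)
```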

Area Under the Curve (AUC)

AUC (Area Under the ROC Curve)
Perfect ROC curve (AUC = 1)

AUC represents the entire two-dimensional area under the ROC curve. It can be interpreted as a measure of separability: if you randomly select one positive and one negative observation, the AUC is the probability that your classifier assigns a higher predicted probability to the positive one.

AUC ranges from 0 to 1. An AUC of 1 is a perfect score: the model ranks every positive observation above every negative one, so there is a threshold with absolutely no misclassifications. An AUC of 0 means the exact opposite: every prediction is ranked the wrong way around, so we would effectively classify 100% of the positive class as negative, or vice versa. An AUC of 0.5 corresponds to a model that ranks no better than random guessing.
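
That ranking interpretation is easy to check numerically. The sketch below (made-up scores, scikit-learn assumed) compares roc_auc_score against the fraction of positive/negative pairs in which the positive observation gets the higher score; the two numbers agree.

```python
# A small sketch of the ranking interpretation of AUC (illustrative data).
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
# Fraction of (positive, negative) pairs ranked correctly (ties count as 0.5).
pairwise = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])

print(roc_auc_score(y_true, y_score), pairwise)  # the two values match
```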

AUC Advantages

There are also significant advantages to using AUC:

  1. AUC is scale-invariant. It is based on how well predictions are ranked, not on their absolute values, so transformations of the predictions that do not change their relative ranking will not change AUC (a quick sketch follows this list).
  2. AUC is classification-threshold-invariant. As mentioned before, it summarizes performance across all possible thresholds rather than at any single one.
  3. AUC is useful even when there is high class imbalance.
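
Here is the quick sketch of scale invariance mentioned in point 1 (made-up scores, scikit-learn assumed): rescaling or applying a monotonic transformation to the scores leaves AUC unchanged because the ordering does not change.

```python
# A quick sketch of AUC's scale invariance (illustrative data).
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

print(roc_auc_score(y_true, y_score))           # original scores
print(roc_auc_score(y_true, 10 * y_score + 3))  # rescaled and shifted
print(roc_auc_score(y_true, y_score ** 2))      # monotonic (scores are non-negative)
# All three calls print the same AUC because the ranking of the scores is identical.
```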

However, AUC is best used when we care about minimizing false positives and false negatives equally. It is less useful when the costs of these two errors are not equal. For example, AUC is not a good fit for spam email classification.

In this case, minimizing false positives would be the priority. A false positive here means marking a legitimate email as spam, and as you can imagine, you would not want your important emails being sent to the spam folder. A false negative, on the other hand, means marking a spam email as not spam. That is not ideal either, but it is less detrimental, since deleting spam from your inbox is easier than searching through your entire spam folder for an important message.

Thus, instead of AUC, the F1 score, which combines precision and recall, would be a better metric.
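
For completeness, here is a minimal sketch of how precision, recall, and F1 are computed with scikit-learn, using made-up labels; in the spam example, the positive class would be spam.

```python
# A minimal sketch of precision, recall, and F1 (illustrative labels).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

precision = precision_score(y_true, y_pred)  # TP / (TP + FP): how many flagged items are truly positive
recall = recall_score(y_true, y_pred)        # TP / (TP + FN): how many positives are caught
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
print(precision, recall, f1)
```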

Heart Disease Example

To get a better understanding of ROC and AUC in classification models, let’s look at an example. I found a heart disease dataset on Kaggle and built a simple classification model to put these concepts into practice.

The dataset consists of about 10 features describing patients’ physical health, such as age, sex, blood pressure, and cholesterol levels. The target variable is whether or not the patient suffers from heart disease, so in this example the positive class is a patient with heart disease.

Left figure: ROC curves for Logistic Regression and Decision Tree models. Right figure: Confusion matrix for the Logistic Regression model (the better performer).

The figures above display the results on the test set of the heart disease dataset. The right figure is a confusion matrix, where the diagonal from top left to bottom right counts the predictions we got exactly right, and the other diagonal counts our Type I and Type II errors. The left graph contains ROC curves for the two models and shows how different thresholds affect their performance. The better model turned out to be Logistic Regression, with an AUC of 0.87 versus 0.75 for the Decision Tree. Using cross-validation with k = 10 on the Logistic Regression model, I obtained an AUC of 0.9, which supports the original score.
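
For readers who want to reproduce something similar, here is a rough sketch along these lines, not the exact code behind the figures above. It assumes the Kaggle heart disease CSV is saved as heart.csv with a binary target column, and that pandas and scikit-learn are installed; the file name, column name, and split settings are assumptions on my part.

```python
# A rough sketch of comparing two classifiers by AUC on the heart disease data.
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix

df = pd.read_csv("heart.csv")                      # assumed file name
X, y = df.drop(columns="target"), df["target"]     # assumed target column
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]      # probability of the positive class
    print(name, "test AUC:", roc_auc_score(y_test, probs))

# Confusion matrix for the logistic regression model on the test set.
log_reg = models["Logistic Regression"]
print(confusion_matrix(y_test, log_reg.predict(X_test)))

# 10-fold cross-validated AUC for logistic regression, as a sanity check.
print(cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=10, scoring="roc_auc").mean())
```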

Conclusions

Please note that in this case we assumed false positives and false negatives are equally important. In practice, this may not be the case: a doctor may care more about Type II errors than Type I errors, since a Type II error would leave a patient unaware of a serious illness, while a Type I error would lead a patient to believe they have heart disease when they actually do not. Thus an F1 score would be a better metric to use if Type II errors are prioritized. This model received an F1 score of 0.82, which seems acceptable, but I would like to do more testing to see if it can be improved further.
