Using AUC ROC to Evaluate Your Model (Part 1)

Understanding when to use AUC with an example focused on fulfilling the needs of a business

I recently participated in the Home Credit Default Risk Kaggle competition. Home Credit is a business that focuses on providing loans to the unbanked population. The competition was to predict whether applicants would successfully pay back a loan using the data provided by Home Credit. Competition submissions were evaluated using the Area Under the Receiver Operating Characteristic curve (AUC ROC).

My initial instinct was to focus on maximizing the AUC score of my model. Increasing your model's score by only 0.01 can bump you up quite a few places, depending on where your last submission sits on the leaderboard! However, as I thought more about the project I became interested in why they had chosen AUC as the metric to evaluate the performance of machine learning models.

Part 2 of this series focuses on the details of how AUC is calculated. This post will focus on two topics:

  1. As a Data Scientist, when should you use AUC as the metric with which to evaluate your model?
  2. What exactly is AUC?

Choosing a Model Evaluation Metric

First, a quick background on this problem. We are trying to predict whether a loan applicant will default on the loan. The data provided by Home Credit is labeled, so we know which clients defaulted and which paid back the loan. This means we will be solving a supervised classification problem. After determining the type of Data Science problem you will be tackling, you should decide how you will evaluate any Machine Learning models you develop.

For this article, I generated some example data¹ (code in the footnotes) by modifying the logistic function. This data is a greatly simplified version of the Home Credit data: the only feature is the age of the loan applicant, along with the target label. A client is labeled 1 if they defaulted and 0 if they did not. Let’s graph² the target and our feature:

It looks like applicants are more likely to default if they are younger. It also seems there are more clients who repaid the loan. The next step is to determine how many examples are in each category³:

Out of 1000 applications, 208 were not repaid. We have moderately imbalanced classes because our positive case represents 20.8% of the data. This means accuracy will be a poor metric to evaluate our model (the sketch below shows why). So let’s just use recall to make sure we label those default cases correctly and call it a day! But what does Home Credit value? Is it more important to them to correctly label the defaults? In that case, recall is an effective metric. Or would they rather prioritize labeling the successful loans correctly? Here precision would be a better metric. If you are shaky on the difference between these metrics, I would recommend you brush up on classification evaluation and the confusion matrix.
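
To see concretely why accuracy falls short here, consider a "model" that labels every applicant as 0 (repaid). This is a minimal sketch using the class counts above; the all-0 model is purely illustrative:

n_total = 1000
n_default = 208                    # positive class (defaulted)
n_repaid = n_total - n_default     # negative class (repaid)

# A model that labels everyone as 0 is useless for screening,
# yet on imbalanced data it still looks accurate.
accuracy = n_repaid / n_total      # 0.792 -- looks respectable
recall = 0 / n_default             # 0.0   -- catches no defaults at all

print(f"Accuracy: {accuracy:.3f}, Recall: {recall:.3f}")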

Focusing on the Needs of a Business

The reason we may not want to use recall or precision is that we do not know how aggressively Home Credit wants to give out loans! This is an important part of their business model and if we knew their risk tolerance we could build a custom model for their needs. AUC allows us to evaluate our model across a range of risk tolerances and provide a general model that Home Credit can then tune for their business strategy.

To emphasize this point, let’s look very generally at what a classification model is doing. It takes an example of data and calculates the probability that it belongs to the positive class (in our case, a defaulted loan). By default, the model assigns examples with a predicted probability of 0.5 or higher to the positive class and the rest to the negative class. Here is a hypothetical example:
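
As a rough illustration (the probabilities below are made up, not the output of an actual model):

import numpy as np

# Hypothetical predicted probabilities of default for five applicants
probs = np.array([0.10, 0.35, 0.50, 0.72, 0.91])

# Default behaviour: a probability of 0.5 or higher is assigned the positive class (1)
labels = (probs >= 0.5).astype(int)
print(labels)  # [0 0 1 1 1]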

Let’s look at an extreme example. What if having a single client default on a loan would be catastrophic to Home Credit’s business? In this case, one solution would be to lower the probability threshold to 0 to ensure that no defaulted loans would be incorrectly labeled. This would be the most conservative approach and our model would simply label every applicant as a 1 and no loans would be given out.

In this example, I am using the terms aggressive and conservative to describe the business’s mindset when approving loans.

In this case, with the probability threshold set to 0, the output of our model would be a 1 for every applicant (both extremes are sketched after the next paragraph).

On the other hand, if Home Credit wanted to give loans to everyone, we would raise the probability threshold to 1.0 and every applicant would be labeled as a 0 (not a default risk).
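
Continuing the illustration above, here is a minimal sketch of what those two extreme thresholds would do, again with made-up probabilities:

import numpy as np

probs = np.array([0.10, 0.35, 0.50, 0.72, 0.91])

# Most conservative: a threshold of 0 labels every applicant as a default (1),
# so no loans are given out
conservative = (probs >= 0.0).astype(int)
print(conservative)  # [1 1 1 1 1]

# Most aggressive: a threshold of 1.0 labels every applicant as safe (0),
# since none of these probabilities reach 1.0, so every loan is approved
aggressive = (probs >= 1.0).astype(int)
print(aggressive)    # [0 0 0 0 0]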

In these extreme examples, we will clearly be mislabelling much of the data. In order to quantify how our model performs at each probability threshold, we will measure the True Positive Rate (TPR) and False Positive Rate (FPR).

The TPR (also called recall) measures how many of the actual positive cases our model predicted as positive.
The FPR measures how many of the actual negative cases our model incorrectly labeled as positive. To calculate these values we need the number of True Positives (TP), False Negatives (FN), False Positives (FP), and True Negatives (TN).

TPR and FPR can be calculated using the outputs of our model:

  • TPR = TP / (TP + FN)
  • FPR = FP / (FP + TN)
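
As a quick sketch of computing these rates from a confusion matrix (the labels and predictions below are hypothetical, and I am assuming scikit-learn is available):

from sklearn.metrics import confusion_matrix

# Hypothetical true labels (1 = default) and model predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # True Positive Rate (recall): 2 / 3
fpr = fp / (fp + tn)  # False Positive Rate: 1 / 5

print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")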

Going back to our previous examples, if we are extremely conservative we will label everything as a 1 (positive), which maximizes our TPR; however, we also incorrectly label all of the negative cases, which maximizes our FPR as well. In the most aggressive case we have the opposite situation: our model labels everything as a 0 (negative), and both our TPR and FPR are 0. We can plot TPR versus FPR and see these two extremes.

The optimum threshold is the point between these two extremes that best fulfills Home Credit’s business model. Since we do not have enough information about their business, we do not know whether to prioritize precision or recall, and therefore we do not know where to set the probability threshold.

In order to evaluate models across the entire spectrum of probability thresholds, we will use AUC.

What exactly is AUC?

To calculate the AUC, we first need to graph the Receiver Operating Characteristic. Now that we have defined TPR and FPR, we can describe the ROC. Let’s look at the ROC for a model that guesses randomly, assigning each example a random score between 0 and 1. This represents a naive model that does not use any of the features of the data to make its predictions.

This straight line shows how TPR and FPR both increase as the probability threshold is lowered. Calculating the definite integral under the Receiver Operating Characteristic curve, from an FPR of 0 to an FPR of 1, gives us the Area Under the Curve. In this case, the AUC of random guessing is 0.5. Therefore, any useful Machine Learning model should score above 0.5; matching random guessing means the model has learned nothing from the features.
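
As a rough check of this baseline (assuming scikit-learn is available; the labels and scores below are simulated, not real Home Credit data):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Simulated labels with roughly the same class balance as our example data
y_true = rng.binomial(1, 0.208, size=1000)

# "Random guessing": scores that carry no information about the labels
y_scores = rng.uniform(0, 1, size=1000)

print(roc_auc_score(y_true, y_scores))  # approximately 0.5, up to sampling noise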

The AUC of a model that performs perfectly would be 1.0, because it achieves a TPR of 1.0 with an FPR of 0 (no false negatives and no false positives). A realistic model will fall somewhere between 0.5 and 1.0.

Summary

Evaluating Machine Learning models using AUC allows you to compare performance across the entire prediction probability threshold range. This allows you to confidently provide a business with the model that performs best across the entire probability spectrum. The business can then set the prediction threshold based on their risk tolerance.

In part 2, I will cover how ROC and AUC are calculated using Python and scikit-learn.

Footnotes:

[1] Data generation:

import numpy as np

# Create a list of 1000 random ages from 20 to 69 (inclusive)
ages = [np.random.randint(20, 70) for _ in range(1000)]

# Coefficients for the sigmoid (logistic) function
L = 1
k = .3
x_0 = 30
s = 10  # Coefficient to determine the magnitude of the randomness

# For each age, calculate the sigmoid value with some added noise
defaults = [L / (1 + np.exp(-k * (x_0 - age + np.random.uniform(-1, 1) * s))) for age in ages]

# Label an applicant 1 if the sigmoid value is greater than 0.5, otherwise 0
defaults = (np.array(defaults) > 0.5).astype(int)

[2] Scatter Plot:

import matplotlib.pyplot as plt
import matplotlib.style as style
style.use('fivethirtyeight')
plt.scatter(ages, defaults, alpha=.1)
plt.title('Loan Defaults versus Age')
plt.yticks([0,1])
plt.ylabel('Target')
plt.xlabel('Age (years)')

[3] Bar Graph:

pos_cases = np.sum(defaults)
neg_cases = len(defaults) - np.sum(defaults)
plt.bar([0, 1], [neg_cases, pos_cases], color = ['blue', 'orange'])
plt.xticks([0, 1])
plt.title('The Number of Examples in Each Target Category')
plt.xlabel('Target Label')
plt.ylabel('Number of applicants')