AUC-ROC metric for Classification problems explained with an example

Moving beyond Accuracy and F1 score

Mehul Gupta
Data Science in your pocket


Amid this Generative AI season, it’s important to understand the basics and not get swayed by the trending topics in Machine Learning. Hence, in this post, I will talk about a highly underrated and useful metric, AUC-ROC, for the most common baseline Data Science problem, i.e. Classification.

Assuming you have worked on at least one or two classification problems in your career, you have probably used Accuracy, Precision, Recall or the F1 score to evaluate your model’s performance.

AUC-ROC? No?

AUC-ROC is usually less talked about; I had never used it in any of my projects until recently, when I finally understood its importance and why it can deserve higher priority than other metrics in some classification cases.

Note: This post assumes you’re familiar with terms like Accuracy, Recall, Precision and the F1 score.

What is AUC-ROC?

Example ROC curves (source: https://en.wikipedia.org/wiki/Receiver_operating_characteristic#/media/File:Roccurves.png)

The ROC (Receiver Operating Characteristic) metric is a graphical plot used to assess the performance of a binary classifier. It shows the trade-off between the true positive rate and the false positive rate across different threshold settings.

  • True Positive Rate (TPR): the proportion of actual positives correctly identified by the model, i.e. TPR = TP / (TP + FN).
  • False Positive Rate (FPR): the proportion of actual negatives incorrectly identified as positive by the model, i.e. FPR = FP / (FP + TN).

A ROC curve plots TPR (y-axis) against FPR (x-axis) for various threshold values. The Area Under the ROC Curve (AUC-ROC) is a single scalar value summarizing the performance of the model.
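To make this concrete, here is a minimal sketch (with made-up labels and scores, not taken from this post) of how the curve is traced by sweeping the threshold, using scikit-learn and matplotlib:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative true labels and predicted probabilities (made up for this sketch)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_scores = np.array([0.1, 0.3, 0.45, 0.6, 0.4, 0.65, 0.7, 0.9])

# roc_curve sweeps over thresholds and returns the FPR/TPR at each one
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

plt.plot(fpr, tpr, marker='o', label=f'Classifier (AUC = {roc_auc_score(y_true, y_scores):.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guessing (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()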

Example

Assume a classification model (a diagnostic test) for detecting a rare disease.

True Positive Rate (TPR): The percentage of sick people correctly identified as having the disease.

Imagine 100 people actually have the disease, and the test correctly identifies 90 of them. The TPR would be 90%.

False Positive Rate (FPR): The percentage of healthy people incorrectly identified as having the disease.

Imagine 1000 healthy people take the test, and the test incorrectly identifies 50 of them as having the disease. The FPR would be 5%.

Hence,

  • If the TPR is high, the test is good at catching those with the disease.
  • If the FPR is low, the test rarely mislabels healthy people as sick.

The ROC curve shows how TPR and FPR change as the test threshold varies. A perfect test has a point at the top-left corner (100% TPR, 0% FPR). So, a good test will have a high TPR (catching the sick) and a low FPR (not scaring the healthy). An AUC-ROC of 0.5 is equivalent to random guessing.
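Plugging the numbers from this example into the TPR/FPR formulas above (a quick sanity check, not part of the original write-up):

# Disease-detection example from above
tp, fn = 90, 10    # 100 people have the disease, 90 are caught
fp, tn = 50, 950   # 1000 healthy people, 50 are flagged incorrectly

tpr = tp / (tp + fn)   # 0.90 -> 90%
fpr = fp / (fp + tn)   # 0.05 -> 5%
print(f'TPR: {tpr:.0%}, FPR: {fpr:.0%}')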

Okay, great. But

How is it different from metrics like Accuracy & F1?

When should I use AUC-ROC compared to Accuracy or F1?

Let’s get back to our disease-detection example.

Scenario

Disease Detection: You’re developing a test to detect a rare disease.
Data: You have 1,000 samples, where 50 are positive (disease present) and 950 are negative (disease absent). Hence, this is an imbalanced dataset.

Models:

Model A: Predicts all samples as negative.
Model B: A classifier with some ability to distinguish between positive and negative cases.

1. Accuracy

Model A
— True Positives (TP) = 0
— True Negatives (TN) = 950
— False Positives (FP) = 0
— False Negatives (FN) = 50
— Accuracy = 95%

Model B
— True Positives (TP) = 30
— True Negatives (TN) = 920
— False Positives (FP) = 30
— False Negatives (FN) = 20
— Accuracy = 95%

Both models have the same accuracy, but Model B is actually doing a better job at identifying the disease. Hence, accuracy isn’t a good metric in this case.
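The identical accuracies are easy to verify from the counts above (a small sketch):

# Accuracy = (TP + TN) / total
acc_a = (0 + 950) / 1000    # Model A: predicts everything as negative
acc_b = (30 + 920) / 1000   # Model B: catches 30 of 50 positives, with 30 false alarms

print(acc_a, acc_b)  # 0.95 0.95 -- identical, despite very different behaviour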

2. F1 Score

Model A
— Precision = 0 (no positive predictions at all, so precision is 0/0; conventionally treated as 0)
— Recall = 0 (no true positives)
— F1 Score = 0 (zero, or undefined, since both precision and recall are zero)

Model B
— Precision = 0.5
— Recall = 0.6
— F1 Score = 0.55
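Those numbers follow directly from Model B’s counts (a quick sketch; scikit-learn’s precision_score, recall_score and f1_score would give the same values on the full label arrays):

# Model B counts from above
tp, fp, fn = 30, 30, 20

precision = tp / (tp + fp)                           # 30 / 60 = 0.50
recall = tp / (tp + fn)                              # 30 / 50 = 0.60
f1 = 2 * precision * recall / (precision + recall)   # ~0.55

print(f'Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}')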

Model B has a non-zero F1 Score, showing it performs far better than Model A, but a few key points are still missing:

  • Is the default threshold (i.e. 0.5) the best one to use?
  • How well does the model perform overall? The F1 score only describes the model at one particular threshold.

3. AUC-ROC

Model A: Since it predicts all samples as negative, every sample effectively gets the same score, so its ROC curve is just the diagonal line, giving an AUC of 0.5 (no better than random guessing).

Model B: Since Model B has some ability to distinguish between positive and negative cases, its AUC-ROC will be greater than 0.5; assume it is 0.75 for now.
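The 0.75 for Model B is just an assumed value for illustration, but Model A’s 0.5 is easy to reproduce: with a constant score for every sample, roc_auc_score returns 0.5 (a small sketch with made-up arrays):

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1] * 50 + [0] * 950)   # 50 positives, 950 negatives

# Model A: the same score for every sample, i.e. no ability to rank positives higher
scores_a = np.zeros(1000)
print(roc_auc_score(y_true, scores_a))    # 0.5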

Conclusion

Accuracy: fails to differentiate between the two models.

F1 Score: shows Model B is better, but it doesn’t fully capture the performance across different thresholds. Also, it doesn’t indicate which threshold is the best to use.

AUC-ROC: Clearly shows that Model B is superior at separating positive cases from negative ones, irrespective of the threshold, making it the best metric in this context. The ROC curve can also be used to find the optimal threshold for assigning labels, via Youden’s J statistic.

Youden’s J Statistic

You can use the ROC curve to determine the optimal threshold by finding the point that maximizes the difference between the true positive rate (TPR) and the false positive rate (FPR), which is also the point closest to the top-left corner of the ROC plot. This maximized difference (TPR minus FPR) is known as Youden’s J statistic.

Sample code

import numpy as np
from sklearn.metrics import roc_curve

# Sample data: true labels and predicted probabilities
y_true = [0, 0, 1, 1] # True labels
y_scores = [0.1, 0.4, 0.35, 0.8] # Predicted probabilities

# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# Calculate Youden's J statistic
youden_j = tpr - fpr

# Find the index of the maximum Youden's J statistic
optimal_idx = np.argmax(youden_j)

# Get the optimal threshold
optimal_threshold = thresholds[optimal_idx]

print(f'Optimal Threshold: {optimal_threshold}')
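Once the optimal threshold is known, labels can be assigned by comparing each predicted probability against it (a small follow-up sketch reusing the variables above):

# Assign labels using the optimal threshold
y_pred = (np.array(y_scores) >= optimal_threshold).astype(int)
print(f'Predicted labels: {y_pred}')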

And for the AUC-ROC score:

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_true, y_scores)
print(f'AUC-ROC Score: {auc:.2f}')

With this, it’s a wrap. Hope this was useful!
