Performance metrics for evaluating a model on an imbalanced data set?

Sahil Kamboj · Published in Data Science Story · Apr 29, 2020

Accuracy, Precision, Recall, F1-score, ROC (TPR vs. FPR), and the AUC score

It is always confusing for newcomers to Machine Learning to decide which performance metrics to use when evaluating a model on an imbalanced data set in a classification setting.

Even for me, it was challenging to find an answer to this particular question. Only after some time in the ML/AI field did I arrive at an answer, and I decided to share it with anyone who needs the same.

What you will learn from this Article?

  • What is a Confusion matrix?
  • What is Accuracy?
  • What are Precision and Recall?
  • What is F1-score?
  • What are TPR and FPR?
  • What is the ROC curve?
  • How to interpret the AUC score in the ROC curve?
  • Which performance metrics to use in different situations?
  • Bottom Line

What is a Confusion matrix?

It is a table (of rows and columns) that is used to describe the performance of a classification model in terms of TP, TN, FP, and FN, as follows:

Suppose we have a cancer data-set in which we are supposed to predict, based on some medical report, who is going to suffer from cancer in the near future. Then TP, TN, FP, and FN can be defined as:

  • true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
  • true negatives (TN): We predicted no, and they don’t have the disease.
  • false positives (FP): We predicted yes, but they don’t actually have the disease. (Also known as a “Type I error.”)
  • false negatives (FN): We predicted no, but they actually do have the disease. (Also known as a “Type II error.”)

A confusion matrix often becomes harder to interpret when we have a multi-class classification problem.
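Before moving on, here is a minimal sketch of how such a matrix is computed with scikit-learn; the tiny y_true / y_pred arrays below are made up purely for illustration, not taken from the cancer example:

from sklearn.metrics import confusion_matrix
y_true = [1, 1, 1, 0, 0, 1]   # 1 = has the disease, 0 = does not
y_pred = [1, 0, 1, 0, 1, 1]   # what the model predicted
# for 0/1 labels, sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)          # 1 1 1 3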

Now, let's get to know the performance metrics that are built on top of these four counts.

For a better understanding of the above topics, I take a binary classification problem (1 → positive class and 0 → negative class) with the following two imbalanced scenarios:

  1. A dataset in which the number of positive points >> the number of negative points
  2. A dataset in which the number of negative points >> the number of positive points

Case-1: number of positive points >> number of negative points

I deliberately created an imbalanced dataset for each situation to get a strong hold on the concepts.

And assume that my classifier predicts every point as positive, so that every actual negative ends up as a False Positive (FP).

import numpy as np
import pandas as pd
# 10,000 positive labels and only 100 negative labels
Y = np.hstack((np.ones((10000,)), np.zeros((100,))))
# every score lies in [0.5, 0.9), so every point will be predicted positive
Y_score = np.random.uniform(0.5, 0.9, 10100)
df_imb = pd.DataFrame(data=np.array((Y, Y_score)).T, columns=['y', 'proba'])
df_imb = df_imb.sample(10100)          # shuffle the rows
df_imb.to_csv('positive greater than negative.csv', index=False)

Then I threshold the scores at 0.5 to get hard predictions (y_pred = 0 if the score < 0.5, else 1) and compute the confusion matrix:

from sklearn.metrics import confusion_matrix
df_imb['y_pred'] = (df_imb.proba >= 0.5).astype(int)   # hard 0/1 predictions
conf_mat = confusion_matrix(df_imb.y, df_imb.y_pred)   # rows = actual, columns = predicted
conf_mat

Key observation → we get lots of FPs (every actual negative is predicted as positive).

What is Accuracy? → The fraction of correct predictions out of all predictions. Range → 0–1 (the higher the better)

Accuracy score: (TP+TN)/(TP+TN+FP+FN) = 0.9900990099009901
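As a quick check, the Case-1 confusion matrix has TP = 10000, TN = 0, FP = 100, FN = 0, and plugging those counts into the formula reproduces the number above (a small sketch):

TP, TN, FP, FN = 10000, 0, 100, 0          # counts from the Case-1 confusion matrix
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)                            # 0.9900990099009901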

  • What are Precision and Recall?

Precision → “Out of all the points the model predicted as positive, how many are actually positive.” Range → 0–1 (the higher the better)

Precision: TP/(TP+FP) = 0.9900990099009901

Recall → “Out of all the actual positives, how many are predicted as positive by the model.” Range → 0–1 (the higher the better)

Recall is also called the True Positive Rate (TPR), Sensitivity, or probability of detection.

Recall: (TP)/(TP+FN) = 1.0

What is F1-score? → “It is the harmonic mean of Precision and Recall.” Range → 0–1 (the higher the better)

F1-score = 2 * (precision*recall)/(precision+recall)= 0.9950248756218906
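All three values can also be reproduced with scikit-learn on the Case-1 frame built earlier (assuming the y and y_pred columns defined above):

from sklearn.metrics import precision_score, recall_score, f1_score
print(precision_score(df_imb.y, df_imb.y_pred))   # ~0.9901
print(recall_score(df_imb.y, df_imb.y_pred))      # 1.0
print(f1_score(df_imb.y, df_imb.y_pred))          # ~0.9950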

  • What are TPR and FPR?

True Positive Rate(TPR) = Recall

False Positive Rate (FPR) → “Out of all the actual negatives, how many are predicted as positive by the model.” Range → 0–1 (the lower the better)

FPR = (FP)/(FP+TN)= 1.0
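scikit-learn has no single function for FPR at a fixed threshold, but both TPR and FPR fall straight out of the confusion matrix; a small sketch on the Case-1 frame:

from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(df_imb.y, df_imb.y_pred).ravel()
print(tp / (tp + fn))   # TPR = Recall = 1.0
print(fp / (fp + tn))   # FPR = 1.0, since every actual negative was predicted positive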

What is the ROC curve?

A receiver operating characteristic curve, or ROC curve, plots the true positive rate (TPR/Recall) on the y-axis against the false positive rate (FPR) on the x-axis at various threshold settings.

The ROC curve is used to measure how well your classifier separates the positive class from the negative class.

How is the ROC curve drawn?

We take each probability score we calculated (e.g. with LogisticRegression.predict_proba) as a threshold → compute the confusion matrix at that threshold → measure TPR & FPR, giving one point on the curve per threshold.
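In scikit-learn this threshold sweep is exactly what roc_curve does; a sketch on the Case-1 frame, using its proba column in place of predict_proba output:

from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(df_imb.y, df_imb.proba)   # one (FPR, TPR) point per threshold
print(roc_auc_score(df_imb.y, df_imb.proba))               # area under that curve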

The ROC curve can be extended to a multi-class classification problem using the one-vs-all approach.

https://en.wikipedia.org/wiki/Receiver_operating_characteristic#/media/File:ROC_curves.svg

The diagonal line represents a random model that predicts 1 or 0 at random. The area under the diagonal line is 0.5.
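You can verify that claim with purely random scores, which carry no information about the labels and therefore produce a curve hugging the diagonal (a small sketch with made-up labels):

import numpy as np
from sklearn.metrics import roc_auc_score
y = np.hstack((np.ones(5000), np.zeros(5000)))
random_scores = np.random.uniform(0, 1, 10000)   # scores completely unrelated to y
print(roc_auc_score(y, random_scores))           # ~0.5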

How to interpret the AUC score in the ROC curve?

AUC score → Area Under the ROC Curve.

For example, consider a set of predictions arranged from left to right in ascending order of logistic regression score:

Predictions ranked in ascending order of logistic regression score.

AUC represents the probability that a random positive (green) example is positioned to the right of a random negative (red) example.

AUC provides an aggregate measure of performance across all possible classification thresholds

AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
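That probabilistic reading can be checked directly by sampling many (positive, negative) pairs and counting how often the positive point gets the higher score; a sketch, where pos_scores and neg_scores are whatever scores your model assigned to the two classes:

import numpy as np

def auc_by_ranking(pos_scores, neg_scores, n_pairs=100000):
    # estimate P(score of a random positive > score of a random negative)
    pos = np.random.choice(pos_scores, n_pairs)
    neg = np.random.choice(neg_scores, n_pairs)
    return np.mean(pos > neg) + 0.5 * np.mean(pos == neg)   # ties count as half

On data like our synthetic sets, this estimate lands close to the value returned by roc_auc_score.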

ROC curve for our synthetic Data-set
AUC score: 0.4580425

Key observations → when the number of 1s >> the number of 0s:

Accuracy score: 0.9900990099009901
FPR: 1.0
Precision: 0.9900990099009901
Recall: 1.0
F1-score 0.9950248756218906
AUC score: 0.4580425

A. Metrics that don’t help to measure your model:

  • Accuracy: is very high even though TN = 0. Since the data is imbalanced (a high number of positive points), the numerator TP+TN is dominated by TP.
  • Precision: is very high. Since the data has a disproportionately high number of positive cases, the ratio TP/(TP+FP) becomes high.
  • Recall: is very high. Since the data has a disproportionately high number of positive cases, the ratio TP/(TP+FN) becomes high.
  • F1-score: is very high. The high values of Precision and Recall make the F1-score misleading.

Precision and Recall basically deal with the positive class. When a data-set inherently has lots of positive cases, Precision and Recall are not good metrics for measuring model performance.

B. Metrics that help to measure your model:

  • FPR: is high. Since our model predicts everything as 1, we get a high number of FPs, which signals that this is not a good classifier.
  • AUC score: is close to 0.5 (no better than random guessing) and represents the true picture of the evaluation here.

Case-2: the opposite situation → negative class (0) >> positive class (1)

Here we just do the opposite of the previous situation: the negatives now heavily outnumber the positives.

import numpy as np
import pandas as pd
# now only 100 positive labels and 10,000 negative labels
Y = np.hstack((np.ones((100,)), np.zeros((10000,))))
# positives are given somewhat higher scores than negatives, so the classifier
# separates the classes reasonably well but still misses roughly half of the
# positives (ranges chosen to approximately reproduce the numbers below)
Y_score = np.hstack((np.random.uniform(0.3, 0.7, 100),
                     np.random.uniform(0.1, 0.51, 10000)))
df_imb = pd.DataFrame(data=np.array((Y, Y_score)).T, columns=['y', 'proba'])
df_imb = df_imb.sample(10100)          # shuffle the rows
df_imb.to_csv('Negative greater than positive.csv', index=False)

def pred(X):
    # threshold the scores at 0.5 to get hard 0/1 predictions
    N = len(X)
    predict = []
    for i in range(N):
        if X[i] >= 0.5:
            predict.append(1)
        else:
            predict.append(0)
    return np.array(predict)

from sklearn import metrics
print(f'Accuracy score: {metrics.accuracy_score(Y, pred(Y_score))}')
print(f'F1-score: {metrics.f1_score(Y, pred(Y_score))}')
print(f'ROC AUC score: {metrics.roc_auc_score(Y, Y_score)}')
print(f'Precision: {metrics.precision_score(Y, pred(Y_score))}')
print(f'Recall: {metrics.recall_score(Y, pred(Y_score))}')
tn, fp, fn, tp = metrics.confusion_matrix(Y, pred(Y_score)).ravel()
print(f'FPR: {fp / (fp + tn)}')
Confusion matrix in 2nd situation
ROC curve for 2nd case
Accuracy score: 0.9722772277227723
FPR: 0.0232
Precision: 0.18309859154929578
Recall (TPR): 0.52
F1-score: 0.27083333333333337
ROC AUC score: 0.9276659999999999

Key observations → when the number of negatives (0) >> the number of positives (1):

A. Metrics that don’t help to measure your model:

  • Accuracy: is very high. Since the data is imbalanced (a high number of negative points), the proportion of TN is very high and the numerator TP+TN becomes high.
  • AUC score: is high, even though nearly half of the actual positives are predicted as negatives (FN), i.e. the TPR is only 0.52.
  • FPR: is low. It gets skewed by the large number of TN (the imbalance), even when the classifier makes a lot of FPs.

The AUC score doesn't capture the true picture when the data-set has a negative majority class and our focus is the minority positive class.

B. Metrics that help to measure your model:

  • Precision: is very low. Because of the high number of FPs, the ratio TP/(TP+FP) becomes low.
  • Recall: is low. Since the data has a disproportionately high number of negative cases, the classifier labels many positives as negatives, so the ratio TP/(TP+FN) becomes low.
  • F1-score: is low. The low values of Precision and Recall make the F1-score a good indicator of performance here.

Bottom Line:

  1. Use the AUC score when the positive class is the majority and your focus class is the negative class.
  2. Use Precision, Recall & F1-score when the negative class is the majority and your focus class is the positive class.
  3. The Accuracy score doesn't help much in imbalanced situations.
  4. A high FPR tells you that your classifier/model predicts a high number of False Positives.

Note:

What is “positive” and what is “negative” is purely a semantic choice in your situation: you can simply flip the labels, decide your focus class based on the given business problem, and then pick the correct performance metrics as discussed in the article above.
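If you do flip which class you call “positive”, the same metrics can simply be recomputed on the inverted labels, predictions, and scores; a minimal sketch, reusing the Y, Y_score, and pred from the Case-2 snippet:

from sklearn import metrics
Y_flipped = 1 - Y                       # the old negative class becomes the new focus class
Y_pred_flipped = 1 - pred(Y_score)      # flip the hard predictions too
print(metrics.precision_score(Y_flipped, Y_pred_flipped))   # precision w.r.t. the old negative class
print(metrics.recall_score(Y_flipped, Y_pred_flipped))
print(metrics.roc_auc_score(Y_flipped, 1 - Y_score))        # AUC is unchanged when both labels and scores are flipped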
