Interpreting machine learning model performance measures

Ravi Chamarthy
IBM Data Science in Practice
Dec 1, 2019
Measuring model performance metrics

As with any other software, testing and evaluating a machine learning model is essential before the model can be used for making actual predictions. Quality metrics such as Accuracy, Precision, Recall, and Sensitivity are often used to measure a model, but a single metric rarely gives the complete picture; we need to consider multiple metrics to understand the different dimensions of model behaviour. As part of this story, we cover a group of metrics for measuring the quality of a machine learning model.

Let’s say we are building a binary classification model to classify whether an email is spam (an undesired email) or ham (a desired email).

There are two possible predicted classes: “yes” and “no”. If the model predicts an email as spam, the classified label would be “yes”, and otherwise “no”, indicating it is not a spam email.

Let’s say the number of emails to classify is 295.

Out of 295 emails, our ML model has predicted “yes” for 190 emails, and “no” for 105 emails.

In reality, out of 295 emails, there are 180 spam emails and 115 ham emails.

To depict this in a diagram:

Distribution of email labels and predictions

If we were to build the confusion matrix, it would look like this:

Confusion matrix

Let’s define the above numbers in the matrix:

True Positives

True positives are all those emails that the model predicted as spam and that are in reality indeed spam. As per the manual labels there are 180 spam emails, while the number of predicted spam emails is 190. Note, in the above diagram, the difference of 10 emails between the number of manually labelled spam emails and the number of model-predicted spam emails: since only 180 emails are actually spam, at least 10 of the 190 predicted spam emails must be non-spam.

Now, for any realistic ML model, not all emails that are originally labelled as spam will be predicted as spam; some of them will be predicted as ham.

Likewise, not all emails that are originally labelled as ham will be predicted as ham; some of them will be predicted as spam.

To make our email data set example more realistic, let’s assume the model misses 10 of the 180 spam emails (predicting them as ham), so that the remaining 170 are true positives.

False Positives

These are all those emails that are predicted as spam but are not actually spam. In our case, since the total number of predicted spam emails is 190 and we have already found that 170 of them are true positives, the remaining 20 are false positives.

True Negatives

All those emails which are predicted as non-spam and are indeed non-spam. As there are 115 manually labelled non-spam emails, and we deduced that 20 emails were predicted as spam while not actually being spam, the number of true non-spam emails would be 115 - 20 = 95 true negatives.

False Negatives

All those emails which are predicted as non-spam but are actually spam. In our case, there are 10 such emails.
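To make these counts concrete, here is a minimal Python sketch (assuming NumPy and scikit-learn are available) that rebuilds the four numbers from a pair of synthetic label arrays matching the example; the arrays are illustrative stand-ins for the 295 emails, not real data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic labels: 1 = spam, 0 = ham.
# 170 TP + 10 FN = 180 actual spam; 20 FP + 95 TN = 115 actual ham.
y_true = np.array([1] * 170 + [1] * 10 + [0] * 20 + [0] * 95)
y_pred = np.array([1] * 170 + [0] * 10 + [1] * 20 + [0] * 95)

# scikit-learn lays the binary matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)  # 170 20 10 95
```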

With these numbers in hand, let’s try to find out the various performance metric numbers.

Quality Metrics

Accuracy

Accuracy (Acc) is defined as a ratio between the correctly classified samples to the total number of samples as follows:

Acc = (TP + TN) / (TP + TN + FP + FN)

In our case the accuracy of the model to correctly classify emails as spam and not-spam is:

Acc = (170 + 95) / (170 + 95 + 20 + 10) = ~0.9 (a fairly good value)

Error Rate

The complement of the accuracy is the Error rate (Err), or misclassification rate, representing the number of samples that are misclassified from both positive and negative classes, and it is calculated as follows,

Err = 1 - Acc = (FP + FN) / (TP + TN + FP + FN)
In the above case:
Err = (20 + 10) / (170 + 95 + 20 + 10) = ~0.1

The error rate of the model captures how often it misclassifies spam emails as non-spam and non-spam emails as spam.
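As a quick check, here is a small sketch (plain Python arithmetic, reusing the counts derived above) that reproduces both numbers and confirms that the error rate is the complement of accuracy:

```python
TP, FP, FN, TN = 170, 20, 10, 95

acc = (TP + TN) / (TP + TN + FP + FN)  # 265 / 295
err = (FP + FN) / (TP + TN + FP + FN)  # 30 / 295
print(round(acc, 3), round(err, 3))    # 0.898 0.102
assert abs((acc + err) - 1.0) < 1e-9   # error rate complements accuracy
```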

True Positive Rate

Sensitivity, or True Positive Rate (TPR), Hit Rate, or Recall, of a model represents the rate at which the model correctly classifies positive outcomes, and it is calculated as follows,

TPR = TP / (TP + FN) = TP / P
In our case, the sensitivity is:
TPR = 170 / (170 + 10) = ~0.94

The true positive rate is the rate at which the model is truly predicting spam emails, which in our case is around 94%, which seems good.
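The same number can be checked with a couple of lines of Python (scikit-learn's recall_score would return the same value from the label arrays built earlier):

```python
TP, FN = 170, 10

tpr = TP / (TP + FN)  # 170 / 180
print(round(tpr, 3))  # 0.944
```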

To explain a bit more, Sensitivity is the ratio of true positive outcomes to the total positive outcomes. Meaning,

If Sensitivity = 0, the model is performing poorly and cannot correctly identify even a single positive sample.

If Sensitivity = 1, the model is performing well and correctly identifies every positive sample.

So the question is: if we are already measuring the Accuracy of the model, why do we need to measure Sensitivity, and what benefit does it provide?

To answer this, let’s take another example and say we have an email sample of size 100, out of which 10 emails are manually labelled as spam and 90 as non-spam, and the model is trained with this data. Because the data is imbalanced (90 labels versus 10 labels), the model ends up predicting all 100 emails to be non-spam.

If we create a confusion matrix for this data, we have

True Positive = 0
True Negative = 90
False Positive = 0
False Negative = 10

Then the accuracy of the model is 90/100 = 90%, and one can say that the model is highly accurate.

But let us now check the Sensitivity = True Positive / Total Positive = True Positive / (True Positive + False Negative) = 0 / ( 0 + 10 ) = 0.

With Sensitivity = 0, we can say that the model is performing poorly and cannot correctly classify even a single positive sample. This shows how important it is to measure sensitivity to understand model performance.
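Here is a minimal sketch of that imbalanced scenario (assuming scikit-learn is available; the labels are synthetic): accuracy looks excellent while recall exposes the problem.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 10 spam (1) and 90 ham (0) emails; the model predicts ham for everything.
y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.9 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0 -- not a single spam email is caught
```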

True Negative Rate

Specificity, True Negative Rate (TNR), or Inverse Recall, of a model represents the rate at which the model correctly classifies negative outcomes, and it is calculated as follows,

TNR = TN / (FP + TN) = TN / N
In the above case:
TNR = 95 / (20 + 95) = ~0.83

The true negative rate is the rate at which the model is truly predicting non-spam emails, which in our case is around 83%. That is an okay value, though a higher value would be better here.

If Specificity = 0, the model is performing poorly and cannot correctly classify even a single negative sample.

If Specificity = 1, the model is performing well and correctly classifies every negative sample.

Now, let’s take the 100-email example from above again. Specificity would be 90 / (90 + 0) = 1, or 100%, meaning the model is classifying the negatives correctly.
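A quick sketch of both specificity values, using the counts from the two examples above:

```python
TN, FP = 95, 20

tnr = TN / (TN + FP)  # 95 / 115, the original 295-email example
print(round(tnr, 3))  # 0.826

# The imbalanced 100-email example: TN = 90, FP = 0
print(90 / (90 + 0))  # 1.0
```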

False Positive Rate

False positive rate (FPR), False Alarm Rate (FAR), or Fallout, represents the ratio of incorrectly classified negative samples (false positives) to the total number of negative samples. It is the proportion of negative samples that were incorrectly classified, and it complements the true negative rate.

FPR = 1 - TNR = FP / (FP + TN) = FP / N

In the original example, it would be the rate at which the model is predicting spam mails that are not actually spam emails, which when computed would be:

FPR = 20 / (20 + 95) = ~0.17

The ideal value for the False Positive Rate should be 0, or close to 0, so that the model does not raise false alarms.

False Negative Rate

False negative rate (FNR) or Miss Rate is the proportion of positive samples that were incorrectly classified, which complements the true positive rate measure and it is defined as

FNR = 1 - TPR = FN / (FN + TP) = FN / P

In our case, it would be the rate at which the model predicts emails as non-spam when they are actually spam, which when computed would be:

FNR = 10 / (10 + 170) = ~0.056

The fewer the false negatives compared to the total positives, the better the outcome. In other words, the ideal value should be 0, or close to 0, so that the model does not miss any positive outcomes.
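Both rates can be verified with the same counts; note how they complement the true negative rate and the true positive rate computed earlier:

```python
TP, FP, FN, TN = 170, 20, 10, 95

fpr = FP / (FP + TN)  # 20 / 115, equals 1 - TNR
fnr = FN / (FN + TP)  # 10 / 180, equals 1 - TPR
print(round(fpr, 3), round(fnr, 3))  # 0.174 0.056
```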

Positive predictive value (PPV)

Predictive values (positive and negative) reflect the performance of the prediction.

Positive predictive value (PPV), or Precision, represents the ratio of positive samples that were correctly classified to the total number of predicted positive samples.

PPV = Precision = TP / (FP + TP)
In our case, it would be the rate at which the model is correctly classifying spam emails.
PPV = 170 / (20 + 170) = ~0.89 (not a bad value actually)

As the ratio indicates, it is better to get more true positive outcomes compared to false positives. Meaning, if a set of emails are spam emails, we expect the model to indeed predict them as spam. The ideal value should be 1, or close to 1, so that the model's positive predictions are almost always correct.

False Discovery Rate

The False Discovery Rate (FDR) is, in this example, the rate at which the model classifies emails as spam that are not actually spam. We want the false discovery rate to be low, because we do not want the model to classify a non-spam email (say, an email from our personal banker) as spam.

It is calculated as follows:

FDR = FP / (TP + FP)
FDR = 20 / (20 + 170) = ~0.1.

Ideally it should be as close to 0 as possible.
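Since precision and the false discovery rate share the same denominator, a short sketch makes the complement relationship explicit (scikit-learn's precision_score would give the same PPV from the label arrays built earlier):

```python
TP, FP = 170, 20

ppv = TP / (TP + FP)  # precision: 170 / 190
fdr = FP / (TP + FP)  # false discovery rate: 20 / 190
print(round(ppv, 3), round(fdr, 3))   # 0.895 0.105
assert abs((ppv + fdr) - 1.0) < 1e-9  # FDR complements precision
```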

Negative predictive value (NPV)

Negative predictive value (NPV), or Inverse Precision, or True Negative Accuracy (TNA) measures the ratio of negative samples that were correctly classified to the total number of negative predicted samples.

NPV = TN / (FN + TN)

In our case, it would be the rate at which the model is correctly classifying non-spam emails.

NPV = 95 / (10 + 95) = ~0.905 (actually a good value).

Ideally it should be as close to 1 as possible.

False Omission Rate

For our use case here, the False Omission Rate (FOR) is the rate at which the model predicts emails as non-spam when they are actually spam. We want the false omission rate to be low, because we do not want to see a promotional email in our inbox that should ideally have been filtered into the spam box.

FOR = FN / (TN + FN)
FOR = 10 / (95 + 10) = ~0.095 (actually a good value)

Ideally it should be as close to 0 as possible.
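As with precision and FDR, the NPV and the false omission rate split the same denominator, which a short sketch confirms:

```python
TN, FN = 95, 10

npv = TN / (TN + FN)         # 95 / 105
f_omission = FN / (TN + FN)  # 10 / 105
print(round(npv, 3), round(f_omission, 3))   # 0.905 0.095
assert abs((npv + f_omission) - 1.0) < 1e-9  # FOR complements NPV
```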

If you notice, the false discovery rate (FDR) and the false omission rate (FOR) measures complement the PPV and the NPV, respectively.

Matthews correlation coefficient (MCC)

This metric represents the correlation between the observed and predicted classifications, and it is calculated directly from the confusion matrix, using the TP, TN, FP, and FN counts, as in the equation below. MCC is not a biased metric and is robust enough to work properly even on highly imbalanced data.

MCC returns a value between -1 and +1:

A coefficient of +1 represents a perfect model prediction,

A coefficient of 0 represents a prediction no better than random,

A coefficient of -1 indicates total disagreement between the prediction and the manual label.

MCC = (TP * TN - FP * FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

In our case it would be:

MCC = ((170 * 95) - (20 * 10)) / sqrt( 190 * 180 * 115 * 105 ) = ~0.785 (which is a fairly good value).
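Here is a small sketch of the same calculation in Python; scikit-learn's matthews_corrcoef would return the same value when given the label arrays built earlier:

```python
import math

TP, FP, FN, TN = 170, 20, 10, 95

numerator = TP * TN - FP * FN  # 16150 - 200 = 15950
denominator = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(round(numerator / denominator, 3))  # 0.785
```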

Conclusion

As a summary, we have defined and discussed a set of quality metrics to measure model performance. Each metric provides a different view of model performance, and a data scientist has to consider multiple metrics before deciding that a model can be used for making actual predictions at runtime.

I hope this article was useful to you. If you have any questions, please leave a comment.

Thank You!

Ravi Chamarthy
STSM, IBM watsonx.governance - Monitoring & IBM Master Inventor