Confusion matrix and other metrics in machine learning

Hugo Ferreira
Hugo Ferreira’s blog
11 min read · Apr 4, 2018

Taking the confusion out of the confusion matrix, ROC curve and other metrics in classification algorithms

Dealing with the confusion matrix can be quite confusing

In my previous blog post, I described how I implemented a machine learning algorithm, the Naive Bayes classifier, to identify spam from a collection of emails. It was a simple exercise using scikit-learn, especially for a beginner like me. To measure the performance of the model, I computed one relevant classification metric, the confusion matrix. See more details in the post.

After having done this, I decided to explore other ways to evaluate the performance of the classifier. When I started to learn about the confusion matrix, accuracy, precision, recall, f1-score, ROC curve, true positives, false positives, true negatives, false negatives… I actually became more confused than anything else! What is supposed to be positive, again?

To clarify matters, I’m going to use the above spam filtering algorithm to apply all of these concepts. I’ll give their definitions, and the motivation behind them, so that they can be applied to any appropriate machine learning algorithm. Think of this post as a cheat sheet for classification metrics, with the spam filter as an applied example.

The code used in this post can be found in the accompanying notebook, which is also available on my GitHub.

Implementing Naive Bayes

After reading the data, creating the feature matrix X and target vector y, and splitting the dataset into a training set (X_train, y_train) and a test set (X_test, y_test), we use MultinomialNB from sklearn to implement the Naive Bayes algorithm.

from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)

We store the predicted outputs in y_pred, which we will use for the metrics below.

Accuracy

The first metric we are going to discuss is perhaps the simplest one: accuracy. It answers the question:

“How often is the classifier correct?”

It can be obtained simply by using the following formula:

accuracy = (number of correct predictions) / (total number of predictions)

sklearn provides the function accuracy_score to obtain the accuracy:

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))

The output for the Naive Bayes algorithm is:

0.9907834101382489

Thus, our spam filtering algorithm has an accuracy of 99%; that is, for every 100 emails it classified, 99 were correctly classified as spam or not spam.

Does this mean that our algorithm has an excellent performance? Suppose that our dataset had 99% real emails and 1% spam and that we built a classifier that predicted that all emails were real. Then, this algorithm would be 99% accurate, but horrible at classifying spam! It is important to have other ways to measure the performance of the algorithm.
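To make this concrete, here is a minimal sketch of such a trivial baseline, assuming y_test is the boolean target vector from above (True meaning ‘spam’). The baseline labels every email as ‘not spam’:

import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical baseline: predict 'not spam' (False) for every email
y_baseline = np.zeros_like(y_test, dtype=bool)
print(accuracy_score(y_test, y_baseline))

On our test set (731 real emails and 137 spam) this baseline already scores about 0.84, and on a dataset with 99% real emails it would score about 0.99, despite catching no spam at all.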

Confusion matrix

The confusion matrix is another metric that is often used to measure the performance of a classification algorithm. True to its name, the terminology related to the confusion matrix can be rather confusing, but the matrix itself is simple to understand (unlike the movies).

In this post, let’s focus on binary classifiers, as in the spam filtering example, in which each email can be either spam or not spam. The confusion matrix will be of the following form:

                    predicted: not spam    predicted: spam
actual: not spam            TN                    FP
actual: spam                FN                    TP

The predicted classes are represented in the columns of the matrix, whereas the actual classes are in the rows of the matrix. We then have four cases:

  • True positives (TP): the cases for which the classifier predicted ‘spam’ and the emails were actually spam.
  • True negatives (TN): the cases for which the classifier predicted ‘not spam’ and the emails were actually real.
  • False positives (FP): the cases for which the classifier predicted ‘spam’ but the emails were actually real.
  • False negatives (FN): the cases for which the classifier predicted ‘not spam’ but the emails were actually spam.

In order to avoid confusion, note the following: ‘true’ or ‘false’ indicates whether the classifier predicted the class correctly, whereas ‘positive’ or ‘negative’ indicates whether the predicted class was the class of interest (in this case, ‘positive’ corresponds to ‘spam’, as this is the type of email we want to predict).

The entries of the confusion matrix are the number of occurrences of each class for the dataset being analysed. Let’s obtain the confusion matrix for our spam filtering algorithm, by using the function confusion_matrix:

from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred))

The output is:

[[724   7]
 [  1 136]]

Let’s interpret these results.

  • Out of the 731 actual instances of ‘not spam’ (first row), the classifier predicted correctly 724 of them.
  • Out of the 137 actual instances of ‘spam’ (second row), the classifier predicted correctly 136 of them.
  • Out of all 868 emails, the classifier predicted correctly 860 of them.

This last comment allows us to obtain the accuracy from the confusion matrix, by applying the following formula:

accuracy = (TP + TN) / (TP + TN + FP + FN)

We obtain again that 99% of the predicted outputs were correctly classified. However, the confusion matrix allows us to have a better picture of the performance of the algorithm.
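For reference, here is a small sketch that unpacks the four counts from the confusion matrix above and recomputes the accuracy by hand (using the y_test and y_pred defined earlier):

from sklearn.metrics import confusion_matrix

# For a binary problem, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)                   # 724 7 1 136
print((tp + tn) / (tp + tn + fp + fn))  # 0.9907..., matching accuracy_score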

Precision, recall and f1-score

Besides the accuracy, there are several other performance measures which can be computed from the confusion matrix. Some of the main ones are obtained using the function classification_report:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

The output is:

              precision    recall  f1-score   support

       False       1.00      0.99      0.99       731
        True       0.95      0.99      0.97       137

 avg / total       0.99      0.99      0.99       868

Let’s go through the list:

  • Precision: it answers the question:

“When it predicts the positive result, how often is it correct?”

This is obtained by using the following formula:

precision = TP / (TP + FP)

Precision is usually used when the goal is to limit the number of false positives (FP). For example, this would be the metric to focus on if our goal with the spam filtering algorithm is to minimize the number of real emails that are classified as spam.

  • Recall: it answers the question:

“When it is actually the positive result, how often does it predict correctly?”

This is obtained by using the following formula:

recall = TP / (TP + FN)

Recall is usually used when the goal is to limit the number of false negatives (FN). In our example, that would correspond to minimizing the number of spam emails that are classified as real emails. Recall is also known as “sensitivity” and “true positive rate” (TPR).

  • f1-score: this is just the harmonic mean of precision and recall:

f1-score = 2 × (precision × recall) / (precision + recall)

It is useful when you need to take both precision and recall into account. If you try to only optimize recall, your algorithm will predict most examples to belong to the positive class, but that will result in many false positives and, hence, low precision. On the other hand, if you try to optimize precision, your model will predict very few examples as positive results (the ones with the highest probability), but recall will be very low.
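As a quick check, the three scores for the positive (‘spam’) class can also be computed directly, both with sklearn’s helper functions and from the confusion-matrix counts. This is just a sketch using the y_test and y_pred from above:

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

print(precision_score(y_test, y_pred))  # TP / (TP + FP), ~0.95
print(recall_score(y_test, y_pred))     # TP / (TP + FN), ~0.99
print(f1_score(y_test, y_pred))         # harmonic mean of the two, ~0.97

# The same precision and recall from the raw counts
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tp / (tp + fp), tp / (tp + fn))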

ROC curve

A more visual way to measure the performance of a binary classifier is the receiver operating characteristic (ROC) curve. It is created by plotting the true positive rate (TPR), or recall, against the false positive rate (FPR), which we haven’t defined explicitly yet:

FPR = FP / (FP + TN)

The question it answers is the following:

“When it is actually the negative result, how often does it predict incorrectly?”

Let’s see how we can obtain this curve. First, note that our Naive Bayes algorithm isn’t only able to predict whether each email is spam or not; it can also give us the predicted probability of that event. Recall from my previous post that the probability of an email being spam, given its features x1, …, xn, is given by Bayes’ theorem,

P(spam | x1, …, xn) = P(x1, …, xn | spam) P(spam) / P(x1, …, xn),

and we assume “naively” that the features are conditionally independent given the class,

P(x1, …, xn | spam) = P(x1 | spam) × ⋯ × P(xn | spam).

The predicted probability for the test set can be obtained in sklearn with:

y_pred_prob = nb.predict_proba(X_test)[:,1]
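To get a feel for what this array contains, we can peek at it (a quick sketch, assuming the test set from above):

print(y_pred_prob.shape)  # one probability per test email
print(y_pred_prob[:5])    # predicted probability of 'spam' for the first five emails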

Now that we have the predicted probabilities for each email, how do we decide if it is spam based on the values of those probabilities? That is, what is the threshold for the probability above which we classify the email as spam?

It seems reasonable, at least at first, to take the threshold to be 0.5. The nice thing about the ROC curve is that we can visualize how the performance of the classifier changes as we vary the threshold.
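As an illustration, here is a minimal sketch of applying a 0.5 threshold by hand to the predicted probabilities. For a binary problem, nb.predict chooses the most probable class, which is equivalent to this 0.5 cut-off, so the result should match y_pred:

import numpy as np

threshold = 0.5
y_pred_05 = y_pred_prob >= threshold  # boolean predictions at this threshold

# Sanity check: should reproduce nb.predict(X_test) for a binary classifier
print(np.array_equal(y_pred_05, y_pred))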

First, let’s plot the ROC curve for the case at hand by importing roc_curve from sklearn.metrics, which gives us the TP and FP rates:

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
%matplotlib inline

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# create plot
plt.plot(fpr, tpr, label='ROC curve')
plt.plot([0, 1], [0, 1], 'k--', label='Random guess')
_ = plt.xlabel('False Positive Rate')
_ = plt.ylabel('True Positive Rate')
_ = plt.title('ROC Curve')
_ = plt.xlim([-0.02, 1])
_ = plt.ylim([0, 1.02])
_ = plt.legend(loc="lower right")

To understand this plot, let’s analyse it in steps.

  • Suppose we take the threshold to be 0, that is, all emails are classified as spam. On the one hand, this implies that no spam emails are predicted as real emails and so there are no false negatives — the true positive rate (or recall) is 1. On the other hand, this also means that no real email is classified as real, and thus there are no true negatives — the false positive rate is also 1. This corresponds to the top-right part of the curve.
  • Now suppose that the threshold is 1, that is, no email is classified as spam. Then, there are no true positives (and thus the true positive rate is 0) and no false positives (and thus the false positive rate is 0). This corresponds to the bottom-left of the curve.
  • The rest of the curve corresponds to values of the threshold between 0 and 1, from the top-right to the bottom-left. As you can see, the curve approaches (but does not reach) the corner of the plot where the TP rate is 1 and the FP rate is 0 — that is, no spam emails are classified as real and no real emails are classified as spam. This is the point of perfect classification.
Diagram illustrating the TP and FP rates. The TP rate is the proportion of actual spam emails that are predicted as spam. The FP rate is the proportion of actual real emails that are predicted as spam.
  • If we are on the diagonal line, the proportion of actual spam emails that are predicted as spam is roughly the same as the proportion of real emails that are incorrectly predicted as spam. This is as good as random guessing, and a classifier with this performance would be pretty terrible.
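To connect the curve back to concrete numbers, here is a small sketch that prints a few of the (threshold, FPR, TPR) triples returned by roc_curve above; each triple corresponds to a single point on the plot:

# Each threshold gives one (FPR, TPR) point on the ROC curve
for f, t, th in list(zip(fpr, tpr, thresholds))[:5]:
    print(f"threshold={th:.3f}  FPR={f:.3f}  TPR={t:.3f}")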

The above points suggest that the area under the ROC curve (usually denoted by AUC) is a good measure of the performance of the classification algorithm. If it is near 0.5, the classifier is not much better than random guessing, whereas it gets better as the area gets close to 1.

We can obtain the AUC by importing roc_auc_score from sklearn.metrics,

from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, y_pred_prob)

The output for our classifier is:

0.9977033760372253

The AUC is indeed quite close to 1, and so our classifier is very good at avoiding both false negatives (spam which is classified as real) and false positives (real email which is classified as spam).

Note that, since we are taking the area under the whole ROC curve, the result is not related to any particular threshold. Therefore, a high AUC does not tell us which is the best threshold to obtain useful classification predictions.

Precision-recall curve

As discussed above, changing the threshold for the predicted probability (above which we classify the email as spam) has an effect on the performance of the algorithm. For example, the true positive rate, or recall, is 0 if we set the threshold to 1, since no email is classified as spam, so it might be a good idea to have a smaller threshold. But having a recall of 1 is not necessarily good either, as a model which classifies everything as spam has a recall of 1 but also very low precision, since there will be a lot of false positives.

A good way to illustrate this trade-off between precision and recall is with the precision-recall curve. It can be obtained by importing precision_recall_curve from sklearn.metrics:

from sklearn.metrics import precision_recall_curve 

precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
# create plot
plt.plot(precision, recall, label='Precision-recall curve')
_ = plt.xlabel('Precision')
_ = plt.ylabel('Recall')
_ = plt.title('Precision-recall curve')
_ = plt.legend(loc="lower left")

As with the ROC curve, each point in the plot corresponds to a different threshold. Threshold equal to 0 implies that the recall is 1, whereas threshold equal to 1 implies that the recall is 0, so the threshold varies from 0 to 1 from the top-left to the bottom-right of the plot. Note that the precision starts from roughly 0.7, as there aren’t many false positives (real emails classified as spam).
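As with the ROC curve, looking at a few raw points can help. The sketch below prints some (threshold, precision, recall) triples from precision_recall_curve; note that precision and recall have one more entry than thresholds (a final point with recall 0), which we drop here:

# Each threshold corresponds to one (precision, recall) point on the curve
for p, r, th in list(zip(precision[:-1], recall[:-1], thresholds))[:5]:
    print(f"threshold={th:.3f}  precision={p:.3f}  recall={r:.3f}")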

With the precision-recall curve, the closer it is to the top-right corner, the better the algorithm, and hence a larger area under the curve (AUC) indicates that the algorithm has both higher recall and higher precision. In this context, the area is known as the average precision and can be obtained by importing average_precision_score from sklearn.metrics,

from sklearn.metrics import average_precision_score

average_precision_score(y_test, y_pred_prob)

The output for our classifier is:

0.9797416137620364

Once again, I stress that this number is not connected with any particular threshold (we are averaging over all possible thresholds), and so it doesn’t tell us which is the best threshold to consider for useful classification predictions.

These are just some of the metrics we can use to measure the performance of binary classification algorithms. I found it difficult to keep track of all the names in the beginning, so keeping a list of the concepts as in this post helped me to clarify the ideas (if not to memorize them).
