The Confusion Matrix — Explained like you were five

Racheal Appiah-kubi · May 7, 2023 · 12 min read

Testing how well our model is performing is crucial in any machine learning pipeline. To do this, we use performance metrics that evaluate the quality of our model. These metrics are task-specific, i.e. different metrics apply to classification, regression, and clustering. In this article, we are going to look at the confusion matrix, some of its terminology, and the scores derived from it.

A confusion matrix is a matrix (table) that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It is actually called a confusion matrix because it shows you exactly what your model is confused about, so you can either fix it or understand your model’s limitations. Many useful metrics such as accuracy, precision, recall (sensitivity), and F1 score are derived from it.

A confusion matrix is used to evaluate classification problems (supervised learning). Read my article Machine Learning Algorithms for absolute beginners for more on the types of learning in ML.

In this article, I will break down the confusion matrix and show how understanding it and its associated terminology can help you make sense of performance scores like accuracy, precision, recall, and F1.

A confusion matrix

The matrix above represents a binary classification problem. This means we are predicting only two things: whether someone is hungry or not hungry, whether an image is a cat or a dog, whether an email is spam or not, or whether a customer will churn or not. To understand any given confusion matrix, you need to be clear on a few things. We will use the image above to explain the key concepts below:

  1. First, you need to know your positive and negative classes.
  2. The total number of predictions that were made.
  3. And finally, what each column and row stands for.

Before reading any confusion matrix, you need to understand what your positive and negative classes are. What I mean is, you need to understand what your desired outcome is. For example, if you are building a model to predict whether someone is hungry or not hungry, you need to decide whether hungry is your positive class or your negative class. In this case, it makes sense to make hungry the positive class and not hungry the negative class because of our use case. If we were building a model to predict whether someone is pregnant, we could make pregnant the positive class and not pregnant the negative class.

Sometimes, however, your classes may not have any inherent direction. For example, if you are building a model to predict whether an image is a cat or a dog, there is no inherent positive or negative class, so the choice is yours as a machine learning engineer to decide which will be your positive class. It could be either, but you need to decide what your positive class is (cat or dog) and document it.

The next thing to note is that the four numbers add up to the total number of predictions made. In the matrix above, n = 240 represents the total number of predictions, that is, 110 + 45 + 10 + 75.

The last thing to note is that in Python’s machine learning package scikit-learn, the columns represent what the model predicted, while the rows represent what the labels actually are. The columns and rows could be switched in a different machine learning package or in whatever software you are using to generate the confusion matrix. However, the orientation we are discussing is what scikit-learn uses.
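
For a concrete picture, here is a minimal sketch (with made-up labels, not the data from the matrix above) of how scikit-learn lays out a binary confusion matrix:

from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions for the hungry / not hungry example
y_true = ["hungry", "hungry", "not hungry", "not hungry", "hungry"]
y_pred = ["hungry", "not hungry", "not hungry", "hungry", "hungry"]

# Rows are the actual labels, columns are the predicted labels.
# With string labels, scikit-learn sorts them alphabetically: "hungry" first, then "not hungry".
print(confusion_matrix(y_true, y_pred))
# [[2 1]
#  [1 1]]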

CONFUSION MATRIX TERMINOLOGIES

For a binary classification task, the intersection of what was predicted and what the value actually is falls into one of four groups.

From our confusion matrix, we have decided that our positive class is hungry and our negative class is not hungry. You could flip it if you wanted, but for the purposes of simplicity, we will maintain this.

A confusion matrix
  1. True Positive (TP): Here, the model predicted hungry (positive) and the person is actually hungry (positive). In that case, we say it is truly positive. Going by our matrix, the TP from the matrix above is 110.
  2. True Negative (TN): In this case, the model predicted not hungry (negative) and the person is also not hungry (negative). That is a truly negative prediction. The TN from the matrix above is 75.
  3. False Positive (FP): In this case, the model predicted hungry (positive), but that is false; the actual value is not hungry (negative). The model has falsely predicted a positive value. The FP from the matrix above is 10.
  4. False Negative (FN): In this case, the model predicted not hungry (negative), but this is false; the actual value is hungry (positive). The model has falsely predicted a negative value. The FN from the matrix above is 45.

If you are wondering why hungry comes before not hungry, it’s because the labels appear in alphabetical order. This is important because when a confusion matrix is output in Python, it does not come with any labels showing that the columns are the predicted values and the rows are the actual values. You need to know this before viewing a confusion matrix in Python. In fact, a confusion matrix is going to look something like the image below, but as an array.

Knowing that the labels appear in alphabetical order means that if you had two classes such as cat and dog, the first predicted column would be for cats, because cat starts with a C, followed by dogs, which starts with a D. For boolean labels, the first column will be False, and in a churn or no churn matrix, the first column will be churn and the second will be no churn. The rows also appear in that same order.
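
If you would rather not rely on the default alphabetical ordering, you can pass the label order explicitly and unpack the four cells directly. This is only a sketch with tiny made-up arrays:

from sklearn.metrics import confusion_matrix

y_true = ["hungry", "not hungry", "hungry", "not hungry"]
y_pred = ["hungry", "hungry", "not hungry", "not hungry"]

# Putting the negative class first and the positive class second gives the
# usual tn, fp, fn, tp unpacking order.
cm = confusion_matrix(y_true, y_pred, labels=["not hungry", "hungry"])
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 1 1 1 1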

In a confusion matrix, the True Positives and True Negatives lie on the diagonal from top left to bottom right. The cells highlighted in green in the matrix above show the True Positives and True Negatives. This is more apparent in a matrix with more than two classes, like the one shown below. The true values (True Positives and True Negatives) are aligned diagonally and highlighted in pink in the images below. Everything else (highlighted in blue) is a false prediction. And yes, from three classes onwards a confusion matrix becomes difficult to track using the TP, TN, FP, FN terminology, but essentially, the diagonal holds your accurate predictions.

Classification metrics using the confusion matrix

You may or may not have noticed by now, but a confusion matrix is not itself a metric; rather, it is a table used to evaluate the performance of a classification model by comparing the predicted class labels with the actual class labels. From it, we calculate various metrics that help assess the model’s performance, such as accuracy, precision, recall, and F1 score.

Accuracy

The accuracy metric is determined by taking the number of correct predictions, that is (True Positives + True Negatives), divided by the total number of predictions made.

From our matrix, that would be

Accuracy = (TP+TN)/n
= (110 + 75)/240
= 185/240
= 0.771

# Our accuracy score is 77%.

# Note: n = TP + TN + FP + FN
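
In code, the same calculation is one call to scikit-learn’s accuracy_score. The labels below are made up purely to illustrate; in a real project they would come from your test set and your model’s predictions:

from sklearn.metrics import accuracy_score

# Hypothetical labels: 3 of the 4 predictions match the true values
y_true = ["hungry", "hungry", "not hungry", "not hungry"]
y_pred = ["hungry", "not hungry", "not hungry", "not hungry"]

print(accuracy_score(y_true, y_pred))  # 0.75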

When to use Accuracy?

Accuracy should be used as a metric for evaluating your model when the classes are roughly equally distributed, that is, close to a 50/50 split in the data set (for binary classification). When one class has significantly more samples than the other, accuracy can be misleading, because the model might appear to perform well simply by predicting the majority class.

For example, let’s say we had a data set of 1,000 samples. The positive class (hungry) has only 100 samples and the negative class (not hungry) has 900. If the model correctly predicts not hungry 850 times and hungry 50 times, the accuracy score will be (850 + 50)/1000, which is 90%. This is a great score on the surface, but what is happening is that the model is not actually learning; it is simply predicting the majority class, not hungry, because there is so much of it.

This is not good, because if our model later comes across an instance that should be predicted as hungry, it may not be able to do so, since it has not seen enough data for that class. A model with a 90% accuracy score that still cannot generalize well is of little use to anybody. This problem arises because accuracy, as its name suggests, considers only the correctly predicted values without looking at the class ratios or anything else. This is the problem that precision and recall aim to solve.
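
A tiny sketch makes the point. The arrays below are synthetic, and the “model” simply predicts the majority class every time, yet it still scores 90% accuracy:

from sklearn.metrics import accuracy_score

# Synthetic imbalanced data: 900 not hungry, 100 hungry
y_true = ["not hungry"] * 900 + ["hungry"] * 100

# A useless model that always predicts the majority class
y_pred = ["not hungry"] * 1000

print(accuracy_score(y_true, y_pred))  # 0.9 -> looks great, but the model learned nothing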

Precision

Precision measures the proportion of correctly predicted positive instances (true positives) out of all predicted positive cases (true positives + false positives).

It answers the question: when the model predicts yes, how often is it correct? Building on the example we have been working with, imagine we work for a hypothetical health company called Health Nut. Our task is to build a machine learning model that will be embedded in a hunger detection app. The goal is for our model to precisely estimate whether a person is hungry given certain inputs and alert them to eat something.

Precision is how good or precise we are at correctly predicting that a user of our app is hungry so they are alerted to have a meal. If our model has high precision, it means it is really good at knowing when the user is hungry, which is great because we want to be very precise in our estimation.

But if we have low precision, it means we are not very good at guessing when a user is hungry: we tell users they are hungry and that they should go eat when they are not. In this case, our precision is not very good because our model is not being very precise.

At this rate, the company could get sued if people who use the app become obese or develop a food addiction because our predictions are not at all precise.

Precision is calculated by taking True Positive predictions divided by all positive predictions (this includes those that were falsely predicted as positive). The formula for precision is Precision = TP/(TP + FP).

Again using our matrix, we can calculate our precision as

Precision = TP/(TP + FP)
= 110/(110 + 10)
= 110/120
= 0.916

# Our precision score is 92%. Impressive! This is 15 percentage points
# higher than our accuracy score
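
In scikit-learn, precision is one function call. The arrays below are hypothetical; pos_label tells the metric which class we treat as positive:

from sklearn.metrics import precision_score

# Hypothetical labels: the model predicted "hungry" twice and was right once
y_true = ["hungry", "hungry", "not hungry", "not hungry"]
y_pred = ["hungry", "not hungry", "hungry", "not hungry"]

print(precision_score(y_true, y_pred, pos_label="hungry"))  # 0.5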

Recall

Recall measures the ratio of correctly predicted positive cases (true positives) out of all actual positive cases (true positives + false negatives).

It is calculated as Recall = TP/(TP + FN).

Recall is sometimes referred to as sensitivity. Why, you ask? Because it measures how sensitive our model is at predicting our positive class (hungry) correctly. Now, I know what you are thinking: how is this different from precision? Let me explain.

Precision measures how often the model’s positive predictions are correct, out of all the positive predictions it made, while recall measures how many of the actual positive cases the model correctly identified.

Using our model from before, we can explain the two in this way:

  • Precision: If the precision is high, it means that when the model predicts that someone is hungry, it is very likely to be correct. There are fewer false positives; that is, very few people who are not hungry are told they are hungry.
  • Recall: If the recall is high, it means that the model successfully identifies and captures a large proportion of the actual hungry people. There are fewer false negatives, which means that the model does not miss many cases of hungry people and correctly identifies them.

In short, precision focuses on the accuracy of positive predictions, while recall focuses on the coverage of actual positive instances.

Recall = TP/(TP + FN)
= 110/(110 + 45)
= 110/155
= 0.71

# Our recall score is 71%, about 21 percentage points lower than our precision
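
The scikit-learn version is just as short. Again, the arrays are made up; pos_label marks hungry as the positive class:

from sklearn.metrics import recall_score

# Hypothetical labels: 3 people are actually hungry, the model catches 2 of them
y_true = ["hungry", "hungry", "hungry", "not hungry"]
y_pred = ["hungry", "not hungry", "hungry", "not hungry"]

print(recall_score(y_true, y_pred, pos_label="hungry"))  # ~0.67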

Now we see that our recall performs noticeably worse than our precision. If we are performing a task where both precision and recall are important to us, then we need another metric. This is where F1 comes in.

F1 Score

The F1 score is the combination of precision and recall, also known as the harmonic mean of precision and recall. It is particularly useful when you want to find a balance between the two, since optimizing one metric (e.g. pushing for high precision) may come at the expense of the other (resulting in low recall), as we are currently encountering, or vice versa. F1 considers both precision and recall to provide a balanced assessment of the model’s effectiveness. The range of the F1 score is 0 to 1, and the closer it is to 1, the better the performance of our model.

F1 is calculated as 2 * ((precision * recall)/(precision + recall)).

Continuing with our example, this would be our F1 score.

f1 = 2 * ((precision * recall)/(precision + recall))
= 2 * ((0.916 * 0.71) / (0.916 + 0.71))
= 2 * (0.650/1.626)
= 2 * 0.40
= 0.80

# Our f1 score is 80%

Now we see that our F1 score is higher than our recall score but lower than our precision. This is the trade-off we have to make when both scores are important to us.

Finally, I should mention that although there is no universally accepted threshold, a common benchmark is a score of 80% or higher in most applications. It could be even higher in fields like medicine and lower in fields like gaming.

Note: Don’t worry, you won’t have to calculate any of these metrics by hand as I did in this article. From the metrics module in sklearn, you can simply import whichever of these metrics you need.
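
As a sketch of what that looks like (the y_true and y_pred arrays here are made up for illustration):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report)

# Hypothetical test labels and model predictions
y_true = ["hungry", "hungry", "not hungry", "hungry", "not hungry"]
y_pred = ["hungry", "not hungry", "not hungry", "hungry", "not hungry"]

print(accuracy_score(y_true, y_pred))                       # 0.8
print(precision_score(y_true, y_pred, pos_label="hungry"))  # 1.0
print(recall_score(y_true, y_pred, pos_label="hungry"))     # ~0.67
print(f1_score(y_true, y_pred, pos_label="hungry"))         # 0.8

# classification_report prints precision, recall and F1 for every class at once
print(classification_report(y_true, y_pred))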

When to use F1

It is particularly useful when you have imbalanced classes or when both precision and recall are equally important for your task.

Key Insights

A confusion matrix is a table used with classification models, from which performance metrics such as accuracy, precision and recall can be derived

Use precision when you want to predict the positive class accurately

Use recall when you want to capture a large proportion of the positive class

Use F1 if both precision and recall are important to you.

This is a confusion matrix in a nutshell, but of course we are not going to end here. Let’s have a little test, but without the predicted and actual labels to guide us. Share your answers in the chat. The first one to get all the questions correct gets a special mention in my next article. :)


Test yourself

A confusion matrix for a binary classifier (pregnant or not pregnant)

Scenario: You want to build a model that can accurately predict whether someone is pregnant, to be used in a new pregnancy toolkit. Pregnant is your positive class.

  1. How many predictions are True Positives?
  2. How many predictions are True Negatives?
  3. How many predictions are False Positives?
  4. What is the accuracy score?
  5. What is the precision of your model?
  6. What is your F1 score?
  7. Which metric is more appropriate in this scenario?

Happy Coding!

Follow me on GitHub, LinkedIn, and here on Medium

If you want to learn more about Machine Learning and Data Analytics check out Azubi Africa’s Data Analytics Professional Program.
