Confusion Matrix in Machine Learning

Nikhil Agarwal
Published in Nerd For Tech
Feb 14, 2021

Introduction

When we start a machine learning classification problem, we always try to find the best model for our data. But hold on, how can we decide whether our model is actually right for the given data? This is where the confusion matrix comes into the picture: it describes the performance of a classification model.
First of all, I would like to address a question that is very common and often answered wrongly: is accuracy alone enough to decide whether a model is the best one for a given dataset?

Accuracy is certainly an important metric to consider, but it sometimes fails to depict the whole picture, because we need our model to be robust rather than merely giving us high accuracy. By a robust model, I mean one that predicts well for every class in our problem.

For example, suppose our training data has two classes, Boy and Girl, and consists of 97,000 boys and 3,000 girls. A model that simply predicts Boy for every sample would still reach about 97% accuracy. But if we test that model on real data containing girls, it will fail despite this seemingly huge accuracy. This happens because the high ratio of boys in the training data pushes the mathematics behind the model towards predictions that favour boys. Let's see how the confusion matrix helps us deal with this problem.
Roadmap of the blog: Introduction -> Basic jargon -> Confusion matrix example explained with each metric (Accuracy, Error Rate, Recall, Precision, F1-Score, Fβ-Score, Specificity, FPR)

This blog aims to answer the following questions:
1. What is a confusion matrix?
2. Why is it used?
3. What are Accuracy, F1-Score, Precision, Recall and the other related metrics?

Before starting with the confusion matrix, we will go through the basic terminology and its meaning so that the later concepts are easier to understand.

Basic jargon in classification metrics:

We will understand all of these terms through a single scenario, checking which case falls under which term.
Scenario: You had a Covid-19 test.
A tip from my side that helps me remember these similar-sounding terms: the right-hand word (Positive/Negative) describes the prediction, while the left-hand word (True/False) says whether that prediction matches the actual value. Visualize every term in this format and it will settle in your memory without any confusion. In the explanations below I show how to visualize one or two of them.

True Positive (Tp): The actual event is true and the prediction is also true. (Actual = 1, Predicted = 1)
Example in our scenario: You have Covid, and the test predicts that you have Covid.
Visualization: The prediction is positive, so Predicted = 1 (Covid), and it is true with respect to the actual value, so Actual = 1 (Covid).
True Negative (Tn): The actual event is false and the prediction is also false. (Actual = 0, Predicted = 0)
Example in our scenario: You don't have Covid, and the test correctly predicts that you don't.
Visualization: The prediction is negative, so Predicted = 0 (no Covid), and it is true with respect to the actual value, so Actual = 0 (no Covid).
False Positive (Fp): The actual event is false but the prediction is true. (Actual = 0, Predicted = 1) Also known as a Type-I error.
Example in our scenario: You don't have Covid, but the test predicts that you do.
False Negative (Fn): The actual event is true but the prediction is false. (Actual = 1, Predicted = 0) Also known as a Type-II error.
Example in our scenario: You have Covid, but the test predicts that you don't.

To check which of Tp, Tn, Fp and Fn correspond to correct predictions, you can take the XNOR of the predicted and actual values: it equals 1 exactly for Tp and Tn, so those are the correctly predicted cases.
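To make this concrete, here is a minimal Python sketch (the actual and predicted lists are made-up illustrative values, not from any real dataset) that counts Tp, Tn, Fp and Fn and uses XNOR to verify that only Tp and Tn are the correct predictions:

```python
# Minimal sketch: counting Tp, Tn, Fp, Fn for binary 0/1 labels.
# `actual` and `predicted` are made-up illustrative values.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

# XNOR of actual and predicted is 1 only for Tp and Tn (the correct predictions).
correct = sum(1 for a, p in zip(actual, predicted) if not (a ^ p))
assert correct == tp + tn

print(tp, tn, fp, fn)  # -> 3 3 1 1
```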
Now we will look at the classification metrics themselves, each illustrated with the example confusion matrix shown below:
[Example confusion matrix, 165 samples in total: Tp = 100, Fn = 5, Fp = 10, Tn = 50]

1. Accuracy: Accuracy tells us how often our model gives correct predictions. It is the fraction of samples predicted correctly, i.e. (Tp + Tn) / (Tp + Tn + Fp + Fn).
In the above example, Accuracy = 150/165 (i.e. 90.91%).

2. Error Rate: The error rate tells us how often our model gives a wrong prediction. It is the fraction of samples predicted incorrectly, i.e. (Fp + Fn) / (Tp + Tn + Fp + Fn).
In the above example, Error Rate = 15/165 (i.e. 9.09%).
Error Rate = 1 - Accuracy
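As a quick sanity check, here is a small sketch that recomputes accuracy and error rate from the example matrix above (using the counts Tp = 100, Fn = 5, Fp = 10, Tn = 50):

```python
# Counts from the example confusion matrix used throughout this post.
tp, fn, fp, tn = 100, 5, 10, 50
total = tp + tn + fp + fn              # 165 samples

accuracy = (tp + tn) / total           # 150 / 165
error_rate = (fp + fn) / total         # 15 / 165

print(f"Accuracy   = {accuracy:.4f}")        # ~0.9091
print(f"Error rate = {error_rate:.4f}")      # ~0.0909
print(f"1 - Accuracy = {1 - accuracy:.4f}")  # same value as the error rate
```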

3. Recall (Sensitivity): Recall is the fraction of actually positive events that the model predicted correctly.
In other words, when the actual value is positive, how often is the prediction correct? That is recall.
Calculated by: Tp / (Fn + Tp)
In the above example, Recall = 100/105 (i.e. 95.24%).
From the formula we can see that recall always lies between 0 and 1, both inclusive, since Tp can never exceed Fn + Tp. It tells us how well our model does on the samples whose actual value is positive.
Now consider two extreme cases: Recall = 1 and Recall = 0.
When Recall = 1, the model has predicted every actually positive value correctly; whether that alone makes it a good model is discussed further below.
When Recall = 0, the model is completely broken for the positive class and cannot make a single correct prediction when the actual value is positive.
Now the question arises: why do we need recall if accuracy already provides us with this information?
To answer this, let us consider an example to make it simpler to understand.
Consider a dog classification problem in which we have 85 instances of not dog (i.e. 0) and 15 instances of dog (i.e. 1). Now assume that, because of this imbalance in the mathematics behind the model, it predicts not dog (0) for everything.
Accuracy is 85/100 = 85%
If one only checks the accuracy, the model looks really good. Now let us check the recall.
Recall = 0/15 = 0
We get Recall = 0, which means the model cannot correctly classify even a single data point whose actual value is positive.
Hence, this shows how important recall is when commenting on model performance.
It also shows that accuracy alone is not the best way to evaluate a model.
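Here is a short sketch of this dog example, assuming scikit-learn is available; the labels simply encode the 85/15 split described above and a model that always predicts 0:

```python
# Sketch of the dog example: 85 "not dog" (0) and 15 "dog" (1),
# evaluated against a degenerate model that predicts 0 for everything.
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 85 + [1] * 15
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.85 -> looks great
print(recall_score(y_true, y_pred))    # 0.0  -> the model never finds a dog
```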
So we can conclude that we do not want a low recall. But what if its value is really high and close to 1?
Recall = 1 is only guaranteed when the model predicts every actually positive value correctly, and a model pushed to do that (in the extreme, one that predicts positive for everything) tends to have low precision. So here comes the question: what is precision?

4. Precision: Precision is the fraction of predicted positive events that are actually positive.
In other words, it is the probability that the prediction is correct, given that our model predicted positive.
Calculated by: Tp / (Fp + Tp)
In the above example, Precision = 100/110 (i.e. 90.91%).
This means we want to maximize the precision of our model, but if precision is pushed very close to 1, the model often ends up with low recall because of a higher number of false negatives. So what can we conclude from this?
We have to balance recall and precision, since pushing one up tends to pull the other down.
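As a sketch of this trade-off, the snippet below computes precision and recall for the example matrix, and then for a hypothetical model that predicts positive for every sample:

```python
# Precision vs. recall on the example confusion matrix from this post.
tp, fn, fp, tn = 100, 5, 10, 50

recall = tp / (tp + fn)      # 100 / 105 ~ 0.9524
precision = tp / (tp + fp)   # 100 / 110 ~ 0.9091
print(f"Recall    = {recall:.4f}")
print(f"Precision = {precision:.4f}")

# Hypothetical extreme: a model that predicts positive for every sample.
# All 105 actual positives become Tp and all 60 actual negatives become Fp.
tp2, fn2, fp2 = tp + fn, 0, fp + tn
print(tp2 / (tp2 + fn2))     # recall    = 1.0
print(tp2 / (tp2 + fp2))     # precision = 105 / 165 ~ 0.636
```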
There are also problems, or datasets, in which we deliberately prioritise one of them over the other. That trade-off is not the focus of this blog, so I will not go into its details here; I only mention it for the day such a dataset comes into the picture.
To get a more 'combined' measure of the precision-recall trade-off, we consider another metric called the F1-Score.

5. F1-Score: It is the harmonic mean of precision and recall, and can be interpreted as a weighted average of the two.
F1-Score = 2 * (Recall * Precision) / (Recall + Precision)
In the above example, F1-Score = 2 * (100/105 * 100/110) / (100/105 + 100/110) ≈ 0.93
We use the harmonic mean rather than the simple average because it punishes extreme values: a model with precision 1 and recall 0 has a simple average of 0.5 but an F1-Score of 0. The F1-Score gives equal weight to precision and recall; this is where the Fβ metric comes into the picture, where β can be adjusted to give more weight to either precision or recall.
So, if we want a balanced classification model with good values of both recall and precision, we try to maximize the F1-Score.
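A minimal sketch that reproduces the F1 calculation above and shows why the harmonic mean punishes extreme values:

```python
# F1-Score as the harmonic mean of precision and recall.
recall = 100 / 105
precision = 100 / 110

f1 = 2 * (recall * precision) / (recall + precision)
print(f"F1 = {f1:.2f}")   # ~0.93

# Why the harmonic mean: extreme values are punished.
p, r = 1.0, 0.0
simple_average = (p + r) / 2                         # 0.5
harmonic = 2 * p * r / (p + r) if (p + r) else 0.0   # 0.0
print(simple_average, harmonic)
```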
The F1-Score is just one case of the Fβ metric; let's understand the general case now.

6. Fβ-Score: It is the same kind of metric as the F1-Score, except that it does not give equal weight to precision and recall; it allows us to give more weight to either of them.
Fβ-Score = (1 + β²) * (Precision * Recall) / (β² * Precision + Recall)
The most commonly used beta values are β = 0.5, 1 and 2.
In the above example, taking β = 1 makes it equal to the F1-Score; for any other β, just substitute the value into the formula (β > 1 weights recall more heavily, β < 1 weights precision more heavily).
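Here is a small sketch of the Fβ formula applied to the same example; the f_beta helper is just for illustration:

```python
# F-beta from precision and recall; beta > 1 favours recall, beta < 1 favours precision.
def f_beta(precision, recall, beta):
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

precision, recall = 100 / 110, 100 / 105

for beta in (0.5, 1, 2):
    print(beta, round(f_beta(precision, recall, beta), 4))
# beta = 1 reproduces the F1-Score (~0.93).
```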

7. Specificity: Also known as the True Negative Rate. It is the fraction of actual negatives that are correctly predicted, and tells us how good our model is at identifying negative cases.
Specificity = Tn/(Tn + Fp)
In the above example, Specificity = 50/60 (i.e. 83.33%).
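For the same example counts, a one-line check:

```python
# Specificity (true negative rate) on the example matrix.
tn, fp = 50, 10
specificity = tn / (tn + fp)
print(f"Specificity = {specificity:.4f}")   # 50 / 60 ~ 0.8333
```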

8. False Positive Rate (FPR): It is the ratio of the number of actual negatives wrongly classified as positive (i.e. false positives) to the total number of actual negatives.
So, FPR tells us how badly the model performs when the actual value is negative.
In other words, if we look only at the negatives, FPR measures how many of them the model has wrongly classified as positive.
FPR = Fp/(Fp + Tn)
In the above example, False Positive Rate (FPR) = 10/60 (i.e. 16.67%).
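And the matching check for FPR, which is simply 1 minus the specificity computed above:

```python
# False positive rate on the example matrix.
tn, fp = 50, 10
fpr = fp / (fp + tn)
specificity = tn / (tn + fp)
print(f"FPR = {fpr:.4f}")                          # 10 / 60 ~ 0.1667
print(f"1 - Specificity = {1 - specificity:.4f}")  # same value: FPR = 1 - Specificity
```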
I hope the basics of the confusion matrix, along with a deeper understanding of its metrics, are now clear to you. Keep reading and keep hustling.

You can reach me at:
LinkedIn : https://www.linkedin.com/in/nikhil-agarwal-4881a9195/
Gmail: nikhilagarwal82537@gmail.com
Github: https://github.com/nikhil24agarwal
Thanks for Reading!
