Let's make the “Confusion matrix” less confusing!!
“Doubts are good. Confusion is excellent. Questions are awesome.
All these are attempts to expand the wisdom of mind.”
― Manoj Arora
In a classification problem, it is often important to specify how performance is assessed, especially when the costs of different misclassifications vary significantly. Classification accuracy, the fraction of objects the classifier labels correctly, is one such measure.
A confusion matrix, also called a contingency table or error matrix, is a convenient way to visualize the performance of a classifier. The columns of the matrix represent the instances of the predicted classes and the rows represent the instances of the actual classes. (Note: it can be the other way around as well.)
The confusion matrix shows the ways in which your classification model is confused when it makes predictions.
A confusion matrix consists of predicted and actual values.
In this confusion matrix, the cells with the green background are the “correct” cells:
- TRUE NEGATIVE (TN): The values that the classifier predicts as false and that are actually false.
- TRUE POSITIVE (TP): The values that the classifier predicts as true and that are actually true.
And the cells with the red background are the “error” cells:
- FALSE NEGATIVE (FN): The values that the classifier predicts as false but that are actually true. This is also called a Type II error.
- FALSE POSITIVE (FP): The values that the classifier predicts as true but that are actually false. This is also called a Type I error.
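To make the four cells concrete, here is a minimal sketch that counts them by hand. The function name `confusion_counts` and the label lists are made up for illustration, and it assumes binary labels where 1 means positive and 0 means negative.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1  # predicted true, actually true
        elif a == 0 and p == 0:
            tn += 1  # predicted false, actually false
        elif a == 0 and p == 1:
            fp += 1  # predicted true, actually false (Type I error)
        else:
            fn += 1  # predicted false, actually true (Type II error)
    return tp, tn, fp, fn


# Hypothetical labels, invented for this example
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```

With these labels there is one false negative (the third item) and one false positive (the sixth item); the other six predictions are correct.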
Why Do We Need a Confusion Matrix?
Let us consider the example of the COVID-19 virus. Say you want to predict which people are infected with the virus before they show symptoms, so that they can be isolated from the healthy population. The two values for our target variable would be COVID positive and not COVID positive.
You might be wondering why we need a confusion matrix when we have accuracy to check our results. Let's check our accuracy first!
There are 970 data points for the negative class and 30 data points for the positive class, 1000 in total. This is how we’ll calculate the accuracy:
The outcome counts are:
TP = 20, TN = 950, FP = 20, FN = 10
So, the accuracy of our model turns out to be:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (20 + 950) / 1000 = 0.97
Here our accuracy is 97%, which is not bad! But it gives the wrong idea about the result.
Our model seems to be saying, “I can predict COVID positive people 97% of the time”. However, it is doing the opposite. It identifies the people who are not COVID positive with high accuracy, while 10 of the 30 COVID positive people are missed and keep spreading the virus!
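A short sketch shows why the 97% figure is misleading. It uses the counts from this example and compares the model with a hypothetical "lazy" baseline (an assumption added here for illustration) that predicts "not COVID positive" for everyone.

```python
# Counts from the COVID example above
tp, tn, fp, fn = 20, 950, 20, 10

total = tp + tn + fp + fn
accuracy = (tp + tn) / total

# Hypothetical baseline: predict "negative" for every person.
# It is right on every actual negative and wrong on every actual positive.
actual_negatives = tn + fp
baseline_accuracy = actual_negatives / total

print(f"Model accuracy:    {accuracy:.2%}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
```

Both lines report 97%, even though the baseline never finds a single infected person. Accuracy alone cannot tell the two apart on data this imbalanced.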
Do you think this is the correct way of measuring our result? Shouldn’t we be measuring how many of the positive cases we predict correctly, to arrest the spread of the contagious virus? Or maybe, out of the cases we predicted as positive, how many are actually positive, to check the reliability of our model?
This is where we come across the dual concept of Precision and Recall.
Precision vs. Recall
Precision tells us what proportion of patients we diagnosed as having the virus actually had the virus.
Precision is calculated by:
Precision = TP / (TP + FP)
Recall tells us what proportion of the patients who actually had the virus were predicted by us as having the virus. In a case like this one, it should be as high as possible.
Recall is calculated by:
Recall = TP / (TP + FN)
Note: Recall tells you how much of the +ve’s you can find.
Precision tells you how much junk there is in your predicted +ve’s.
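As a quick check, here is a minimal sketch that computes both measures for the COVID example, using the counts given earlier (TP = 20, TN = 950, FP = 20, FN = 10).

```python
# Counts from the COVID example
tp, tn, fp, fn = 20, 950, 20, 10

# Of everyone we flagged as positive, how many really were positive?
precision = tp / (tp + fp)

# Of everyone who really was positive, how many did we flag?
recall = tp / (tp + fn)

print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
```

Precision comes out at 0.50 and recall at about 0.67, a very different picture from the 97% accuracy.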
In practice, when we try to increase the precision of our model, the recall often goes down, and vice versa. The F1-score captures both trends in a single value.
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
It’s a measure of a test’s accuracy. It considers both the precision and the recall of the test to compute the score using the harmonic mean.
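The sketch below computes the F1 score for the COVID example as the harmonic mean of precision and recall. The arithmetic mean is printed only for comparison; it is not part of the F1 definition.

```python
# Counts from the COVID example
tp, tn, fp, fn = 20, 950, 20, 10

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * (precision * recall) / (precision + recall)

# Plain average, shown only for contrast
arithmetic_mean = (precision + recall) / 2

print(f"F1 score:        {f1:.4f}")
print(f"Arithmetic mean: {arithmetic_mean:.4f}")
```

The harmonic mean is pulled toward the smaller of the two values, so a model cannot get a high F1 score by doing well on only one of precision or recall.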
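- TRUE POSITIVE RATE (Sensitivity): the proportion of actual positives that are predicted as positive. It is the same as recall.
TPR = TP/(TP+FN)
- FALSE POSITIVE RATE (1 - Specificity): the proportion of actual negatives that are wrongly predicted as positive.
FPR = FP/(FP+TN)
Specificity, also called the true negative rate, is TN/(TN+FP), so FPR = 1 - Specificity.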
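These rates can be checked on the COVID example with a minimal sketch. Note that it uses the standard definitions: the false positive rate is FP/(FP+TN), while TN/(TN+FP) is the specificity, and the two add up to 1.

```python
# Counts from the COVID example
tp, tn, fp, fn = 20, 950, 20, 10

tpr = tp / (tp + fn)          # sensitivity, same as recall
fpr = fp / (fp + tn)          # share of actual negatives flagged as positive
specificity = tn / (tn + fp)  # share of actual negatives correctly cleared

print(f"TPR (sensitivity): {tpr:.4f}")
print(f"FPR:               {fpr:.4f}")
print(f"Specificity:       {specificity:.4f}")
```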
Confusion Matrix using scikit-learn in Python
I have used the popular Python library scikit-learn to explain the confusion matrix.
The dataset used is the Titanic dataset, which is available on Kaggle.
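As a minimal sketch of how scikit-learn's `confusion_matrix` can be called, the example below uses small hand-made label lists. These labels are hypothetical and are not the Titanic data; in a real workflow `y_pred` would come from a trained model's `predict` call on a held-out test set.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels (1 = survived, 0 = did not survive), invented for illustration
y_true = [0, 1, 0, 1, 1, 0, 0, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]

# scikit-learn puts actual classes in rows and predicted classes in columns,
# with labels in sorted order, so the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)

# ravel() flattens the 2x2 matrix row by row
tn, fp, fn, tp = cm.ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```

Because the negative class comes first in the sorted labels, the top-left cell is the true negatives, not the true positives. It is worth checking this layout before reading values off the matrix.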
Evaluating the Algorithm
True Positive is 97
True Negative is 47
False Positive is 12
False Negative is 23
Now we can evaluate the model using performance metrics.
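A short sketch of that evaluation, plugging the four counts reported above into the formulas from the earlier sections:

```python
# Counts reported above for the Titanic model
tp, tn, fp, fn = 97, 47, 12, 23

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```

With these counts, accuracy is about 0.80, precision about 0.89, recall about 0.81 and the F1 score about 0.85.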
- YouTube: “Machine Learning Fundamentals: The Confusion Matrix” by StatQuest.
- BOOK: https://www.oreilly.com/library/view/machine-learning-quick/9781788830577/35d1aa26-9a98-4fd0-ada0-af922e84579d.xhtml
I hope the confusion matrix is clear to you now. Thank you so much for reading.