
Did the confusion matrix ever confuse you?

Here are some important tips to understand the confusion matrix and never get confused again

Pranavi Duvva · Feb 11, 2021

Let’s get started…

Introduction

A confusion matrix, also known as an error matrix, helps us analyze the performance of classification models. It gives clear insight into how well your classification algorithm is working.

The confusion matrix records in detail how many data points were correctly classified with respect to the class of interest and how many were misclassified. It summarizes the true positives, true negatives, false positives, and false negatives.

Consider an example of a binary classification problem.

Let’s say you have been assigned the task of analyzing a cancer dataset, with cancer positive (1) and negative (0) as the classes. Based on this data, you have to build a model to predict whether a person in the future will be cancer positive or not.

The class of interest for this dataset is the positive case, which is 1. After analyzing the data, you finish building your classification model.

Now, to understand how well your model can distinguish the classes and predict correctly, we use the confusion matrix.

A good model is one with minimal misclassification errors. Especially in a case like detecting cancer, we must ensure the model produces relatively few false negatives.

Structure of the confusion matrix

For binary classification, the confusion matrix is a 2×2 matrix where each row represents the instances of an actual class while each column represents the instances of a predicted class.

You can also have it the other way around, where each row represents the predicted class and each column represents the actual class.

Tip 1:

Always stick to one format to avoid confusion. I prefer to have the actual instances on the rows and the predicted instances on the columns.

With actual classes on the rows and predicted classes on the columns, the matrix looks like this:

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)
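One simple way to print a matrix in exactly this orientation is pandas’ crosstab. Below is a minimal sketch using hypothetical label lists (y and y_predicted are made-up examples, with 1 = cancer positive):

import pandas as pd

# Hypothetical actual and predicted labels (1 = cancer positive, 0 = negative)
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_predicted = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

# Naming the Series puts 'Actual' on the rows and 'Predicted' on the columns
print(pd.crosstab(pd.Series(y, name='Actual'),
                  pd.Series(y_predicted, name='Predicted')))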

1. True Positive

A true positive is a data point belonging to the cancer-positive class that the model has correctly predicted as positive.

In other words, true positives are all those cases that belong to the class of interest and that the model has also predicted to belong to that class.

2. True Negative

A true negative is a data point belonging to the cancer-negative class that the model has correctly predicted as negative.

In other words, true negatives are all those cases that do not belong to the class of interest and that the model has also predicted not to belong to it.

3. False Positive

A false positive is a case where a person does not have cancer but the model has predicted them as cancer positive.

In other words, false positives are all those cases where the model has misclassified a negative as a positive. A false positive is also known as a Type I error.

4. False Negative

A false negative is a case where a person has cancer but the model has predicted them as cancer negative.

In other words, false negatives are all those cases where the model has misclassified a positive as a negative. A false negative is also known as a Type II error.
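To make these four counts concrete, here is a minimal pure-Python sketch that tallies them from the same hypothetical labels used above (1 = cancer positive):

# Hypothetical actual and predicted labels (1 = cancer positive, 0 = negative)
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_predicted = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

tp = sum(1 for a, p in zip(y, y_predicted) if a == 1 and p == 1)  # correctly caught positives
tn = sum(1 for a, p in zip(y, y_predicted) if a == 0 and p == 0)  # correctly rejected negatives
fp = sum(1 for a, p in zip(y, y_predicted) if a == 0 and p == 1)  # Type I errors
fn = sum(1 for a, p in zip(y, y_predicted) if a == 1 and p == 0)  # Type II errors

print(tp, tn, fp, fn)  # 4 3 2 1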

Before we proceed to derive the performance metrics from the confusion matrix, here is another important tip.

Tip 2:

Always make sure the inputs to the confusion matrix are given as the actual values (y) first and then the predicted values (y_predicted). This puts the actual instances on the rows and the predicted instances on the columns.

from sklearn.metrics import confusion_matrix

confusion_matrix(y, y_predicted)

The output will be a 2×2 matrix.

For binary labels 0 and 1, the entries are arranged as:

[[TN, FP],
 [FN, TP]]
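Putting it together, here is a minimal end-to-end sketch with the same hypothetical labels as above:

from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (1 = cancer positive, 0 = negative)
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_predicted = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

# Actual values first, then predictions: rows = actual, columns = predicted
cm = confusion_matrix(y, y_predicted)
print(cm)
# [[3 2]
#  [1 4]]

# Unpack the four counts in sklearn's row-major order
tn, fp, fn, tp = cm.ravel()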

Performance Metrics

1. Accuracy

Accuracy tells us what fraction of all data points the model classifies correctly.

Accuracy=(TP+TN)/(TP+TN+FP+FN)

The lower the number of false predictions, the higher the accuracy of the model.
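With the hypothetical counts from the example above (TP = 4, TN = 3, FP = 2, FN = 1):

tp, tn, fp, fn = 4, 3, 2, 1  # counts from the hypothetical example above

# Fraction of all predictions that were correct
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.7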

2. Sensitivity or Recall

Recall tells us: of all the data points in the dataset that are actually cancer positive, how many has the model correctly identified as cancer positive?

Sensitivity or Recall=TP/(TP+FN)

Note: Sensitivity and false negatives (the Type II error) are inversely related; the fewer the false negatives, the higher the recall.
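Using the same hypothetical counts:

tp, fn = 4, 1  # counts from the hypothetical example above

# Fraction of actual positives the model caught
recall = tp / (tp + fn)
print(recall)  # 0.8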

3. Precision

Precision tells us: of all the data points the model predicted as cancer positive, how many are actually cancer positive?

Precision=TP/(TP+FP)

Note: Precision and false positives (the Type I error) are inversely related; the fewer the false positives, the higher the precision.
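Again with the same hypothetical counts:

tp, fp = 4, 2  # counts from the hypothetical example above

# Fraction of positive predictions that were actually positive
precision = tp / (tp + fp)
print(precision)  # 0.666...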

4. F1 score

To compare models on both Precision and Recall, we use the F1 score.

The F1 score is the harmonic mean of precision and recall.

F1 score=2*(Precision*Recall)/(Precision+Recall)
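With the precision and recall computed above:

precision, recall = 4 / 6, 4 / 5  # values from the hypothetical example above

# The harmonic mean penalizes a large gap between precision and recall
f1 = 2 * (precision * recall) / (precision + recall)
print(f1)  # ~0.727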

Conclusion

A confusion matrix is a very efficient tool for analyzing the performance of classification models. With the help of this summary table, you can tune your algorithm and its hyperparameters to achieve better performance.

Hope I was able to clear your confusion!

Thanks for reading!
