What is a confusion matrix?

Anuganti Suresh · Analytics Vidhya · Nov 17, 2020

Everything You Should Know About the Confusion Matrix for Machine Learning

A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model.

Binary Classification Problem (2x2 matrix)
  1. A good model is one that has high TP and TN rates and low FP and FN rates.
  2. If you are working with an imbalanced dataset, it is better to rely on the confusion matrix (and the metrics derived from it) rather than accuracy alone when evaluating your machine learning model.

A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a classifier. It is used to measure the performance of a classification model through metrics such as accuracy, precision, recall, and F1-score.

Confusion matrices are widely used because they give a better idea of a model’s performance than classification accuracy does. For example, in classification accuracy, there is no information about the number of misclassified instances. Imagine that your data has two classes where 85% of the data belongs to class A, and 15% belongs to class B. Also, ​assume that your classification model correctly classifies all the instances of class A, and misclassifies all the instances of class B. In this case, the model is 85% accurate. However, class B is misclassified, which is undesirable. The confusion matrix, on the other hand, displays the correctly and incorrectly classified instances for all the classes and will​, therefore, give a better insight into the performance of your classifier.
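To make this concrete, here is a minimal Python sketch of that 85% / 15% scenario (assuming scikit-learn is installed): a classifier that labels everything as class A reaches 85% accuracy, but the confusion matrix immediately reveals that every class B instance is misclassified.

from sklearn.metrics import accuracy_score, confusion_matrix

# 100 samples: 85 of class "A", 15 of class "B"; the model predicts "A" for everything
y_true = ["A"] * 85 + ["B"] * 15
y_pred = ["A"] * 100

print(accuracy_score(y_true, y_pred))                      # 0.85
print(confusion_matrix(y_true, y_pred, labels=["A", "B"]))
# [[85  0]
#  [15  0]]  -> all 15 class B instances end up in the wrong cell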

We can measure how accurately a model predicts, i.e. how many values it predicts correctly, in two ways:

1. Confusion Matrix

2. Classification Measure

1. Confusion Matrix

a. Understanding Confusion Matrix:

The following four terms are the basic terminology that will help us determine the metrics we are looking for.

  • True Positives (TP): the actual value is Positive and the predicted value is also Positive.
  • True Negatives (TN): the actual value is Negative and the predicted value is also Negative.
  • False Positives (FP): the actual value is Negative but the predicted value is Positive. Also known as a Type I error.
  • False Negatives (FN): the actual value is Positive but the predicted value is Negative. Also known as a Type II error.

For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:

Confusion Matrix for the Binary Classification
  • The target variable has two values: Positive or Negative
  • The columns represent the actual values of the target variable
  • The rows represent the predicted values of the target variable

b. Understanding Confusion Matrix in an easier way:

Let’s take an example:

We have a total of 20 animals (cats and dogs), and our model predicts whether each one is a cat or not.

Actual values = [‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’]
Predicted values = [‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘cat’, ‘cat’, ‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’]

True Positive (TP) = 6

You predicted positive and it’s true. You predicted that an animal is a cat and it actually is.

True Negative (TN) = 11

You predicted negative and it’s true. You predicted that the animal is not a cat and it actually is not (it’s a dog).

False Positive (Type I Error) (FP) = 2

You predicted positive and it’s false. You predicted that the animal is a cat but it actually is not (it’s a dog).

False Negative (Type II Error) (FN) = 1

You predicted negative and it’s false. You predicted that the animal is not a cat but it actually is.
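As a quick sanity check, these four counts can be reproduced with a few lines of Python (assuming scikit-learn is installed; note that scikit-learn lays the matrix out with rows as actual values and columns as predicted values):

from sklearn.metrics import confusion_matrix

actual    = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']

# With 'cat' as the positive class, ravel() unpacks the 2 x 2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=['dog', 'cat']).ravel()
print(tp, tn, fp, fn)  # 6 11 2 1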

2. Classification Measure

Essentially, these measures are an extension of the confusion matrix: metrics computed from its counts that give a better understanding and analysis of the model and its performance.

a. Accuracy

b. Precision

c. Recall (TPR, Sensitivity)

d. F1-Score

e. FPR (Type I Error)

f. FNR (Type II Error)

a. Accuracy:

Accuracy simply measures how often the classifier makes the correct prediction. It’s the ratio between the number of correct predictions and the total number of predictions.

However, accuracy is not suited to imbalanced classes. With imbalanced data, a model that simply predicts the majority class for every point will still achieve a high accuracy, even though it is not a useful model.

Accuracy is therefore a valid choice of evaluation metric only for classification problems that are well balanced, with no significant class skew.
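In formula form, Accuracy = (TP + TN) / (TP + TN + FP + FN). A quick worked example with the cat/dog counts from above:

tp, tn, fp, fn = 6, 11, 2, 1                # counts from the cat/dog example
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)                             # 0.85 -> 17 of the 20 predictions were correct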

b. Precision:

It is a measure of correctness that is achieved in true prediction. In simple words, it tells us how many predictions are actually positive out of all the total positive predicted.

Precision is defined as the ratio of the number of correctly classified positive instances to the total number of instances predicted as positive. In other words, out of all the predicted positives, how many did we predict correctly? Precision should be high (ideally 1).

Precision is a useful metric in cases where False Positives are a bigger concern than False Negatives.

Ex 1: In spam detection, we need to focus on precision. If a mail is not spam but the model predicts it as spam, that is a False Positive (FP), and we always try to reduce FPs so that legitimate mail is not lost.

Ex 2: Precision is also important in music or video recommendation systems, e-commerce websites, etc. Wrong results could lead to customer churn and be harmful to the business.
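In formula form, Precision = TP / (TP + FP). Using the cat/dog counts from above:

tp, fp = 6, 2                  # counts from the cat/dog example
precision = tp / (tp + fp)
print(precision)               # 0.75 -> of the 8 animals predicted as cats, 6 really are cats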

c. Recall:

It measures how many of the actual positive observations are predicted correctly, i.e. how many observations of the positive class are actually predicted as positive. It is also known as Sensitivity or the True Positive Rate (TPR). Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.

Recall is defined as the ratio of the number of correctly classified positive instances to the total number of actual positive instances. In other words, out of all the actual positives, how many did we predict correctly? Recall should be high (ideally 1).

Recall is a useful metric in cases where False Negatives are a bigger concern than False Positives.

Ex 1: Suppose we are predicting whether a person has cancer or not. If a person has cancer but the model predicts that they do not, that is a False Negative, which is far more costly than a false alarm.

Ex 2: Recall is important in medical cases where it does not matter much if we raise a false alarm, but the actual positive cases must not go undetected!

Recall would be the better metric because we don’t want to accidentally discharge an infected person and let them mix with the healthy population, thereby spreading a contagious virus. Now you can see why accuracy was a bad metric for our model.
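In formula form, Recall = TP / (TP + FN). Using the cat/dog counts from above:

tp, fn = 6, 1                  # counts from the cat/dog example
recall = tp / (tp + fn)
print(round(recall, 3))        # 0.857 -> of the 7 actual cats, 6 were found by the model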

Trick to remember: Precision has the Predicted results in the denominator.

d. F1-Score / F-measure:

The F1 score is a number between 0 and 1 and is the harmonic mean of precision and recall. We use the harmonic mean because, unlike a simple average, it cannot be inflated by a single extremely large value; it is dominated by the smaller of the two.

The F1 score maintains a balance between precision and recall for your classifier: if either precision or recall is low, the F1 score is low as well.

There will be cases where there is no clear distinction between whether Precision is more important or Recall. We combine them!

In practice, when we try to increase the precision of our model, the recall goes down and vice-versa. The F1-score captures both the trends in a single value.

The F1 score is the harmonic mean of precision and recall. Compared to the arithmetic mean, the harmonic mean punishes extreme values more. The F1 score should be high (ideally 1).
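In formula form, F1 = 2 × (Precision × Recall) / (Precision + Recall). With the precision (0.75) and recall (6/7 ≈ 0.857) computed above:

precision, recall = 0.75, 6 / 7
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))            # 0.8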


3. Is it necessary to check recall (or precision) if you already have high accuracy?

We cannot rely on a single accuracy value when the classes are imbalanced. For example, suppose we have a dataset of 100 patients in which 5 have diabetes and 95 are healthy. A model that only predicts the majority class, i.e. labels all 100 people as healthy, still achieves a classification accuracy of 95% while failing to detect a single diabetic patient.
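A short sketch of that 100-patient example (assuming scikit-learn is installed) shows the problem: accuracy looks excellent while recall for the diabetic class is zero.

from sklearn.metrics import accuracy_score, recall_score

# 95 healthy patients (0) and 5 diabetic patients (1); the model always predicts "healthy"
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95
print(recall_score(y_true, y_pred))    # 0.0 -> not a single diabetic patient is detected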

4. When to use Accuracy / Precision / Recall / F1-Score?

a. Accuracy is used when the True Positives and True Negatives are more important. Accuracy is a better metric for Balanced Data.

b. Whenever False Positive is much more important use Precision.

c. Whenever False Negative is much more important use Recall.

d. F1-Score is used when the False Negatives and False Positives are important. F1-Score is a better metric for Imbalanced Data.

5. Create a confusion matrix in Python

To demonstrate with Python code, consider the dataset “predict whether someone has heart disease” based on their sex, age, blood pressure, and a variety of other measurements. The dataset has 303 rows and 14 columns.

Count plot showing how many patients have heart disease and how many do not.

Classification Report:

classification_report() takes the list of actual labels, the list of predicted labels, and an optional argument to specify the order of the labels. It reports performance metrics like precision, recall, F1-score, and support for each class.
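For example, feeding the earlier cat/dog lists to classification_report() (assuming scikit-learn) prints precision, recall, F1-score, and support per class:

from sklearn.metrics import classification_report

actual    = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']

# labels= controls the row order of the report
print(classification_report(actual, predicted, labels=['cat', 'dog']))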

Confusion Matrix:

confusion_matrix() takes the list of actual labels, the list of predicted labels, and an optional argument to specify the order of the labels. It computes the confusion matrix for the given inputs.
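Putting it together, here is a minimal sketch of the heart disease workflow described above, assuming scikit-learn and pandas are installed. The file name heart.csv, the “target” column name, and the choice of logistic regression are assumptions for illustration, not details from this article.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv("heart.csv")            # 303 rows x 14 columns, binary "target" column (assumed)
X = df.drop(columns=["target"])
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))   # precision, recall, F1-score, support per class
print(confusion_matrix(y_test, y_pred))        # rows = actual, columns = predicted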

That’s It!

Thanks for reading! I am going to write more beginner-friendly posts in the future. Follow me on Medium to be informed about them. I welcome feedback and can be reached on LinkedIn at anuganti-suresh. Happy learning!
