Confusion Matrix and its Metrics: Explained

Accuracy, Recall, Specificity, Precision, and F1 score.

Madhumitha V
7 min read · Aug 21, 2023

Say an airline company owns 100 airplanes and wants to inspect each of them to see if they are safe for travel. You have to predict whether an airplane is malfunctioning or in perfect condition for travel. There are only 2 possible outcomes, so this is a binary classification problem, and you need a supervised machine-learning algorithm to perform that classification. You choose an algorithm and run the code.

Well, now what?

How do you know if this algorithm is giving you the best result? How do you know how accurately this particular algorithm is performing its classification? How do you measure this?

Fret not, this is where the Confusion Matrix steps in.

What is a Confusion Matrix?

The Confusion Matrix is a performance measure and evaluation tool for a classification algorithm. It is a visual way to read your model’s outcomes: a table that compares predicted labels against actual labels.

Confusion Matrix

It lays out every combination of predicted outcome versus actual outcome, since those combinations are the possible results of the algorithm. In a binary classification scenario, as the figure above suggests, there are 4 such combinations. Let us understand them in terms of the airplanes.

Note:

Here, I have considered the following:

Malfunctioning = TRUE or POSITIVE

Perfect Condition = FALSE or NEGATIVE

Possible Classifications:

1. True Positive (TP):

You predicted it is true and it is actually true.

Scenario: You predicted the airplane to be malfunctioning and it is actually malfunctioning.

True Positive

2. False Positive (FP):

You predicted it is true but it is actually false.

Scenario: You predicted the airplane to be malfunctioning but it is actually perfectly functioning.

False Positive

3. False Negative (FN):

You predicted it is false but it is actually true.

Scenario: You predicted the airplane to be perfectly functioning but it is actually malfunctioning.

False Negative

4. True Negative (TN):

You predicted it is false and it is actually false.

Scenario: You predicted the airplane to be perfectly functioning and it is actually perfectly functioning.

True Negative

We want a high number of true positives and true negatives. Since a machine learning model rarely attains 100% accuracy, it is bound to make some faulty classifications, giving rise to false positives and false negatives, as shown above.

Now, coming back to our plane situation. Let us consider 100 planes that have to be inspected. Out of these planes, 70 are malfunctioning and 30 are perfectly working. We need our model to classify which planes are ready for takeoff and which planes are heading for a disaster in the clouds.

The confusion matrix will look like this:

Classification Matrix

Considering our airplane scenario, our confusion matrix will look like this:

A more visual breakdown of the Confusion Matrix

In our case, the airplane company has 100 planes, out of which 70 are malfunctioning and 30 are in perfect condition. Say, our machine learning model has finished its classification task. Out of 70 malfunctioning airplanes, it has correctly classified 55 as malfunctioning but incorrectly classified 15 as perfect. Likewise, out of 30 perfectly working airplanes, 25 were classified correctly as perfect whereas 5 were incorrectly classified as malfunctioning.

Therefore, our TP = 55, FP = 5, TN = 25, FN = 15. Let us visualize this in the confusion matrix:

Confusion matrix for airplane scenario
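If you would like to reproduce this matrix in code, here is a minimal sketch using scikit-learn. The label arrays are made up purely to match the counts above (1 = malfunctioning/positive, 0 = perfect/negative); they are an illustrative assumption, not real inspection data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels built to match the airplane scenario:
# 70 planes are actually malfunctioning (1), 30 are actually perfect (0)
y_true = np.array([1] * 70 + [0] * 30)
# The model catches 55 of the 70 faulty planes (missing 15)
# and wrongly flags 5 of the 30 healthy planes
y_pred = np.array([1] * 55 + [0] * 15 + [1] * 5 + [0] * 25)

# scikit-learn orders rows/columns by label value (0 first, then 1),
# so ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)  # 55 5 25 15
```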

Metrics:

1. Accuracy: How often is the classifier correct?

Accuracy represents the number of correct classifications (TP + TN) over the total number of instances (TP + TN + FP + FN).

Here, Accuracy = (55 + 25) / (55 + 25 + 5 + 15)

= 80/100

= 80%
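In code, accuracy is simply the two diagonal cells of the matrix over the total. A one-line sketch, assuming the tp, fp, tn, fn, y_true and y_pred variables from the snippet above:

```python
from sklearn.metrics import accuracy_score

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)                        # 0.8, i.e. 80%
print(accuracy_score(y_true, y_pred))  # same value from scikit-learn
```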

Is accuracy reliable?

Accuracy does not give a faithful picture when the dataset is imbalanced. Consider the same scenario, but say the classifier labels every single plane as malfunctioning. It still classifies all 70 malfunctioning airplanes correctly, so it scores 70% accuracy, yet it never identifies a single plane that is actually safe to fly. That discrepancy can also damage the business: because the 30 perfect planes were classified as malfunctioning, the company has to set aside extra time and resources to inspect 30 additional airplanes, causing monetary loss. Hence, accuracy alone does not always tell the full story.
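To make that concrete, here is a small sketch of the degenerate case described above: a “classifier” that flags every plane as malfunctioning still reaches 70% accuracy on this dataset while producing zero true negatives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([1] * 70 + [0] * 30)  # 70 faulty planes, 30 perfect ones
y_all_faulty = np.ones_like(y_true)     # predict "malfunctioning" for everything

print(accuracy_score(y_true, y_all_faulty))  # 0.7 -- looks respectable
print(confusion_matrix(y_true, y_all_faulty))
# [[ 0 30]   <- 0 true negatives, 30 false positives
#  [ 0 70]]  <- 0 false negatives, 70 true positives
```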

2. Recall/ Sensitivity/ True Positive Rate: When actually YES, how often does the classifier predict YES?

Recall, also referred to as Sensitivity, is the fraction of actual positives that the model correctly predicts as positive: TP / (TP + FN). It evaluates the model’s ability to find all of the actual positive instances.

Here, Recall = 55 / (55 + 15)

= 55 / 70

= 0.785

Ideally, the Recall score should be close to 1, which suggests the classifier is a good model: the positives it correctly identifies should be nearly equal to the total number of actual positives, which also means the number of False Negatives is low. A score of 0.785 says that the classifier caught 78.5 percent of the actually malfunctioning airplanes as malfunctioning and labelled the remaining 21.5 percent as perfect.
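As a quick sketch (assuming the tp, fn, y_true and y_pred variables from the first snippet), recall is the true positives divided by all actual positives, and scikit-learn’s recall_score returns the same number:

```python
from sklearn.metrics import recall_score

recall = tp / (tp + fn)
print(recall)                        # 55 / 70 = 0.7857...
print(recall_score(y_true, y_pred))  # same value
```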

3. Specificity/True Negative Rate:

Specificity is the metric that evaluates a model’s ability to correctly identify the true negatives out of all the actual negatives: TN / (TN + FP).

Here, Specificity = 25 / (25 + 5)

= 25 / 30

= 0.83

A score of 0.83 means the model correctly identified 83 percent of the genuinely perfect airplanes as perfect.
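scikit-learn has no dedicated specificity function, but specificity is just recall computed for the negative class. A small sketch, again assuming the variables from the first snippet:

```python
from sklearn.metrics import recall_score

specificity = tn / (tn + fp)
print(specificity)                                # 25 / 30 = 0.8333...
# Equivalent: treat "perfect" (label 0) as the positive class
print(recall_score(y_true, y_pred, pos_label=0))  # same value
```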

4. Precision: When it predicts YES, how often is it correct?

Precision is the fraction of predicted positives that are actually positive: TP / (TP + FP). It measures how often an instance the model labels as positive really is positive.

Here, Precision = 55 / (55 + 5)

= 55 / 60

= 0.916

Precision is high only when the number of False Positives is small relative to the True Positives. Here, the high score of 0.916 tells us that 91.6 percent of the planes the model flagged as malfunctioning really were malfunctioning; the remaining 8.4 percent of the flagged planes were actually in perfect condition.
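In code (still assuming the variables from the first snippet):

```python
from sklearn.metrics import precision_score

precision = tp / (tp + fp)
print(precision)                        # 55 / 60 = 0.9166...
print(precision_score(y_true, y_pred))  # same value
```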

Just like many concepts in Data Science, there is a trade-off between recall and precision: pushing one metric up tends to pull the other down. Depending on the scenario, we may want to maximize one at the expense of the other. Here, we would rather catch every malfunctioning airplane (high recall) even if that means mistakenly flagging a few perfect planes (lower precision), because the cost of human life outweighs a company’s resources. When we instead want an optimal blend of high precision and high recall, we can combine the two metrics using the F1 score.
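Before moving on to the F1 score, here is a rough sketch of the trade-off in action. It trains a toy logistic regression on synthetic, imbalanced data (everything here is illustrative, not part of the airplane example) and sweeps the decision threshold: as the threshold drops, recall rises and precision generally falls.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Synthetic data with roughly 70% positives, loosely mirroring our imbalance
X, y = make_classification(n_samples=1000, weights=[0.3, 0.7], random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, scores)
# Print every 100th point along the curve to see the trade-off
for p, r, t in list(zip(precision, recall, thresholds))[::100]:
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```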

5. F1 Score

The F1 score is the harmonic mean of precision and recall, taking both metrics into account in the following equation:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The harmonic mean encourages similar values for precision and recall. That is, the more the precision and recall scores deviate from each other, the worse the harmonic mean. If we want to create a classification model with the optimal balance of precision and recall, then we try to maximize the F1 score.

In the airplane scenario, the F1 score is 2 × (0.916 × 0.785) / (0.916 + 0.785) ≈ 0.845.
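The same number falls out of a short sketch (assuming the tp, fp, fn, y_true and y_pred variables from the first snippet); computed from the exact counts it comes to about 0.846, a whisker above the 0.845 obtained from the rounded inputs:

```python
from sklearn.metrics import f1_score

prec = tp / (tp + fp)                 # 0.9166...
rec = tp / (tp + fn)                  # 0.7857...
f1 = 2 * (prec * rec) / (prec + rec)
print(f1)                             # 0.8461...
print(f1_score(y_true, y_pred))       # same value from scikit-learn
```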

A deeper, more mathematical explanation of the F1 score will be covered in later blogs.

I hope at least a little of your confusion about the Confusion Matrix has now cleared up! If you liked this post, a few claps 👏 would be a welcome bit of extra motivation. I am always open to your questions and suggestions.

You can reach me at:

LinkedIn: www.linkedin.com/in/vmadhuuu

Github: https://github.com/vmadhuuu

Thanks for Reading!
