Confusion Matrix: Is my Classifier any good?

The Story of the AI who cried wolf

Badreldin Mostafa
unpack
5 min read · Mar 15, 2021

Alarms blare through the loudspeakers in the guard's shed: a wolf has been detected near the camping area. Indifferently, the guard reaches for his mouse, clicks ‘Cancel alarm’, and goes back to his nap. When asked later, after some food was found missing, he replied: “This useless computer always mistakes my puppy for a wolf”. Luckily, it was just some food this time.

As we increasingly depend on AI and machine learning for mission-critical operations, from medical diagnosis to self-driving cars, it is essential to ensure the quality of our models. In this article, we will focus on the confusion matrix and the set of metrics derived from it for judging the quality of a classification model.

We might use different algorithms to train our models, and we need a reliable way to judge their quality. We might train a logistic regression, a KNN, a random forest, or even a neural network on our data set and notice different performance between them; occasionally there is a clear winner, but that is not always the case. This is where the confusion matrix comes to the rescue.

Assume we want to identify, from our dataset, whether a person is at risk of hypertension or another disease based on a set of health variables. In this case, the output is binary (Yes/No): at risk or not at risk. Ultimately, we want the model to correctly identify the people at risk (Positive), which we call a True Positive (TP), and those not at risk (Negative), which we mark as a True Negative (TN). But since we do not live in a perfect world, models usually misjudge some observations: a False Positive (FP) is when the model identifies a person as at risk while they are in fact healthy, and a False Negative (FN) is when it identifies a person as healthy while they are in fact at risk. A confusion matrix summarizes these results in a visually appealing way. Assume that after training our model we arranged a test set of 100 observations (a hundred people), half of whom we know are actually at risk and half healthy. We hide the labels, test the model, and get the following results:

In this case, the model correctly identified 40 people at risk who were actually at risk (TP), but unfortunately mistook 10 at-risk people as healthy (FN). Likewise, it identified 45 healthy people as healthy (TN), but mistook 5 healthy people as at risk (FP); imagine if you were in their place.
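
For concreteness, here is a minimal sketch of how such a matrix could be computed in Python with scikit-learn. The label arrays are made up so that they reproduce the counts described above (1 = at risk, 0 = healthy):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic ground truth: 50 people at risk, 50 healthy.
y_true = np.array([1] * 50 + [0] * 50)

# Synthetic predictions matching the counts above:
# 40 at-risk people caught (TP), 10 missed (FN),
# 45 healthy people cleared (TN), 5 wrongly flagged (FP).
y_pred = np.array([1] * 40 + [0] * 10 + [0] * 45 + [1] * 5)

# Rows are the true classes, columns the predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
# [[40 10]
#  [ 5 45]]
```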

Different models will result in different distributions within the confusion matrix, so we define a few metrics to judge the quality of the model, starting with accuracy. Accuracy is the number of correct predictions (TP + TN) out of all the observations (100). In this case, it is (40 + 45) / 100, or 85%. The higher the accuracy, the better the model.
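
As a quick sanity check, the same number falls straight out of the four cells (a back-of-the-envelope sketch using the counts from the example above):

```python
TP, FN, FP, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.85
```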

However, accuracy is usually insufficient to judge a model, especially in the case of unbalanced data sets. Let us revisit our camp. In this area there are wolves, bears, and puppies. Puppies are to be left alone, but bears and wolves each have a different safety protocol the guards need to follow. Our algorithm needs to distinguish between all three of them; notice that this is no longer a simple binary classification, as we now have three possible classes. One model (say, logistic regression) might have higher overall accuracy than another model (say, KNN), which can lead us to prefer it. But each model might be making different types of mistakes. For instance, model 1 might have 90% accuracy, better than model 2 with 75%; however, most of model 1's errors might come from being very bad at detecting wolves while being perfect with bears and puppies. Model 2 might be somewhat worse with puppies and bears (but not considerably so) while also being good at identifying wolves. In that case, we might opt to choose model 2.
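
To make that concrete, here is a small sketch with a hypothetical 3-class confusion matrix; the counts are invented purely to illustrate how a decent overall accuracy can hide a model that is bad at one class:

```python
import numpy as np

classes = ["wolf", "bear", "puppy"]

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([
    [ 4,  3, 13],   # true wolves: only 4 of 20 recognised as wolves
    [ 1, 38,  1],   # true bears
    [ 0,  2, 38],   # true puppies
])

overall_accuracy = cm.trace() / cm.sum()           # 0.80 -- looks respectable
per_class_recall = cm.diagonal() / cm.sum(axis=1)  # but wolf recall is only 0.20

print(overall_accuracy)
print(dict(zip(classes, np.round(per_class_recall, 2))))
```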

Other metrics are therefore needed for judging the models, namely precision and recall, which we compute for each class separately.

Precision for a class, let’s say the wolves class, is how many times we had true positives (we identified an observation as a wolf and it was actually a wolf) divided by all the observations we identified as wolves, whether truly wolves or not: Precision = TP / (TP + FP).

Recall, however, compares the true positives to all the observations that actually belong to the class, i.e. the true positives plus the false negatives (the times it was supposed to identify a wolf but failed to do so): Recall = TP / (TP + FN).
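
Putting the two together, here is a minimal sketch of per-class precision and recall computed straight from a confusion matrix, reusing the invented 3-class matrix from the previous sketch (rows = true class, columns = predicted class):

```python
import numpy as np

classes = ["wolf", "bear", "puppy"]
cm = np.array([
    [ 4,  3, 13],
    [ 1, 38,  1],
    [ 0,  2, 38],
])

tp = cm.diagonal()
precision = tp / cm.sum(axis=0)  # TP / (TP + FP): everything *predicted* as that class
recall    = tp / cm.sum(axis=1)  # TP / (TP + FN): everything that *truly is* that class

for name, p, r in zip(classes, precision, recall):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```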

A good model balances its precision and recall to ensure that it is relatively good with all classes, even if we have an imbalance in the number of observations between classes.

For that reason, we use another metric, the F1 score, which is derived from both precision and recall. The F1 score is their harmonic mean: F1 = 2 × (precision × recall) / (precision + recall). (A nice geometric picture: plot precision and recall as two vertical bars and draw a diagonal from the top of each to the bottom of the other; the two diagonals cross at exactly half the harmonic mean.) Because the harmonic mean is dragged down by the smaller of the two values, the model with the highest F1 score is the better overall performer.
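
In code the F1 score is a one-liner; here is a tiny sketch, evaluated on the (illustrative) wolf-class numbers from the sketches above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Wolf class from the hypothetical matrix: precision 0.80, recall 0.20.
print(f1(0.80, 0.20))  # 0.32 -- dragged down by the poor recall
```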

Judging a model’s performance is extremely important, and hopefully the guard will start trusting our optimized classifier and save the campers' food in the future.

An important note: depending on the situation, we might prefer one type of error over the other. It is better to have more false positives than false negatives in the case of wolves; if we fail to identify a wolf when there is one, the cost is too high. So careful judgment always needs to be used when designing machine learning and AI models.
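
One possible way to bake that preference into the evaluation (my own suggestion, not something the article's metrics require) is an F-beta score with beta > 1, which weights recall more heavily than precision; scikit-learn provides this as fbeta_score. A minimal sketch with made-up binary labels (1 = wolf, 0 = not a wolf):

```python
from sklearn.metrics import fbeta_score

# Made-up labels purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

# With beta=2, recall is weighted four times (beta**2) as heavily as
# precision, so missed wolves hurt the score more than false alarms do.
print(fbeta_score(y_true, y_pred, beta=2))
```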

Sources:

  1. https://towardsdatascience.com/multi-class-metrics-made-simple-part-i-precision-and-recall-9250280bddc2
  2. https://www.youtube.com/watch?v=85dtiMz9tSo
  3. https://www.youtube.com/watch?v=HBi-P5j0Kec
  4. https://www.youtube.com/watch?v=aDW44NPhNw0
