Classification Performance Metrics for Machine Learning

Mattison Hineline
7 min read · Feb 6, 2023


Defining how well your classification model did in easy-to-understand terms.


In this post, we’ll discuss the most common classification performance metrics: what makes a model good? We will start by defining what a classification model is, then cover the key concepts these metrics are built on, and finally the metrics themselves. We will cover accuracy, precision, recall, specificity, F1-score, and ROC-AUC. If you’re ready to learn, read on and get some knowledge!

Classification vs. Regression

Before we jump in, let’s define what a classification problem is. In machine learning, there are two overarching types of models: classification and regression. Regression models try to predict a numerical value. For example, a regression problem could ask, “How much will this house sell for?” In comparison, classification models try to sort the data into groups. We could ask, “Is this a picture of a dog, cat, or bird?” or “Does this patient have cancer?” In other words, a regression problem answers continuous-variable questions and a classification problem answers discrete-variable questions. (If you have a regression model, check out my post here.)

Confusion Matrix

For us to understand the following metrics we must first understand the confusion matrix. We will cover this very briefly in this post, but you can find a more detailed account of confusion matrices in my previous post. In a nutshell, a confusion matrix gives us an easy visual summary of how well a model did with respect to each class, broken into four components. From it, we can get the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Again, if this doesn’t seem familiar to you, please check out my previous post on the confusion matrix.
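As a quick refresher, here is a minimal sketch of how you might pull those four counts out of a confusion matrix with scikit-learn (the label lists are made up purely for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary 0/1 labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
```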

Note on accuracy and precision

Before we move into accuracy, precision, and the other metrics, let’s look at a quick overview of just the first two: accuracy and precision. Take a minute to look at the target practice image below. We can quickly see that being both accurate and precise is the ideal outcome. But we can also have one without the other, or neither. It’s important to keep this picture in mind as you learn about accuracy and precision; just because you have a precise model (but not an accurate one) or, in contrast, an accurate model (but not a precise one) does not necessarily mean you have a good model. You always need to take into account what your project goal is and report the correct metric for your situation.

Without further ado, let’s jump into the metrics!

Accuracy

Accuracy is a common word, but what does it really mean, especially with regard to machine learning? In short, it is the proportion of correctly classified values out of all outcomes. It gives us a fairly good overall picture of how well our model did, which is why accuracy is a reasonable go-to metric in many cases. We can calculate accuracy with the formula below.
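In terms of the confusion matrix counts, that formula is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)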

Before we get too excited about accuracy as an “overall performance metric”, we need to keep in mind the cases where accuracy is not a good metric to use. Specifically, accuracy is a poor choice when working with unbalanced datasets. Even if the model does poorly, accuracy can report 90–99%! That is very misleading. To make this more concrete, let’s look at a short example.

Imagine you have 100 patients who are worried they have a deadly disease, such as cancer. All 100 patients come into our lab and we decide to test our new classification model. Ninety-eight (98) of the patients are completely healthy (i.e. no cancer at all), but two (2) have a high-stage cancer and need medical attention soon to combat it. We run our model (predicted labels) and compare it to the doctors’ results (true labels). We see that our model gets 98% accuracy! Wonderful! But wait: after we dig a little deeper, we see that our model classified everyone as healthy. The accuracy was calculated as 98 correct / 100 total patients. If we were to trust our 98% accuracy model, we would have sent those two sick patients home to die because the model said they don’t have cancer!
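Here is a small sketch of that scenario in scikit-learn (the labels are invented to match the example above):

```python
from sklearn.metrics import accuracy_score, recall_score

# 98 healthy patients (0) and 2 patients with cancer (1)
y_true = [0] * 98 + [1] * 2
# A lazy "model" that predicts everyone is healthy
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.98 -- looks great on paper
print(recall_score(y_true, y_pred))    # 0.0  -- misses every sick patient
```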

This may seem like an extreme example, but it really highlights how accuracy can be misleading. Be sure, in your own projects, to be diligent and always choose the right metric. As for the outside world, I always recommend researching how a statistic was calculated before trusting it at face value.

Precision

Precision is the proportion of correctly identified positives to all predicted positives. This tells us how well the model does at detecting relevant values.
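In confusion matrix terms:

Precision = TP / (TP + FP)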

Let’s take a common example: spam filters for emails. In this case, we want high precision (potentially at the cost of lower recall). Imagine you are awaiting an important email. If your model filters out a “normal” email (i.e. it is classified as “spam”), then you will not receive that important email. Too many of these false positives means low precision. However, if we have high precision, then we may be okay with receiving a few spam emails as long as we always get our important emails.

Recall

In contrast to precision, recall (also known as sensitivity) is the true positive rate: the proportion of actual positives that were correctly classified as positive.
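In confusion matrix terms:

Recall = TP / (TP + FN)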

Again, let’s check out another common example: cancer detection. If we use our classification model to decide who does and who doesn’t have cancer, we will always want to err on the side of caution. Essentially, even if a person doesn’t truly have cancer but our model says they do, it is better to run a few extra tests than to send that patient home thinking they are healthy. This is high recall: we may have more false positives as long as we catch more true positives.

F1-Score

We can see how precision and recall counteract each other a bit. This can cause issues when deciding which metric to use, especially if the stakes aren’t as high as deadly cancer detection! For this reason, we can use a combination of both precision and recall in a metric called the F1-Score.

The F1-Score is the harmonic mean of precision and recall, which gives you a good overall metric for determining model performance. The F1-Score accounts for both false negatives and false positives, which also makes it a better choice than accuracy on unbalanced datasets.
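The formula is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)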

Specificity

Specificity is the true negative rate: the proportion of actual negatives that were correctly classified as negative.
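In confusion matrix terms:

Specificity = TN / (TN + FP)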

This can be a useful metric to use alongside recall and/or precision. Imagine we have our spam detector again, with high precision. If we also have high specificity, then we know that when the model classifies something as “not spam”, we can be fairly confident that it really is not spam. Similarly, if our cancer detection model has high recall and high specificity, we can be confident that when a patient is classified as “non-cancerous”, it is probably correct.

ROC-AUC

The Receiver Operating Characteristic curve (ROC curve) shows us how the classification model performs at different decision thresholds by plotting the true positive rate against the false positive rate. Along with the ROC curve, we have the Area Under the Curve (AUC), which summarizes the curve as a single number. Generally, we want the AUC score to be as high as possible (close to 1), which means the model is good at separating positives from negatives. Visually, the closer the curve is to the top-left corner, the more area under the curve, and the higher the AUC score.

To plot the ROC curve, we need the True Positive Rate (TPR), which we can get directly from recall (they’re the same thing!). We can get the False Positive Rate (FPR) by computing 1 − specificity.
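As a rough sketch, here is one way to plot an ROC curve and compute the AUC with scikit-learn and matplotlib (the labels and predicted probabilities below are made up purely for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7]

# roc_curve sweeps the decision threshold and returns FPR/TPR pairs
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guessing")
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()
```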

Closing Thoughts

Overall, depending on your classification problem, you will need to decide which metric to use and try to optimize it! We just covered a lot of classification performance metrics so now you may be wondering, what if I have a regression problem? In my next post, we’ll cover regression performance metrics! Check it out here.

Thank you for reading! Find me on LinkedIn.

I would also greatly appreciate any feedback, constructive criticism, and input on what you read. Feel free to reach out to me via LinkedIn or comment directly on this post.
