The Inaccuracy of Accuracy

The Importance of Understanding Your Classification Metrics

Payson Chadrow
Jul 13, 2020

So you’ve got your data. You’ve cleaned it up, investigated a little, and built yourself a classification model. That’s great! But how do you know if it’s a good model? That question is what separates good data science from bad. Unfortunately, the answer is typically “it depends”, and exactly what it depends on varies with the data you’re working with and what you’re trying to find in it. This blog can’t, and isn’t intended to, tell you whether you have a good model. Its goal is to help you better understand your model and point you toward ways to improve it.

Accuracy: What is it?

Accuracy is probably the most common and intuitive metric used to grade classification models. It is essentially a measure of how much of the data your model predicted correctly. Mathematically it is:

Accuracy = (True Positives + True Negatives) / Total Predictions

Low accuracy: your model isn’t performing well and is failing to identify the “True” values in your data.

High accuracy: you are getting a lot of things right…but is that a good thing?

The intuitive answer is yes. The initial response to seeing a high accuracy score is typically excitement, and it’s easy to understand why. Your model is getting a lot of things correct, so what could be so bad about that? Honestly, there might be nothing wrong with it, but what your model is incorrectly predicting could be exactly what makes it a problem.
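To make that concrete, here is a minimal sketch of the calculation using scikit-learn; the label lists are made-up placeholders, not data from the project discussed below:

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions (1 = positive, 0 = negative)
y_test = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# 8 of the 10 predictions match the true labels
print(accuracy_score(y_test, y_pred))  # 0.8
```

That “model” predicts 1 for absolutely everything and still scores 80%, which is exactly the kind of trap the rest of this post is about.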

False Positives, False Negatives, and The Confusion of The Matrix

When scoring a binary classification model, each observation in your data falls into one of four buckets (the sketch just after this list shows how scikit-learn lays these counts out).

True Negative

True Positive

False Negative

False Positive
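If you want these four counts directly, scikit-learn’s confusion_matrix will hand them to you. A minimal sketch, again with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (1 = positive, 0 = negative)
y_test = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# For binary labels, scikit-learn orders the counts as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)  # 0 2 0 8
```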

As mentioned above, it’s the True Positive/Negative values that determine our accuracy score. What determines the validity of that accuracy score are our False Positive/Negative values. The best way to understand this is to visualize it, which is where the confusion matrix comes in handy. Before we talk about the confusion matrix, though, I think it’s important to understand what a False Positive and a False Negative are and the difference between them.

Let’s think about crime for a moment. Imagine a murder trial where the jury is voting on whether to convict the suspect. The suspect is sentenced to life in prison even though they were actually innocent. This would be an example of a False Positive. Now, imagine that same trial where this time the jury declares the suspect not guilty, and it turns out the suspect had indeed committed the crime. This would be your False Negative. Depending on your data and the context, getting False Positives/Negatives may be perfectly okay, or they may need to be avoided at all costs. This is where it becomes vitally important to be familiar with your data and what it is you’re trying to determine from it. Typically you want to favor one over the other or find a balance between the two, but that’s something you need to identify as you work with your data and start training your models.
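One common way to favor one error type over the other, a general technique rather than anything from the crash project, is to classify on the predicted probability with a custom cut-off instead of the default 0.5. A minimal sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and model, just to show the idea (not the crash data)
X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Move the decision threshold to trade one error type for the other
proba = model.predict_proba(X_test)[:, 1]        # probability of the positive class
y_pred_cautious = (proba >= 0.8).astype(int)     # fewer False Positives, more False Negatives
y_pred_sensitive = (proba >= 0.2).astype(int)    # fewer False Negatives, more False Positives
```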

Once you’ve made up your mind as to whether your model should favor False Positives, False Negatives, or a balance, it then helps to visualize your results in a confusion matrix.

Confusion Matrix #1

This confusion matrix is from my class project on identifying causes of accidents based on the Chicago Traffic Crash data. It shows the results of determining whether an observation was or wasn’t due to ‘driver error’. The y-axis, or True label, contains rows representing the ACTUAL values in the test data, and the Predicted label on the x-axis contains columns indicating what your model classified each observation as. Adding up the values in a row gives you the total number of actual positive or actual negative observations in the test data. This can be easier to conceptualize when represented with percentages, which you can do by passing in the argument normalize="true" (yes, in this instance true is NOT capitalized; it’s a string, not the Python keyword).
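Here is a minimal sketch of producing a row-normalized matrix like this with scikit-learn; the labels are placeholders rather than the actual crash data, and a reasonably recent scikit-learn (0.22 or later) is assumed for the normalize argument:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Placeholder labels standing in for the test set:
# 1 = "driver error", 0 = "non-driver error"
y_test = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]

# normalize="true" (a lowercase string, not the Python keyword True) scales each
# row so it sums to 1, so every cell becomes a fraction of that ACTUAL class
cm = confusion_matrix(y_test, y_pred, normalize="true")
ConfusionMatrixDisplay(cm, display_labels=["non-driver error", "driver error"]).plot()
plt.show()
```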

Confusion Matrix for Training Data

Here we can see that each row, when added together, totals 1, or 100%. The diagonal from top left to bottom right holds our True values, the cells at (0,0) and (1,1). As you can see, these boxes are bright yellow in this example because their rates are high. This model is classifying almost everything correctly, with a very small chance of producing a False Negative (0.034) and a slightly higher chance of producing a False Positive (0.092).

It’s important to note that this matrix is different from Confusion Matrix #1: this is a matrix of the training data, while Confusion Matrix #1 is a matrix of the test data. Look again at the first matrix. Is there anything there that stands out?

Our number of True Positives is far higher than every other value. This is due to class imbalance: 90% of my data was labeled as “driver error”. Class imbalance is a huge reason not to trust your accuracy score alone. While my training matrix looked excellent, the confusion matrix for my test data told a much different story.

Confusion Matrix for Test Data

Looking at just the accuracy score, you’d assume this is a great model, because it scored 93%. What does the confusion matrix tell you, though? Here we see that 97% of our ‘driver error’ observations are being correctly classified. Yet the model is also misclassifying 68% of our ‘non-driver error’ data, resulting in a large probability of a False Positive identification. What we have here is an extremely biased model.

Going back to our crime thought experiment, would you be okay with a court system that correctly identified guilty suspects 97% of the time but falsely convicted innocent people 68% of the time? Hopefully you said no. That 68% is a pretty large margin of error. You will always have some margin of error, and as stated previously, you’ll need to identify what type of error your model should prefer and how much of it you’re willing to allow. Depending on your data and what you’re trying to identify, 68% might be acceptable in some cases, but more likely than not, and certainly in my case, you will want a much smaller margin of error.
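To make those numbers concrete, the two rates are just cells of the row-normalized test matrix. The values below are hard-coded approximations of the test matrix described above, purely for illustration:

```python
import numpy as np

# Row-normalized test matrix: rows are the ACTUAL class, columns the predicted class.
# Values approximate the test matrix discussed above, for illustration only.
cm_norm = np.array([[0.32, 0.68],   # actual non-driver error: 68% wrongly flagged as driver error
                    [0.03, 0.97]])  # actual driver error: 97% correctly caught

false_positive_rate = cm_norm[0, 1]  # share of 'non-driver error' cases flagged as driver error
true_positive_rate = cm_norm[1, 1]   # share of real 'driver error' cases caught
print(false_positive_rate, true_positive_rate)  # 0.68 0.97
```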

Accuracy and Friends

Accuracy can be great for getting you in the ballpark, but it shouldn’t be the metric you base your entire model on. Two other popular metrics to use are Precision and Recall. Essentially, these measure how conservative or how biased your model is. The two tend to trade off against each other, so when one goes up, the other usually goes down.

Precision can be thought of as your measurement of how conservative your model is.
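In terms of the confusion matrix counts, precision is the share of positive predictions that were actually positive:

Precision = True Positives / (True Positives + False Positives)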

The higher your precision score, the more conservative your model, because it needs more evidence before it’s willing to call something positive. You would want a high precision score when you want your model to classify things as positive only when it’s absolutely sure they’re positive. If your data has a lot of outliers this may not be an effective metric, or it could help identify some determining factors of those outliers.

Recall can be thought of as a measurement of model bias. It can also be thought of as the “better safe than sorry” metric.
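Recall, by contrast, is the share of actual positives that the model managed to catch:

Recall = True Positives / (True Positives + False Negatives)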

A high recall score indicates more bias, because your model has a lower threshold for classifying things as positive. Imagine your model is being used to identify an infectious disease. This would be a model where you’d most likely prefer a high recall score, as it’s better to falsely flag the disease than to accidentally fail to identify it. In my model above, I had a higher recall score than precision score because my model was very likely to classify almost anything it saw as positive for ‘driver error’, hence the bias.

The last metric we’ll discuss here is the F1 score. The F1 score measures the balance between Precision and Recall.
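Concretely, the F1 score is the harmonic mean of the two:

F1 = 2 × (Precision × Recall) / (Precision + Recall)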

If one score is much lower than the other, your F1 score will be dragged down toward that lower score. Whereas if both scores are similar, your F1 score will land right around them. So if Precision and Recall are both high and similar, your F1 score will be high as a result. If you’re looking for a single metric that describes the general overall performance of your model, the F1 score is typically the way to go.
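All three metrics are one-liners in scikit-learn; a minimal sketch with placeholder labels (not the crash data):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Placeholder labels: an imbalanced sample where the model over-predicts the positive class
y_test = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

print(precision_score(y_test, y_pred))  # 8 / (8 + 1) = 0.889
print(recall_score(y_test, y_pred))     # 8 / (8 + 0) = 1.0
print(f1_score(y_test, y_pred))         # 2 * (0.889 * 1.0) / (0.889 + 1.0) = 0.941
```

If you want all of them at once, classification_report(y_test, y_pred) prints precision, recall, and F1 for each class in a single table.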

Importance of Understanding

A lot of factors go into interpreting a model, and even more go into adjusting and tuning one. Before you can fix a model, though, it’s important to be able to understand it. Hopefully this helps you identify what it is you want your model to do and where it may be lacking in performance. If you can identify your problem areas, you’re better suited to find solutions for those specific problems. Confusion matrices are extremely helpful for visualizing your performance, but when evaluating based on metrics, just always remember: accuracy alone never tells the whole story.
