Why you should not trust accuracy alone to measure machine learning performance

Wilame
5 min read · Sep 6, 2018


From June 2020, I will no longer be using Medium to publish new stories. Please visit my personal blog if you want to continue reading my articles: https://vallant.in.

At first glance, accuracy seems like a perfect way to measure whether a machine learning model is behaving well. In classification models, it tells you what proportion of the predictions the model got right.

Accuracy is simple to calculate. You can check the accuracy of your model by simply dividing the number of correct predictions (true positives + true negatives) by the total number of predictions.
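As a quick illustration, here is a minimal Python sketch of that formula. The counts are made up for the example and are not the numbers used later in this article:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Made-up counts: 70 true positives, 20 true negatives,
# 6 false positives, 4 false negatives -> 90% accuracy.
print(accuracy(tp=70, tn=20, fp=6, fn=4))  # 0.9
```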

So, let’s analyse an example. Imagine you have to make 1,000 predictions. At the end of the process, your confusion matrix returned the following results:

This is not bad at all! With your model, you got an accuracy of 92%. But wait: what happens if you decide to simply predict everything as true? You won’t use any model this time. Just assign true to ALL the predictions. What happens?

If you do, you STILL get good accuracy. It dropped a little, but 88.5% is a good score.
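The original confusion-matrix chart is not reproduced here, but any split with 885 actual positives out of 1,000 observations and 920 correct predictions fits the figures above. The exact TP/TN/FP/FN values below are an assumption, included only to make the two accuracies concrete:

```python
# One hypothetical split consistent with the article's figures
# (only the totals, 92% and 88.5%, come from the text; the split is assumed).
tp, fn = 870, 15   # 885 actual positives
tn, fp = 50, 65    # 115 actual negatives

total = tp + fn + tn + fp              # 1,000 predictions
model_accuracy = (tp + tn) / total     # (870 + 50) / 1000 = 0.92

# Baseline: label every observation as the majority class ("true").
# You get all 885 positives right and all 115 negatives wrong.
baseline_accuracy = (tp + fn) / total  # 885 / 1000 = 0.885

print(model_accuracy, baseline_accuracy)  # 0.92 0.885
```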

And that’s why accuracy alone is not a trustworthy way to evaluate a model.

What you have to keep in mind is that accuracy alone is not a good evaluation option when you work with class-imbalanced data sets.

In fact, in this example, our model is only 3.5 percentage points better than using no model at all. So why use a model if blindly guessing the majority class does almost as well? How do you know whether a model is really better than just guessing?

The baseline

That’s why you need a baseline. A baseline is a reference against which you can compare algorithms. The notion of good or bad only makes sense when you have a basis for comparison.

To create a baseline, you do exactly what I did above: select the class with the most observations in your data set and ‘predict’ everything as that class. Then you will know what your accuracy would be if you didn’t use any model.
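If you use scikit-learn, a majority-class baseline like this is a one-liner with DummyClassifier. This is just a sketch with toy, imbalanced labels; your own features and labels go where X and y are:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy imbalanced labels: 885 positives, 115 negatives. The features are
# irrelevant to a majority-class baseline, so random numbers are fine here.
y = np.array([1] * 885 + [0] * 115)
X = np.random.rand(len(y), 3)

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

print(accuracy_score(y, baseline.predict(X)))  # 0.885, the number to beat
```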

Class-balanced data sets will have a baseline of roughly 50%. But the vast majority of data sets are not balanced. And even when they are, it’s still worth checking which class is more frequent in the set.

If your accuracy is not very different from your baseline, it may be time to consider collecting more data, changing the algorithm or tweaking it. Or maybe you just have a very hard, prediction-resistant problem.

Finding the CAP

The CAP, or Cumulative Accuracy Profile, is a powerful way to measure the quality of a model. It shows how many of the positive cases the model captures as you work through the observations it ranked highest, compared against our baseline.

Let’s see an example. Imagine you work for a company that’s constantly s̶p̶a̶m̶m̶i̶n̶g̶ sending newsletters to its customers. Let’s say that usually, 5% of the customers click on the links in the messages. You don’t do any specific segmentation. You just send your emails.

Now, you have deployed a brand new model that accounts for gender, where the customers live and their age, and you want to test how it performs. You send the same number of emails as before, but this time to the customers your model predicts are most likely to respond.

These are the results:

The blue line is your baseline, while the green line is the performance of your model. The gap between them shows that your model was able to identify which customers were more likely to respond to your newsletter.

But wait, imagine that you are a magician and that you are capable of building a WOW model. In this scenario, you would have the perfect CAP, represented now by a yellow line:

In fact, you evaluate how powerful your model is by comparing it to the perfect CAP and to the baseline (or random CAP). A good model will remain between the perfect CAP and the random CAP, with a better model tending toward the perfect CAP.
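To make the picture concrete, here is a rough sketch of how you could draw the three curves yourself with NumPy and Matplotlib. The scores and click labels are simulated, so the exact shape of the model curve is an assumption, not the chart from this article:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Simulated data: 10,000 customers, ~5% of whom click.
n = 10_000
clicked = rng.random(n) < 0.05
# Fake model scores that are loosely correlated with clicking.
scores = clicked * rng.random(n) + rng.random(n) * 0.7

# Rank customers from highest to lowest score and accumulate the clicks.
order = np.argsort(-scores)
cum_clicks = np.cumsum(clicked[order]) / clicked.sum()
contacted = np.arange(1, n + 1) / n

# Perfect model: all the clickers are contacted first.
perfect = np.minimum(contacted / clicked.mean(), 1.0)

plt.plot(contacted, contacted, color="blue", label="Random (baseline)")
plt.plot(contacted, cum_clicks, color="green", label="Model CAP")
plt.plot(contacted, perfect, color="gold", label="Perfect CAP")
plt.xlabel("Customers who received the newsletter")
plt.ylabel("Customers who clicked")
plt.legend()
plt.show()
```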

CAP Analysis

A good way to analyse the CAP is to project a vertical line from the 50% mark on the “Customers who received the newsletter” axis up to the point where it touches your model’s curve, then read off the corresponding value on the “Customers who clicked” axis. Call that value ‘X’ (a small code sketch after the list below shows one way to compute it).

  • If your ‘X’ value is lower than 60%, build a new model, because the current one is not meaningfully better than the baseline.
  • If your ‘X’ value is between 60% and 70%, it’s a poor model.
  • If your ‘X’ value is between 70% and 80%, you’ve got a good model.
  • If your ‘X’ value is between 80% and 90%, you have an excellent model.
  • If your ‘X’ value is between 90% and 100%, it’s probably a case of overfitting.
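Here is the sketch promised above: a small helper that reads the model curve at the 50% mark and applies the rule of thumb. It assumes the `contacted` and `cum_clicks` arrays built in the earlier CAP sketch, so the printed value is illustrative only:

```python
import numpy as np

def cap_value_at_half(contacted, cum_clicks):
    """Share of all clickers captured once 50% of customers are contacted."""
    return float(np.interp(0.5, contacted, cum_clicks))

def cap_verdict(x):
    """Apply the rule-of-thumb thresholds from the list above."""
    if x < 0.6:
        return "not meaningfully better than the baseline: build a new model"
    if x < 0.7:
        return "poor model"
    if x < 0.8:
        return "good model"
    if x < 0.9:
        return "excellent model"
    return "suspiciously good: probably overfitting"

# Usage with the arrays from the CAP sketch:
# x = cap_value_at_half(contacted, cum_clicks)
# print(f"{x:.0%}:", cap_verdict(x))
```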

Conclusion

Accuracy is a simple way of measuring the effectiveness of your model, but it can be misleading. Don’t rely on this measurement alone to evaluate how well your model performs. Try other measures and diversify them. You don’t have to abandon accuracy. Just realize that sometimes it’s not telling the whole story.
