Accuracy is not accurate.

Fardin Ahsan
Feb 18, 2022 · 7 min read


Why evaluation metrics matter and why you should beware of the man of one metric.

Source: reddit/ProgrammerHumor

Too long; didn't read.

I am writing this article because the layman is guilty of committing sins of probabilistic thinking at a rate that would leave God no option but to banish his entire clan to hell, and it would be very embarrassing for students who are learning ML and data science to actually commit those same sins.

But also 'accuracy' is a word that is thrown in our face all over the place by businesses and advertising. 'Our product is 99% accurate at doing this', 'We can find that with 99.9% accuracy!'. I shudder at the thought of the millions of dollars of wasted venture capital funding the next sham startup that claims to predict the next unpredictable thing with shockingly high accuracy, or at the thought of countless actually good projects thrown aside by statistically illiterate managers in favor of projects with more "accuracy".

Actual tl;dr:

  1. Never rely on only one evaluation metric.
  2. Choose the correct metric for the task.
  3. Be aware of reasoning traps and statistical fallacies/paradoxes.
  4. Number 3 doesn't apply only to data scientists, but to everyone.

Who cares?

If you ever have to build a machine learning or statistical model that predicts anything, at any level of rigor (a school project, a scientific paper, or a production codebase), you should care, because using the wrong evaluation metric can not only make your predictive model entirely useless at what it should do, but oftentimes make it entirely counterproductive.

Using the right metric to evaluate something isn't limited to statisticians, data scientists, or scientists for that matter; just making sense of the world (and of what you read in the news) requires some understanding of how to make sense of evaluation metrics (especially binary classification ones).

Suppose you own a shop and need to hire a security guard. You have two guards to choose from: guard A and guard B. A is an optimist; his answer to "Will my shop get robbed today?" is always "no". B will say "yes" 10% of the time. In reality your shop gets robbed on 1% of days. Notice how A's predictions will be 99% accurate, while B's are only around 90% accurate (on average). Also notice how guard A is the same as having hired no guard at all. So what gives? How is hiring no guard at all more accurate? It's obvious in this case that it's not how the guard acts on the uneventful days that matters, but how he acts on the days you do get robbed. However, the way accuracy can lead us astray is not so obvious when the situation gets more abstract.
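Assuming guard B's "yes" calls are independent of whether a robbery actually happens, the two accuracies work out to roughly:

$$\text{Acc}_A = P(\text{no robbery}) = 0.99, \qquad \text{Acc}_B = 0.1 \cdot 0.01 + 0.9 \cdot 0.99 \approx 0.89$$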

The blogpost above (other than inspiring the previous example) explores the phenomenon of how being correct 99% of the time isn’t worth anything at all most of the time and why understanding that is often of life or death importance.

The Base Rate Fallacy

Suppose you think you have some ultra super mega scary disease and your doctor thinks you need to get tested for it, and you do, and the test comes back positive. The test is 95% accurate. Time to say goodbye to all your loved ones?

Not so fast.

How scared you should be is directly proportional to how many other people have that disease. But wait, isn't the test 95% accurate?! Here's the catch: P(you have the disease | positive test) is a conditional probability, and Bayes' theorem tells us it depends on the flipped conditional, P(positive test | you have the disease), and, crucially, on the base rate of the disease. The fallacy is in ignoring that base rate.

Good ole Bayes’ Theorem
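Writing D for "you have the disease" and + for "the test came back positive":

$$P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)}$$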

And that’s where the devil lies.

Back to the equation that decides your fate on this Earth. Let's assume the test only spits out false positives at a rate of 5%, i.e. it's 95% accurate, and that the population has an infection rate of 40%. Let's plug these numbers into Bayes' theorem.
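Taking the sensitivity to also be 95% (i.e. the test catches the disease 95% of the time when it is actually there), the numbers work out to:

$$P(D \mid +) = \frac{0.95 \cdot 0.40}{0.95 \cdot 0.40 + 0.05 \cdot 0.60} = \frac{0.38}{0.41} \approx 0.93$$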

All is well so far. You are 93% sure that you do in fact have the disease. But what if the disease were much rarer? Let's assume in this case that the prevalence of the disease is 2%.
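With the same assumed 95% sensitivity and 5% false positive rate, but a 2% base rate:

$$P(D \mid +) = \frac{0.95 \cdot 0.02}{0.95 \cdot 0.02 + 0.05 \cdot 0.98} = \frac{0.019}{0.068} \approx 0.28$$

That is just under 30%.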

Only around 30% now. Still scary, because the expected cost of a really bad disease is unfavorable even at a low probability, but in the real world the tests are more accurate and the diseases rarer, so you (probably) get to live another day. On a side note, I had tremendous difficulty explaining this to people who got a positive PCR test result for the virus that causes Covid when the prevalence of the disease was well below 0.3%; at that base rate a large chunk of the positive tests were false positives, and without a second positive test, no conclusions should have been jumped to.

Just for fun, let's see what the base rate would have to be for P(A|B) to converge to P(B|A).
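Here is a minimal sketch of that sweep; the 95% sensitivity and 5% false positive rate are the same assumptions as above, and the plotting details are just one way to do it:

```python
# Sweep the base rate and see how P(disease | positive test) behaves
# for a test with a 5% false positive rate and an assumed 95% sensitivity.
import numpy as np
import matplotlib.pyplot as plt

sensitivity = 0.95           # P(positive | disease), assumed
false_positive_rate = 0.05   # P(positive | no disease)

base_rate = np.linspace(0.001, 1.0, 500)  # P(disease), i.e. the prevalence

# Bayes' theorem: P(disease | positive)
posterior = (sensitivity * base_rate) / (
    sensitivity * base_rate + false_positive_rate * (1 - base_rate)
)

plt.plot(base_rate, posterior, label="P(disease | positive test)")
plt.plot(base_rate, base_rate, "--", label="base rate itself")
plt.xlabel("Base rate P(disease)")
plt.ylabel("Posterior probability")
plt.legend()
plt.show()
```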

Playing around with the code is encouraged.

As is evident, for a 95% accurate test you need a really high base rate before the two conditional probabilities converge.

Nonetheless, this is a classic case where the accuracy of a test leads you off the rails; the probability of what you are predicting actually being true matters just as much as the probability of your prediction being right. For data scientists there might not be much model-building insight to gain from knowing about the base rate fallacy, but knowing it certainly won't hurt during the EDA phase, especially if you are working with humanities and social science data, where it can bite you especially hard.

What now?

Let’s assume we are talking about binary classification for now. Abandon the accuracy dogma and embrace other metrics.

Accuracy

Use accuracy when your test set is balanced and the model's objectives are not disproportionately hurt by either false positives or false negatives. For example, if you are classifying images of everyday objects, accuracy is a good evaluation metric.
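In terms of the confusion-matrix counts (true positives TP, true negatives TN, false positives FP, false negatives FN):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$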

To expand on the 'unbalanced testing set': notice how in the security guard example the 'real world data' is an unbalanced testing set; there are far too many of one label compared to the other. It's obvious how that, on top of sneaky statistical wording, led our shop owner astray.

If you ever find yourself in the position of the junior data scientist from the meme at the beginning of the article, with a world-class performing model on your hands:

  1. Check your testing set; it's probably severely unbalanced (a quick sanity check is sketched after this list).
  2. Make sure you are not testing on the training data.
  3. Check the training set too; the model might be overfitting.
  4. Explore other metrics.
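A minimal sketch of points 1 and 2 with scikit-learn; the toy dataset and the 99/1 class split are just for illustration:

```python
# Illustrative only: generate an unbalanced toy dataset and split it properly.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)

# Point 1: check how unbalanced the labels are before celebrating 99% accuracy.
print(np.bincount(y) / len(y))

# Point 2: evaluate on X_test/y_test only, never on the data the model was fit on.
# stratify=y keeps the same class proportions in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```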

Precision

Use precision when the outcome is more sensitive to false positives than false negatives. Usually for classifying things related to people's tastes, where false positives might turn them off, such as wrong product recommendations.
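In other words, out of everything the model flagged as positive, the fraction that actually was positive:

$$\text{Precision} = \frac{TP}{TP + FP}$$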

Recall (Sensitivity)

Use recall when a false negative really hurts: security guards for shops, lifeguard robots, tests for diseases; you get the point.
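Out of everything that actually was positive, the fraction the model caught:

$$\text{Recall} = \frac{TP}{TP + FN}$$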

F1-score

Use the F1-score when both false positives and false negatives are unwanted. Trade-offs apply.
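The F1-score is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

To tie it back to the security guard example, here is a minimal sketch with scikit-learn; the simulation is mine and just mirrors the numbers from the story (a 1% robbery rate, guard B saying "yes" on a random 10% of days):

```python
# Guard A always says "no robbery"; guard B says "yes" on a random 10% of days.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
n_days = 10_000
y_true = rng.random(n_days) < 0.01        # the shop is robbed on ~1% of days
guard_a = np.zeros(n_days, dtype=bool)    # the optimist: never predicts a robbery
guard_b = rng.random(n_days) < 0.10       # random "yes" on 10% of days

for name, y_pred in [("Guard A", guard_a), ("Guard B", guard_b)]:
    print(
        name,
        f"accuracy={accuracy_score(y_true, y_pred):.3f}",
        f"precision={precision_score(y_true, y_pred, zero_division=0):.3f}",
        f"recall={recall_score(y_true, y_pred):.3f}",
        f"f1={f1_score(y_true, y_pred, zero_division=0):.3f}",
    )
```

Guard A's 99% accuracy comes with zero recall; guard B looks worse on accuracy but actually catches some of the robberies.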

Are those it?

No. But understanding and internalizing the different use cases of the metrics above is the first step. To delve further into classification metrics, this Wikipedia article is a great place to start.

Are those of us doing regression analysis safe?

No. Over-reliance on one metric, or on the wrong metric, is just as big a problem for regression as it is for classification.

There is much ink to be spilled on how R² (the coefficient of determination) is the "accuracy" of regression, and how it sits at the root of many a modeler's false confidence. But that's for another day.
