Plain English intro to your first AI project: why the idea of accuracy is dangerous and why 99% is probably not the number you need.
A large part of my job as an AI consultant is to disappoint people. Things would be much easier if I were a plumber or a taxi driver — no one asks you to upgrade a shower to an espresso machine or to drive from L.A. to Sydney through the Earth’s core — but with AI, people ask for such things all the time. The unofficial king of such questions is: “Can you make it 99% accurate?”
I can’t count the times I’ve watched people’s hearts sink after the inevitable answer, which is: “most likely not”. Some conversations end immediately, some continue in an effort to convince me that I should reconsider my stance, and a few eventually evolve into a discussion about what AI can actually do. This has become such a frequent exchange that I have decided to write out my rationale in order to (hopefully) inspire more productive conversations around AI.
You may say: “Wait, but so many people claim 99% accuracy in so many areas!” Except… they don’t, and it’s usually the media who misinterpret the results. Or they do, but they’re not telling the whole truth, such as companies overselling their AI products. I’m not saying all of them, but I insist I’m 99% accurate with that observation.
The requirement for a system to be 99% accurate has two major problems: “99%”, and “accurate”. In the following sections, I will explain why the idea of accuracy is dangerous and why 99% is probably not the number you need.
The “Accurate” Problem
How do you define the accuracy of an AI system? The textbook definition is simple: the percentage of correct predictions. Well, consider the following problems:
- predicting stock prices
- ordering search results
- detecting objects in a video
- recommending a product
- translating a document
- transcribing audio to text
These problems (and many more) share an interesting trait: they have no simple hit-or-miss solution. Imagine a system that predicts stock prices. It’s never exactly right, but it consistently predicts the prices within a margin of ±1 penny. Is it a correct prediction? Technically not. In fact, the system’s accuracy would be 0%. Yet, I suspect most users wouldn’t object to the 0% accuracy and would rather enjoy early retirement on their newly acquired private islands.
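To make the stock example concrete, here is a minimal Python sketch. The prices are invented for illustration, and `exact_match_accuracy` and `mean_absolute_error` are hypothetical helper names written for this article, not functions from any library.

```python
# Made-up prices in cents (integers avoid floating-point surprises).
actual_cents = [10050, 10120, 9980, 10240, 10310]
predicted_cents = [10051, 10119, 9981, 10239, 10311]  # each off by exactly 1 cent


def exact_match_accuracy(y_true, y_pred):
    """Share of predictions that hit the true value exactly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average size of the miss, in the same unit as the inputs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


accuracy = exact_match_accuracy(actual_cents, predicted_cents)
mae = mean_absolute_error(actual_cents, predicted_cents)

print(f"accuracy: {accuracy:.0%}")            # accuracy: 0%
print(f"mean absolute error: {mae} cent(s)")  # mean absolute error: 1.0 cent(s)
```

The same predictions score 0% on one metric and "off by a penny on average" on the other. Only the second number tells you whether the system is worth using.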
Or take a (silly) machine translator that would translate the French phrase “Je m’excuse” as “I excuse myself”. If you don’t speak French, it’s definitely better than nothing, and you would certainly prefer that translation over “I am excuse” or “Jee me execute”. At the same time, a language guru might still feel there’s room for improvement. Was our machine translator correct? It’s obviously more complicated than yes or no.
By definition, accuracy only makes sense when answers are black or white. For the problems listed above, more suitable metrics exist which reflect the “non-binary” nature of the predictions*. Here are some black-and-white problems where accuracy can be used:
- detecting proper names in text
- highlighting dangerous people at airports
- spotting fraudulent transactions
- diagnosing tumors from x-rays
- preventing spam emails
Let’s say you’re responsible for the development of a new AI security system at an airport. It should analyse camera streams in order to spot terrorists based on suspicious behaviour patterns. Every person at the airport is classified into one of two categories: terrorist, or non-terrorist. Would you buy a system that boasts 99.99999% accuracy at spotting terrorists? I can code such a system for you in less than one hour, but I warn you: you might find your airport on fire pretty soon. Since terrorists are extremely rare and accuracy is the percentage of all correct predictions, all I need to do is classify everybody as a non-terrorist, including the bad guys. Without any AI in place, I’ll make an overwhelming majority of correct predictions. Thank you very much, enclosed is my invoice.
Obviously, this is a stupid example and you would never buy it. What you imagined under “accuracy” wasn’t the percentage of all correct predictions, but the percentage of actual terrorists correctly identified. Fair enough, I can make you such a system in less than an hour, too — I’ll just flip the logic and classify everyone as a terrorist. This time, our system got even better, because it correctly identifies 100% of terrorists, exactly as you wanted! Only the queue at your security checkpoint is now a few hundred miles long and your security guys are all about to quit.
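Both of my one-hour "systems" can be scored in a few lines. This is a sketch with made-up numbers: I assume 10,000,000 passengers and a single terrorist, chosen only so that the arithmetic reproduces the 99.99999% figure. The helper `score` is hypothetical.

```python
# Assumed, made-up population: 10,000,000 passengers, 1 terrorist among them.
n_passengers = 10_000_000
n_terrorists = 1
n_innocent = n_passengers - n_terrorists


def score(caught, false_alarms):
    """Return (accuracy, share of terrorists caught) for a system that
    flags `caught` real terrorists and `false_alarms` innocent people."""
    missed = n_terrorists - caught
    correctly_cleared = n_innocent - false_alarms
    accuracy = (caught + correctly_cleared) / n_passengers
    caught_share = caught / (caught + missed)
    return accuracy, caught_share


# System 1: classify everybody as a non-terrorist.
acc_nobody, caught_nobody = score(caught=0, false_alarms=0)
# System 2: classify everybody as a terrorist.
acc_everybody, caught_everybody = score(caught=n_terrorists, false_alarms=n_innocent)

print(f"flag nobody:    accuracy {acc_nobody:.5%}, terrorists caught {caught_nobody:.0%}")
print(f"flag everybody: accuracy {acc_everybody:.5%}, terrorists caught {caught_everybody:.0%}")
```

Under these assumptions, the first system is almost perfectly accurate and catches nobody, while the second catches everybody and is almost perfectly inaccurate. Neither number on its own describes a system you would want.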
This is not an ideal situation either, so you might want to reduce the number of false positives. We can start with some common-sense rules, such as discarding toddlers, as toddlers typically follow their own evil agenda and largely neglect geopolitics. If simple rules are not enough, we can finally employ some sophisticated AI because now — and only now — it can make a difference and we’re able to measure that difference.
Bottom line: accuracy is useless. It’s not just that the right metric goes by a different name than you might have thought; it’s the concept of describing a system with a single percentage of correct hits that is fundamentally flawed. For most AI problems, you can’t even use anything that resembles “accuracy” by definition. And where you can, you’re much better off with recall (percentage of terrorists correctly identified) and precision (percentage of actual terrorists out of all people classified as terrorists).
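Here is how the two metrics fall out of simple counts. The detector and its numbers are entirely hypothetical (10 terrorists among 1,000,000 people, 9 of them flagged, plus 991 innocent people flagged by mistake), and `precision_recall` is a helper written for this sketch.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion counts.

    tp: terrorists flagged, fp: innocent people flagged, fn: terrorists missed.
    A metric is None when its denominator is zero (nothing to divide by).
    """
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Hypothetical detector on a made-up population of 1,000,000 people.
total = 1_000_000
tp, fn, fp = 9, 1, 991
tn = total - tp - fn - fp  # innocent people correctly left alone

precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / total

print(f"accuracy:  {accuracy:.1%}")   # accuracy:  99.9%
print(f"recall:    {recall:.0%}")     # recall:    90%
print(f"precision: {precision:.1%}")  # precision: 0.9%
```

With these invented counts, the detector is 99.9% accurate and finds 9 out of 10 terrorists, yet fewer than 1 in 100 of its alarms is real. Which of those numbers matters most depends on what a missed terrorist and a false alarm each cost you.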
The “99%” Problem
Now that you have the right metric for your task, it’s time to talk numbers. Most people would probably agree that 99% is a great result of pretty much anything. But my experience is that this requirement is seldom substantiated and is usually just an intuitive proxy for “really good”. If you’re a rational person like me, you probably feel the itch for additional arguments. Let me propose a few.
The first thing to understand about any result you get: it has been achieved on a certain dataset. The internet is full of sensational headlines like Google AI Claims 99% Accuracy in Metastatic Breast Cancer Detection. Let’s leave aside the fact that the metric the scientists reported has nothing to do with the accuracy of a real system**. What’s even more misleading is that it sounds as if the problem of breast cancer detection were solved. Which sounds great, until you read the paper, which says: “images were obtained (…) from 399 patients (…) [the proposed system] was developed by using 270 slides and evaluated on the remaining 129 slides” (emphasis mine). I’m not a professor of statistics, but this doesn’t sound like the world can finally move on. For better or worse, this is going to be the case for your AI project as well. Until it’s measured against real-world data, it’s always going to be a rough estimate of “what it could be like”, based on a limited dataset that you provided to your team.
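To get a feel for how little a small test set pins down, here is a sketch using the Wilson score interval, one standard way to put a range around a measured proportion. The input is an assumption made purely for illustration: a system that gets 128 of 129 test cases right. That is not what the paper reports; I only borrow the test-set size.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption for illustration only: 128 correct answers out of 129 test cases.
low, high = wilson_interval(128, 129)
print(f"observed: {128 / 129:.1%}, plausible range: {low:.1%} to {high:.1%}")
```

Under that assumption, the observed score rounds to 99%, but the lower end of the range lands near 96%. In other words, a test set of this size cannot tell a system that errs once in a hundred cases from one that errs four times as often.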
Another argument against 99% is, ehm, human nature. History has consistently shown that humans don’t perform well at agreeing with each other. Wherever you’re dependent on humans to teach your system, especially in highly subjective tasks such as mental health diagnosis, legal judgement, or macroeconomic analysis, you can bet your bottom dollar on a lot of controversy. And if humans cannot agree on the “right” answer, how can you expect anything close to 99% from a machine?
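Human disagreement can be measured before any model is built. One common choice (my pick for this sketch, and far from the only option) is Cohen's kappa, which corrects the raw agreement between two raters for the agreement they would reach by chance. The two clinicians and their labels below are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    Undefined (division by zero) if both raters always use one identical label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    return (observed - expected) / (1 - expected)


# Made-up diagnoses from two hypothetical clinicians for the same 10 patients.
rater_a = ["dep", "dep", "anx", "none", "dep", "anx", "none", "none", "dep", "anx"]
rater_b = ["dep", "anx", "anx", "none", "dep", "dep", "none", "dep", "dep", "anx"]

raw_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohens_kappa(rater_a, rater_b)
print(f"raw agreement: {raw_agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```

In this invented example the two experts agree on 7 of 10 patients. If their labels are the "truth" your system learns from, that level of agreement is a more realistic reference point than 99%.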
For machine learning practitioners, 99% is an unlucky number. You don’t see anybody celebrating it. Instead, you see people scratching their heads, staring melancholically at their screens, thinking: “Why me?” That’s because 99% often flags the dreaded problem of overfitting — simply put, when everything works great on your training data but sucks in practice.
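Overfitting in its purest form fits in a few lines. The sketch below is a deliberately silly toy of my own, not a real learning algorithm: the "model" memorises its training data, and the labels are coin flips, so there is nothing to learn in the first place.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable


def make_examples(n, start_id):
    """Each example is (unique id, coin-flip label): there is no pattern to learn."""
    return [(start_id + i, random.choice([0, 1])) for i in range(n)]


train = make_examples(1000, start_id=0)
test = make_examples(1000, start_id=1000)  # ids the model has never seen

# "Model": memorise every training example; for unseen ids,
# guess the most common training label.
memory = dict(train)
fallback = Counter(label for _, label in train).most_common(1)[0][0]


def predict(example_id):
    return memory.get(example_id, fallback)


def share_correct(examples):
    return sum(predict(x) == y for x, y in examples) / len(examples)


train_score = share_correct(train)
test_score = share_correct(test)
print(f"training data: {train_score:.0%} correct, unseen data: {test_score:.0%} correct")
```

The memoriser is perfect on the data it has seen and close to a coin toss on everything else. Real models overfit more subtly, but the symptom is the same: a suspiciously high score that disappears on new data.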
Don’t pull your expectations out of thin air. Instead, make sure you understand your baselines and ask the right questions:
- What is the current human performance?
- What can we achieve with a naïve solution?
- Is there a state-of-the-art solution we can follow?
- What is the inter-rater agreement? Do even humans agree on the correct answers?
- What are the costs of a wrong prediction?
- How can we design the user experience to acknowledge and communicate uncertainty?
If you’re responsible for an AI project, understanding your metrics is your most important homework. The good news is that you’re familiar with hundreds of metrics already — from sports betting odds through GDP to EBITDA and ROI. You only need to add one or two more to your toolkit. You cannot manage what you cannot measure, and there’s really no excuse.
This article will have the right impact if you:
- Find a legitimate metric instead of accuracy and never look back.
- Start thinking about AI expectations in a more structured way.
If you manage to impress your team with the knowledge acquired here, kindly send pictures of their happy faces to: firstname.lastname@example.org. I’ll feel rewarded.
Many thanks to Griffin Trent, John Lowe, Carlos Dreyfus and Eduardo Cerna for helping me shape this article.
* These might be, for example, Mean Absolute Error (problem 1 in the list above), Mean Average Precision (problems 2–4), BLEU (problem 5), and Word Error Rate (problem 6).
** The researchers only report AUC, or Area Under ROC Curve, which summarizes all possible behaviours of a classifier in a single number. Besides obviously not being accuracy, this metric is quite problematic itself, as explained here and here in great detail.