# Explaining Machine Learning to a 10-Year-Old: Confusion Matrix and Performance Metrics

## A simple, graphical explanation of confusion matrix, prevalence, accuracy, precision, recall

As a data scientist, my work is not only building AI products but also making their outcomes explainable to people who are not practitioners. This is why I decided to write a series of posts called “Explaining Machine Learning to a 10-Year-Old”: the idea is to keep things “as simple as possible, but not simpler” (I know Einstein didn’t actually say that but, hey, it’s a good quote anyway).

In this post, we will not go into detail about how a machine learning model works (we will see this topic in other posts). Rather, we will start from the end of the story: we will imagine that we already have a predictive model and see whether it’s a good or a bad model.

Since my 10-year-old nephew loves to go fishing, I will be using the metaphor of a fisherman (it turned out to inspire a useful, graphical representation of a confusion matrix).

# Tale of a fisherman

The fisherman holds a Ph.D. in Machine Learning, so he likes to call the pond his “test set” because this is the place in which he will test the effectiveness of his machine learning model.

This is what the pond looks like:

As you can see, the fisherman calls fishes “positives” and boots “negatives”. Why is that? Because data scientists call “positive” each item that answers “yes” to the question that they are asking. The question that the fisherman wants to ask his model is: “Is this item a fish?”.

Every machine learning model is meant to answer a simple yes/no question like this one.

# Confusion Matrix

Eventually, the fisherman throws his net. This is the result:

The net collected 16 items:

• 9 of them are actually fishes: these are called “True Positives” (meaning that the net made a correct guess about them);
• 7 of them are instead boots: these are called “False Positives” (meaning that the net believed them to be positives, but it was mistaken).

The net decided not to collect the remaining 84 items, judging them more likely to be boots than fishes. However:

• 68 of them are actually boots: these are called “True Negatives”;
• 16 of them are instead fishes: these are called “False Negatives”.
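The four counts can be arranged in code as a tiny confusion matrix. Here is a minimal sketch in plain Python, using the fisherman’s numbers from the story (the variable names are my own):

```python
# The four outcomes of the net's guesses, taken from the story above.
tp = 9   # fishes caught: predicted positive, actually positive
fp = 7   # boots caught: predicted positive, actually negative
fn = 16  # fishes left in the pond: predicted negative, actually positive
tn = 68  # boots left in the pond: predicted negative, actually negative

# Rows are the actual class, columns the predicted class.
confusion_matrix = [
    [tp, fn],  # actual positives (fishes)
    [fp, tn],  # actual negatives (boots)
]

total = tp + fp + fn + tn
print(total)  # 100 items in the pond
```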

# Performance Metrics

Good news: the four quantities contained in the confusion matrix are all you need to calculate performance metrics (PMs). There are tons of PMs (see here), but we will go through the four most important ones.

## 1. Prevalence

First things first: how many fishes are in the pond? Prevalence is the proportion of positives in the test set.

Strictly speaking, this is not a PM: it doesn’t depend on the model, only on the data. However, it’s the first thing the fisherman wants to know, because it serves as a benchmark: it’s the probability of catching a fish if you pick an item completely at random. If the prevalence were lower (say, 5%), catching a fish at random would be much harder. This should be kept in mind when looking at PMs.
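With the counts from the story, prevalence is a one-line computation; a minimal sketch in Python (variable names are my own):

```python
tp, fp, fn, tn = 9, 7, 16, 68  # counts from the fisherman's net

# Prevalence: the share of actual positives (fishes) in the whole test set.
prevalence = (tp + fn) / (tp + fp + fn + tn)
print(prevalence)  # 0.25 -> 25% of the items in the pond are fishes
```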

## 2. Accuracy

How good is the net at guessing whether an item is a fish or a boot? Accuracy counts how many items have been correctly classified by the model (that is, the number of true positives and true negatives divided by the total number of observations).

This is one of the most widely used PMs, but it can be tricky because it treats true positives and true negatives equally, while in the real world there is a difference between positives and negatives.

For instance, what if the net caught no items at all? It would still have great accuracy (75%), but the fisherman would starve.
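Both the net’s actual accuracy and the empty-net case can be checked in a few lines; a minimal sketch in Python, assuming the counts from the story:

```python
tp, fp, fn, tn = 9, 7, 16, 68  # counts from the fisherman's net
total = tp + fp + fn + tn

# Accuracy: fraction of items classified correctly by the net.
accuracy = (tp + tn) / total
print(accuracy)  # 0.77

# An empty net predicts "boot" for everything: all 75 boots become
# true negatives and all 25 fishes become false negatives.
empty_net_accuracy = (0 + 75) / total
print(empty_net_accuracy)  # 0.75 -- high accuracy, hungry fisherman
```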

## 3. Precision

How precise is the net in classifying fishes? Precision is the number of true positives compared to the number of items classified as positives. Note that if precision equals prevalence, the model is basically making random choices, so it’s useless. Precision has to be (substantially) greater than prevalence for the model to be useful.
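The same counts let us check that the net’s precision actually beats the prevalence; a minimal sketch in Python:

```python
tp, fp, fn, tn = 9, 7, 16, 68  # counts from the fisherman's net

# Precision: of the 16 items the net caught, how many are actually fishes?
precision = tp / (tp + fp)
print(precision)  # 0.5625 -> about 56%

# Sanity check: precision should beat prevalence, otherwise the net
# is no better than picking items at random.
prevalence = (tp + fn) / (tp + fp + fn + tn)
print(precision > prevalence)  # True: 56% > 25%, so the net is useful
```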

However, precision alone is not enough. Who needs a fishnet that is 100% precise if it can catch only a handful of fishes? The fisherman could brag about his super-precise model, but he would still be pretty hungry. This is where recall comes into play.

## 4. Recall

How many fishes is the net able to collect, out of all the fishes in the pond? Recall is the number of true positives compared to the total number of actual positives.
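Recall follows the same pattern; a minimal sketch in Python with the story’s counts:

```python
tp, fp, fn, tn = 9, 7, 16, 68  # counts from the fisherman's net

# Recall: of the 25 fishes in the pond, how many did the net collect?
recall = tp / (tp + fn)
print(recall)  # 0.36 -> the net caught 36% of the fishes
```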

Precision and recall are two sides of the same coin, and there is a trade-off between them. Typically, a small fishnet will be very effective at recognizing fishes (high precision) but will collect only a few of them (low recall). Enlarging the net will increase the number of fishes caught, but the net will be less precise in separating fishes from boots. So recall will rise, but precision will drop.

Which of the following pairs is better?

• Precision: 56%. Recall: 36%.
• Precision: 80%. Recall: 10%.

Well, I don’t know.

A bunch of questions should be addressed when choosing the best combination of precision and recall. How costly are false positives? How costly are false negatives? The answers depend heavily on the problem under consideration: if we are dealing with a business application, business knowledge is indispensable to answer these questions and make a choice.
