Published in Machine Growth

Intuition Behind Machine Learning Evaluation Metrics (Precision, Recall & Accuracy)

After several years working as a data scientist, I have noticed that a lot of young graduates, tech leads, managers and even CTOs have no idea how to evaluate the performance of ML models. When I present the performance metrics to them, they seem to know the metrics, and then they reply: "I will test your model with a few cases I have on hand, to see how good your model is." Wow! This is so unscientific! So I prepared this blog to help them quickly understand the evaluation metrics. And if you are having the same problem, I hope your managers, leads or CTOs get the chance to read this.

Now, I am going to introduce 4 primary-school terms: True, False, Positive & Negative. These 4 terms combine with each other and evolve into secondary-school terms: Precision, Recall and Accuracy. Sounds interesting?!

Let’s start with Positive and Negative. Positive and negative are neutral words, which means the 2 words can be used to represent any pair of opposite statuses. For example, a positive status could be booking, raining or accept; the corresponding negative status would be no booking, no raining or reject.

For the hotel search example in my blog, I represent booking as positive and no booking as negative.

Predicted searches with bookings and Ground truth searches with bookings

Let’s assume that I have trained an ML model to predict whether hotel searches from users will convert to bookings. If my trained model predicts that a hotel search on Christmas will convert to a booking, I mark this prediction as a Positive (Book) prediction. Similarly, if my trained model predicts that a hotel search on a weekday will not convert to a booking, I mark it as a Negative (No Book) prediction. Now, you might have a big question in your mind: “What if the predictions are wrong?”

True Positive, False Positive, True Negative and False Negative

To answer your question, here come the other 2 terms, True and False. When my positive prediction is correct, I label that prediction as True Positive. When the positive prediction is wrong, I label it as False Positive. The same thing goes for negative predictions. When the negative prediction is correct, I label the prediction as True Negative, and when the negative prediction is wrong, I label it as False Negative. So far so good, right? You have reached an important milestone. Give yourself a big clap! You have now graduated from the primary-school terms. Next, we will be moving on to the secondary-school terms: precision, recall and accuracy. Excited?!
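If you prefer to see the four terms in code, here is a minimal sketch of the labelling described above. The two lists are made-up searches (not real hotel data), and the helper name `label_outcome` is my own invention for illustration; `True` stands for booking and `False` for no booking.

```python
from collections import Counter

# Hypothetical data: True = booking (positive), False = no booking (negative).
ground_truth = [True, True, True, False, False, False, False, True]
predictions = [True, True, False, True, False, False, False, False]


def label_outcome(predicted: bool, actual: bool) -> str:
    """Name one prediction using the four primary-school terms."""
    # "True"/"False" says whether the prediction matches the ground truth.
    correctness = "True" if predicted == actual else "False"
    # "Positive"/"Negative" says what the model predicted.
    polarity = "Positive" if predicted else "Negative"
    return f"{correctness} {polarity}"


# Count how often each of the four labels occurs.
outcomes = Counter(label_outcome(p, a) for p, a in zip(predictions, ground_truth))

for name in ("True Positive", "False Positive", "True Negative", "False Negative"):
    print(name, outcomes[name])
```

Notice that Positive/Negative always comes from the prediction, while True/False comes from comparing the prediction with the ground truth.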

Let’s start our first class in secondary school: Mathematics.

Precision = True Positive / (True Positive + False Positive)

Recall = True Positive / (True Positive + False Negative)

Accuracy = (True Positive + True Negative) / (True Positive + False Positive + True Negative + False Negative)
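To make the three formulas concrete, here is a small sketch that turns them into Python functions. The counts at the bottom are made-up numbers, and returning 0.0 when a denominator is zero is just one convention I picked for this sketch, not something the formulas themselves dictate.

```python
def precision(tp: int, fp: int) -> float:
    """True Positive / (True Positive + False Positive)."""
    denominator = tp + fp
    # Assumption: report 0.0 when the model made no positive predictions.
    return tp / denominator if denominator else 0.0


def recall(tp: int, fn: int) -> float:
    """True Positive / (True Positive + False Negative)."""
    denominator = tp + fn
    # Assumption: report 0.0 when there are no ground truth positives.
    return tp / denominator if denominator else 0.0


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """(True Positive + True Negative) / all predictions."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts for 100 hotel searches.
tp, fp, tn, fn = 30, 10, 40, 20

print("Precision:", precision(tp, fp))        # 30 / (30 + 10)
print("Recall:", recall(tp, fn))              # 30 / (30 + 20)
print("Accuracy:", accuracy(tp, fp, tn, fn))  # (30 + 40) / 100
```

With these counts the three metrics come out different from each other, which is why it is worth knowing what each one measures.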

Mathematically, these are the formulas. But what are the intuitions behind these formulas?

Graphical Precision Explanation

From the precision formula, we notice that the denominator is (True Positive + False Positive). Ahha! (True Positive + False Positive) are the predicted searches with bookings, also known as predicted positive searches. If we think deeply about this formula, precision actually tells us the percentage of correctly predicted positive searches out of all predicted positive (correct + wrong) searches.

In other words, precision tells us what fraction of the predicted searches with bookings are actually correct.

Try to digest this first, because the concept of recall is very similar to precision. After that, let us proceed with recall.

Graphical recall explanation

From the recall formula, we again look at the denominator (True Positive + False Negative). Ahha!!! (True Positive + False Negative) are the searches with ground truth bookings, also known as ground truth positive searches. Let us refresh our memory of False Negative. False negative searches are searches that have bookings in the real world (ground truth) but were predicted as searches with no booking (negative searches), so the predictions are wrong (False Negative). So the picture becomes clearer and clearer: recall actually tells us the percentage of correctly predicted positive searches out of all ground truth searches with bookings.

In other words, recall tells us what fraction of the ground truth searches with bookings have been spotted/recognised by the trained model.
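One way to picture precision and recall side by side is as two overlapping groups of searches. Here is a rough sketch using Python sets. The search IDs (`s1` to `s7`) are invented for illustration, and this is only one way to look at it, assuming each search has a unique ID.

```python
# Hypothetical search IDs.
predicted_bookings = {"s1", "s2", "s3", "s4"}     # predicted positive searches
actual_bookings = {"s3", "s4", "s5", "s6", "s7"}  # ground truth positive searches

# The overlap of the two groups is the True Positives.
true_positives = predicted_bookings & actual_bookings
# Predicted as booking but did not book: False Positives.
false_positives = predicted_bookings - actual_bookings
# Booked but the model missed them: False Negatives.
false_negatives = actual_bookings - predicted_bookings

# Same numerator, different denominators.
precision = len(true_positives) / len(predicted_bookings)  # 2 out of 4 predicted
recall = len(true_positives) / len(actual_bookings)        # 2 out of 5 actual

print("Precision:", precision)
print("Recall:", recall)
```

The numerator is the same overlap in both cases. Only the group we divide by changes: all predicted positives for precision, all ground truth positives for recall.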

Finally, we reach accuracy. This is the simplest, as it tells us what fraction of all the searches are predicted correctly, counting both ground truth searches with bookings and ground truth searches without bookings.
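Accuracy can also be checked without naming the four terms at all: just count how often the prediction matches the ground truth. The sketch below, on made-up data, does it both ways and confirms that the two agree.

```python
# Hypothetical data: True = booking, False = no booking.
ground_truth = [True, False, False, True, False, False, False, True, False, False]
predictions = [True, False, True, False, False, False, False, True, False, False]

pairs = list(zip(predictions, ground_truth))

# Way 1: fraction of searches where the prediction matches the ground truth.
matches = [p == a for p, a in pairs]
accuracy_direct = sum(matches) / len(matches)

# Way 2: (True Positive + True Negative) / everything.
tp = sum(1 for p, a in pairs if p and a)
tn = sum(1 for p, a in pairs if not p and not a)
accuracy_formula = (tp + tn) / len(pairs)

print("Direct:", accuracy_direct)
print("Formula:", accuracy_formula)
```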

Hopefully my explanation of precision, recall and accuracy gives you a clearer view of the intuitions behind these metrics. Depending on the business problem, you may choose different metrics to evaluate your ML models. And, at the end of my blog, I would like to rephrase the formulas so that everyone reading my blog can remember these metrics easily:

Precision = True Positive / (All predicted as Positive)

Recall = True Positive / (All ground truth Positive)

Accuracy = (True Positive + True Negative) / (Everything)




Alex Yeo
