ML Core Concepts: Precision and Recall

Emily Strong
Published in The Data Nerd
2 min read · Jul 11, 2022

At every data science job interview I have ever had, from grad school internships to senior data scientist roles, I have been asked to define precision and recall.

The question is a perennial job interview favorite because these two classification metrics are so commonly used in industry. They are easy for business users to grasp and are agnostic to the particulars of the use case. Even working on recommender systems, we use a slight variation in the form of Precision@K and Recall@K.

So, how do we define these two terms?

Retrieved and relevant elements

Consider the above visualization. We have a field of relevant items and irrelevant items, and the model retrieves some of both. The relevant items that were retrieved are true positives. The irrelevant items that were retrieved are false positives, or Type I errors. The relevant items that were not retrieved are false negatives, or Type II errors.

Recall measures how well the model identifies all of the actual positive items. Another way to phrase that: of all of the actual positives, how many does the model predict as positive? It is also known as sensitivity or the true positive rate, reflecting how sensitive the model is to the true positives. In our visualization of retrieved versus relevant items, this is the proportion of relevant items that are retrieved, or TP / (TP + FN).

Precision measures how well the model identifies only true positives. In other words, how many of the positive predictions are correct? This is also known as the positive predictive value. In our visualization above, this is the proportion of retrieved items that are relevant, or TP / (TP + FP).

A closely related concept is specificity, the true negative rate: the proportion of actual negatives that the model correctly identifies, or TN / (TN + FP). When both Type I and Type II errors are low, recall, precision, and specificity are all high.
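To make these definitions concrete, here is a minimal sketch in Python, assuming a binary classification task with made-up labels and using scikit-learn only to build the confusion matrix:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)       # sensitivity / true positive rate: 0.75
precision = tp / (tp + fp)    # positive predictive value: 0.75
specificity = tn / (tn + fp)  # true negative rate: ~0.83

print(recall, precision, specificity)

(scikit-learn's recall_score and precision_score compute the first two directly from the label arrays as well.)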

There are times when you might have very low precision and high recall, or vice versa. A useful metric in these situations is the F1 score, the harmonic mean of precision and recall, which summarizes the model's performance in a single number. It is calculated as:

2 × (Precision × Recall) / (Precision + Recall)

or

2TP / (2TP + FP + FN)
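As a quick sanity check, here is a small sketch with made-up counts (TP = 3, FP = 1, FN = 1) showing that the two forms of the formula give the same number:

tp, fp, fn = 3, 1, 1

precision = tp / (tp + fp)  # 0.75
recall = tp / (tp + fn)     # 0.75

f1_from_pr = 2 * (precision * recall) / (precision + recall)
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f1_from_pr, f1_from_counts)  # both 0.75

(scikit-learn's f1_score computes the same quantity directly from label arrays.)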

Having one metric be low while the other is high often occurs when dealing with imbalanced classes, such as in anomaly detection. Because it gives a more holistic picture of model performance, the F1 score is often used as an alternative to accuracy, in combination with precision and recall.
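Circling back to the recommender-system variants mentioned at the start: Precision@K and Recall@K apply the same ideas to the top K items of a ranked recommendation list. Here is a minimal sketch for a single user, where the function names and example data are my own, for illustration only:

def precision_at_k(recommended, relevant, k):
    # Fraction of the top-k recommendations that are relevant.
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    # Fraction of all relevant items that appear in the top-k recommendations.
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

recommended = ["a", "b", "c", "d", "e", "f"]  # ranked model output
relevant = {"a", "c", "e", "g"}               # items the user actually engaged with

print(precision_at_k(recommended, relevant, k=5))  # 3/5 = 0.6
print(recall_at_k(recommended, relevant, k=5))     # 3/4 = 0.75

In practice these are computed per user (or per query) and then averaged.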

Classification metrics and other key concepts for working with models in real-world settings are covered in my Machine Learning Flashcards: Modeling Core Concepts deck. Check it out on Etsy!

Emily Strong is a senior data scientist and data science writer, and the creator of the MABWiser open-source bandit library. https://linktr.ee/thedatanerd