Precision Versus Recall — Essential Metrics in Machine Learning

In machine learning, precision and recall are metrics used to evaluate how well a model performs. This article explains what they are in detail.

Hrvoje Smolic · Geek Culture · 12 min read · Sep 16, 2022

Image by the Author

In machine learning, precision and recall are metrics used to evaluate how well a model performs. This article explains what they are and answers the popular “precision versus recall” question.

We’ll go over how to calculate precision and recall, walk through specific examples of each, and explain why you should use both to evaluate your model’s performance.

Ideally, you would use the most accurate machine learning model available when predicting an outcome, so that your model makes as few mistakes as possible.

Still, some of your AI model’s mistakes will have a more significant impact than others. It is not always in your best interest simply to make the fewest mistakes overall; sometimes, it is vital to know where your model is wrong.

Or, to put it differently, to understand which mistakes you can live with and which mistakes you want to minimize.

Let’s dive deeper.

Every machine learning model is wrong sometimes, and that is a fact. Imagine that our objective is to develop a model that predicts the presence of cervical cancer as reliably as possible.

The obvious goal is to:

  • Identify as high a percentage of the cancer cases as possible
  • with as few “false positives” as possible (predicting cancer when there is, in fact, no cancer)

It is far preferable not to overlook anyone with cancer, even if that means flagging some patients who do not have the disease as having cancer.

And that is precisely the difference between focusing on precision versus recall as a machine learning model metric.

How to Read a Confusion Matrix

A confusion matrix is a performance measurement for machine learning classification problems. When the output can be one of two classes (Cancer YES/NO, Upsell YES/NO, Customer Converted TRUE/FALSE, etc.), the confusion matrix is a table with four different combinations of predicted and actual values.

Imagine we have thousands of rows of medical data from patients labeled as ‘Cancer YES / NO.’

We also have 1848 rows of test data, so we can test how well the model performs. The results are organized in the confusion matrix:

Image by the Author
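If you want to build such a table yourself, here is a minimal sketch using scikit-learn’s confusion_matrix; the y_true and y_pred arrays below are made-up stand-ins for real labels and model predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and model predictions (1 = Cancer YES, 0 = Cancer NO)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() unpacks the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```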

In order to choose a suitable machine learning model and make informed decisions based on its predictions, it is essential to understand different measures of relevance.

Precision is the percentage of correctly predicted positive instances relative to the total number of predicted positive cases.

In contrast, recall is the percentage of correctly predicted positive instances relative to the total number of actual positive (i.e., relevant) cases.

High precision and high recall mean that your model is performing well.

Precision-Recall Formula

Image by the Author

Precision — Out of all the examples that were predicted as positive, how many are actually positive?

Recall — Out of all the positive examples, how many were predicted as positive?

Image by the Author
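In practice, you rarely compute these by hand; scikit-learn ships ready-made helpers. A minimal sketch, reusing the hypothetical labels from the confusion matrix example above:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predictions (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 4 = 0.75
```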

What Is Precision In Machine Learning?

Precision is an essential factor to consider when assessing the performance of a machine-learning model. It is defined as the proportion of true positives to all positive predictions, that is, true positives plus false positives.

Out of all patients that were predicted as having cancer, how many actually do have cancer?

Low precision means that our machine learning model will predict some FP — false positives. It will label some patients as having cancer when, in fact, they don’t. It’s not ideal, but that type of error is not life-threatening.

The precision is a measure of how many detected items are truly relevant.

For example, imagine you’re trying to predict whether or not a patient has diabetes. If you only test for diabetes once, there’s a chance of a misdiagnosis: the patient may not actually have diabetes if they happened to test at low blood sugar levels that day. However, their results may appear similar enough to patients with diabetes that your test will deliver the diagnosis.

But if you test twice, once when the patient has high blood sugar levels and once when the patient has low blood sugar levels, you will get an average result that leads to a correct diagnosis.

The same principle applies to other types of predictions as well. If you’re trying to predict a person’s height, for example, testing only once might lead you to an incorrect prediction because the person may be wearing thick-soled shoes at the time of measurement.

But if you measure multiple times, with the person wearing various footwear, it becomes easier to find an average that is more accurate than any single measurement.

Machine Learning Recall Definition

Recall is another critical measure of machine learning success. It’s a way to gauge how many correct items were found compared to how many were actually there.

Out of all the patients that do have cancer, how many were predicted correctly?

Low recall means that our machine learning model will predict some FN — false negatives. It will label some patients who really do have cancer as not being sick. That type of error is life-threatening.

For another example, suppose you want to evaluate 500 pictures to determine how many have a cat in them. You will likely miss some because the cat is hidden in the background or too small to detect. In this case, your recall rate is lower than your precision rate.

Recall is a measure of how well you can find the items that you’re looking for.

It’s not always possible to find every single item or data point, so a 100% recall rate is rare. But you want your percentage to be as high as possible.

What Is the Difference Between Precision and Accuracy?

Accuracy vs. precision is a common topic in the field of machine learning. It can confuse new data scientists, so we’ll break it down for you.

When training your model, you need to decide on your loss function and how much weight you want to place on each type of error.

  • If your model is inaccurate, it won’t deliver correct predictions for any test set data.
  • If your model is precise but not accurate, its predictions will be consistent but wrong: it may give an answer that is always true or always false, yet not accurate enough to be useful in practice.

You should also consider whether you’re interested in performing inference (making predictions on new data) or only in predicting outcomes based on past data. If you want to make inferences on new data, you should also consider how much precision is appropriate for those predictions and what kind of error rate would be acceptable in practice.

If you are only interested in making predictions based on past data, think about what kind of error rate is acceptable. If there’s a chance that your model will make mistakes when predicting the future, then it can help to know what kind of errors those might be.

For example, if your model predicts that a person has cancer with 100% certainty (based on their symptoms and medical history), but some of those patients don’t actually have cancer, this would be an unacceptable error rate.

Machine Learning Accuracy

Image by the Author

For the following examples, consider the confusion matrix:

Correct Predictions

202 correct predictions in total out of 262 test rows. This gives a model accuracy of 77.1%.

True Positives (TP) = 65: a row was 1, and the model predicted a 1 class for it.

True Negatives (TN) = 137: a row was 0, and the model predicted a 0 class for it.

Errors

60 errors in total out of 262 test rows, or 22.9%.

False Positives (FP) = 29: a row was 0, and the model predicted a 1 class for it.

False Negatives (FN) = 31: a row was 1, and the model predicted a 0 class for it.

The simple formula for accuracy is as follows:

Image by the Author

Accuracy = (TP + TN) / Total.

Of all the classes (positive and negative), we predicted 77.10% correctly.

Accuracy should be as high as possible.

From all the patients (with or without cancer), how many of them have we predicted correctly?
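To make the arithmetic concrete, here is a small Python sketch that reproduces the accuracy figure from the four counts above:

```python
# Counts taken from the confusion matrix above
tp, tn, fp, fn = 65, 137, 29, 31

total = tp + tn + fp + fn           # 262 test rows
accuracy = (tp + tn) / total        # 202 correct predictions
print(f"Accuracy: {accuracy:.1%}")  # Accuracy: 77.1%
```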

How to Calculate Precision in Machine Learning?

Precision, also known as a positive predictive value, measures how well a classifier predicts the positive class.

It is calculated as the number of true positives divided by the total number of positive predictions (true positives plus false positives). This value ranges from 0 to 1; a higher score is better.

Precision = TP / (TP + FP).

Of all the classes we predicted as positive, 69.15% are actually positive.

Precision should be as high as possible.
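Continuing with the same confusion matrix counts, the 69.15% figure falls out directly:

```python
tp, fp = 65, 29  # from the confusion matrix above

precision = tp / (tp + fp)            # 65 / 94
print(f"Precision: {precision:.2%}")  # Precision: 69.15%
```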

How to Calculate Recall in Machine Learning?

Recall, also known as sensitivity, is calculated by dividing the number of positive samples classified correctly as positive by the total number of positive samples.

It measures a model’s ability to detect positives: the higher its recall, the more positives are detected.

To calculate recall, you must first determine what constitutes a “positive” sample. For example, if you’re working with medical data, you can define “positives” as patients who have been diagnosed with a particular disease or condition. Then, use your model to classify those patients as either “positive” or “negative” based on the information provided by your algorithm.

After that, count how many patients were correctly identified as having been diagnosed with that disease or condition. Finally, divide that number by all of the patients whose status was known for sure (i.e., those who were definitely diagnosed).

This will give you an accurate estimate of how well your algorithm detected positives among all possible positives (i.e., all patients).

Recall = TP / (TP + FN).

Of all the positive classes, we predicted 67.71% correctly.

Recall should be as high as possible.
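And again with the same counts:

```python
tp, fn = 65, 31  # from the confusion matrix above

recall = tp / (tp + fn)         # 65 / 96
print(f"Recall: {recall:.2%}")  # Recall: 67.71%
```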

What Are Unbalanced Datasets?

Unbalanced datasets are datasets in which one class has a significantly larger proportion of observations than the others; the target variable is dominated by one class. For example, if you have 1 million items and your target variable is gender, there may be more than 500,000 items marked as male but only about 1,000 marked as female.

That can be problematic because many machine learning algorithms implicitly assume that the classes are roughly balanced, so they won’t work well on the minority class when it is time to make predictions on your data set.

Here are a few things you can do to fix an unbalanced data set:

Resampling (Oversampling or Undersampling)

With resampling, you can create a new data set by randomly selecting items from your original data set. You can do this by over- or undersampling specific classes.

For example, suppose your training dataset is unbalanced because there are more male than female customers in your database. In that case, you could oversample females to even out the distribution of observations between males and females.

This will help improve the performance of your model. You can also choose to undersample males, which has the opposite effect: it reduces the number of majority-class observations in your training set.

You could use sampling with replacement or without replacement. Note that sampling with replacement can introduce bias into your data set. If you want to avoid this problem, it’s best to use sampling without replacement.

Oversampling can also involve generating synthetic data so that the training set becomes a more balanced representation of the population. Collecting samples from a broader range of sources (for example, different parts of the world) can significantly reduce bias in your data.

Another way to improve diversity in your data set is by using low-resolution images and then upsampling them. This will ensure that you have a wide variety of samples at different resolutions so your model can learn how to evaluate any type of image, not just high-resolution ones.
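As one concrete illustration of random oversampling, here is a sketch using scikit-learn’s resample; the tiny DataFrame and its label column are hypothetical:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical unbalanced dataset: eight negatives (0), two positives (1)
df = pd.DataFrame({
    "feature": range(10),
    "label":   [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Oversample the minority class with replacement until the class counts match
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42,
)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["label"].value_counts())  # both classes now have 8 rows
```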

Ensembling Methods

Ensemble methods leverage multiple learning algorithms and techniques to obtain better results than any single approach.

This is done by combining the predictions of different algorithms into a single prediction. For example, you can train five different classifiers in different ways and use them to predict which category an image belongs to.

The final prediction would be a weighted average of each model’s prediction with more weight given to the most accurate models.

Ensemble methods can be very powerful, but they are also more complex than other techniques and require a lot of data. Voting and averaging are two of the simplest ways to implement ensemble methods, while stacking is a complex technique that involves combining models using another algorithm.
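To make the voting idea concrete, here is a minimal sketch with scikit-learn’s VotingClassifier; the three base models and the toy dataset are arbitrary choices for illustration, not a specific recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy binary classification data
X, y = make_classification(n_samples=500, random_state=0)

# Combine three differently trained classifiers into one soft-voting ensemble
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("forest", RandomForestClassifier(random_state=0)),
    ],
    voting="soft",  # average predicted probabilities instead of counting hard votes
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```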

F1 Score in Machine Learning

The F1 score is a powerful way of measuring a model’s performance. It combines two metrics: precision and recall.

It is calculated as follows:

Image by the Author

F1 score = 2 * (Precision * Recall) / (Precision + Recall).

The F1 score is 68.42%. It measures precision and recall at the same time. You cannot have a high F1 score without a strong model underneath.
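Once precision and recall are in hand, the F1 score is one line of arithmetic; using the counts from the confusion matrix above:

```python
tp, fp, fn = 65, 29, 31  # from the confusion matrix above

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 score: {f1:.2%}")  # F1 score: 68.42%
```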

Let’s repeat what we’ve learned so far.

Precision is the percentage of relevant results your model returns. It measures how accurate your model is at identifying the correct answer instead of returning any result that could be considered a match to what you’re looking for.

Recall, on the other hand, measures how many of the relevant results your search actually returned. It tells you whether your model found everything it should have.

The F1 score is the harmonic mean of precision and recall: it combines both into a single number, and it is high only when both precision and recall are high. The resulting value gives you an idea of how closely your model matches what was searched for.

The F1 score is a good metric for evaluating search results because it gives you an easy way to compare different models. If you have two models with very similar precision versus recall scores, it’s hard to say which one is better.

With the F1 score, you can compare them using a single metric that weighs their respective scores equally. Because it accounts for both false positives and false negatives, either kind of false prediction will noticeably drag the score down.

The best way to use the F1 score is to compare your results against a baseline model. If you’re trying to improve your model’s performance, comparing it with a baseline model with an F1 score of 0.5 or higher will help you see how much better (or worse) it performs.

You can also use the F1 score as part of a statistical hypothesis test to determine whether your improvements can make a difference in real-world use cases.

Precision versus Recall: The Bottom Line

Precision is one of the most critical concepts in machine learning because it determines how accurately a classifier or predictor identifies the things that are relevant to its task.

Recall is likewise essential; when trying to make sense of data, you want to know whether the classifier or predictor is identifying all the relevant information.

When using these concepts, it’s essential to be aware of their limitations. Precision and recall aren’t absolute numbers; they are measurements relative to a given data set and can change depending on what other information is available. They don’t consider any user-specific factors like demographics or location, either.

In conclusion, precision and recall are two key concepts for understanding how machine learning works. They help you understand how well your classifiers identify patterns in data, and they help you improve those models by making sure they recognize all relevant information while ignoring irrelevant details.

Now that you have a better grasp of these essential machine learning metrics, you can better interpret your prediction models.

Like the content? Let’s connect.

If you believe this article is worth sharing, spread the word and help others discover its value.

Fun tip: Try clicking the clap button for the magic to happen! ❤️

You can get in touch with me on LinkedIn.

Hrvoje Smolic is the CEO and Founder at Graphite Note, building the world’s easiest-to-use no-code machine learning platform: https://graphite-note.com/