Evaluating Machine Learning Model Performance: Accuracy, Precision, Recall, and F1 Score

amir mirsaeid
3 min read · Jul 1, 2023

Let’s imagine a scenario where the task is to classify whether a person is pregnant or not. If a pregnancy test is positive, it indicates that the person is pregnant; if the test is negative, the person is not pregnant.

In the context of the classification task, there are four important categories to consider:

True Positive (TP)

This refers to a person who is actually pregnant (positive) and is correctly classified as pregnant (positive) by the machine learning algorithm. In other words, the algorithm correctly identifies the positive case.

True Negative (TN)

This refers to a person who is actually not pregnant (negative) and is correctly classified as not pregnant (negative) by the machine learning algorithm. In this case, the algorithm correctly identifies the negative case.

False Positive (FP)

This refers to a scenario where the machine learning algorithm incorrectly classifies a person as positive (pregnant) when they are actually negative (not pregnant). In other words, the algorithm wrongly identifies a negative case as positive.

False Negative (FN)

This refers to a scenario where the machine learning algorithm incorrectly classifies a person as negative (not pregnant) when they are actually positive (pregnant). In other words, the algorithm wrongly identifies a positive case as negative.

Confusion Matrix

A confusion matrix is a tabular representation that summarizes the performance of a machine learning classification model. It provides a detailed breakdown of the model’s predictions and the actual class labels of the data. The matrix is typically a 2x2 table for binary classification tasks, but it can be extended for multi-class problems.
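As a quick sketch, here is how the four counts above can be read off a confusion matrix with scikit-learn. The labels below are made up purely for illustration (1 = pregnant, 0 = not pregnant):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and model predictions
# (1 = pregnant, 0 = not pregnant)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=4, FP=2, FN=1
```

The same counts are reused in the metric examples that follow.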

What is Accuracy in Machine Learning?

Accuracy is a widely used metric for evaluating classification models. It measures the overall correctness of predictions made by the model. Accuracy is calculated by dividing the number of correct predictions by the total number of predictions. It provides a general overview of the model’s performance by giving the percentage of correctly classified instances.

Accuracy = (Number of Correct Predictions) / (Total Number of Predictions) = (TP + TN) / (TP + TN + FP + FN)
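With the hypothetical counts from the confusion matrix example above (TP = 3, TN = 4, FP = 2, FN = 1), accuracy works out like this:

```python
from sklearn.metrics import accuracy_score

# Same toy labels as in the confusion matrix example
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, tn, fp, fn = 3, 4, 2, 1
print((tp + tn) / (tp + tn + fp + fn))  # 0.7
print(accuracy_score(y_true, y_pred))   # 0.7 -- same result
```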

However, accuracy is not always the best measure for evaluating the performance of a machine learning model, especially when the dataset is imbalanced or the costs of different types of errors differ significantly. Accuracy alone may not provide a comprehensive understanding of the model’s effectiveness.
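A small illustration of this pitfall, using made-up labels where only 1 case in 20 is positive: a model that always predicts “not pregnant” still scores 95% accuracy while never catching the one positive case.

```python
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced dataset: 1 positive case out of 20
y_true = [1] + [0] * 19
y_pred = [0] * 20  # a trivial model that always predicts "not pregnant"

# 95% accuracy, despite missing the one case that matters
print(accuracy_score(y_true, y_pred))  # 0.95
```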

The choice of the best evaluation measure depends on the specific problem and the goals of the application. Different evaluation metrics focus on different aspects of the model’s performance.

Precision: It measures the proportion of true positive predictions among all positive predictions. Precision focuses on the correctness of positive predictions and is useful when the cost of false positives is high. It is calculated as:

Precision = True Positives / (True Positives + False Positives)
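Reusing the toy counts from the confusion matrix example above (TP = 3, FP = 2), a minimal sketch:

```python
from sklearn.metrics import precision_score

# Same toy labels as in the confusion matrix example
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp = 3, 2
print(tp / (tp + fp))                   # 0.6
print(precision_score(y_true, y_pred))  # 0.6 -- same result
```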

Recall (Sensitivity or True Positive Rate): It measures the proportion of true positive predictions among all actual positive instances. Recall focuses on the model’s ability to identify positive instances correctly and is useful when the cost of false negatives is high. It is calculated as:

Recall = True Positives / (True Positives + False Negatives)
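And the same sketch for recall, with the toy counts TP = 3 and FN = 1:

```python
from sklearn.metrics import recall_score

# Same toy labels as in the confusion matrix example
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fn = 3, 1
print(tp / (tp + fn))                # 0.75
print(recall_score(y_true, y_pred))  # 0.75 -- same result
```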

F1 Score: It is the harmonic mean of precision and recall, providing a balanced measure that considers both metrics. The F1 score combines precision and recall into a single value, giving equal importance to both metrics. It is calculated as:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
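Plugging in the precision (0.6) and recall (0.75) from the toy examples above:

```python
from sklearn.metrics import f1_score

# Same toy labels as in the confusion matrix example
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall = 0.6, 0.75
print(2 * (precision * recall) / (precision + recall))  # ~0.667
print(f1_score(y_true, y_pred))                         # ~0.667 -- same result
```

Because the harmonic mean penalizes imbalance between the two inputs, the F1 score (≈0.667) sits closer to the lower of the two values than a simple average would.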

Each metric provides unique insights into different aspects of the model’s performance.
