Mastering Classification Metrics: A Beginner's Guide [Part 1: Accuracy, Precision, and Recall]

Prateek Gaurav
12 min read · Mar 30, 2023


Chapter 1: “Understanding Basic Classification Metrics: Accuracy, Precision, and Recall”

Chapter 2: “Balancing Precision and Recall: F1, F0.5, and F2 Scores Explained”
Chapter 3: “Evaluating Imbalanced Data: The Importance of ROC-AUC Curves”
Colab File: Colab File on Github
Dataset: SPAM Classification and Breast Cancer Classification

1. Introduction

Welcome to the first article in our series on Mastering Classification Metrics! This article, “Understanding Basic Classification Metrics: Accuracy, Precision, and Recall,” aims to help both beginners and experienced professionals gain a solid understanding of the foundational evaluation metrics used in classification problems. We will dive into the concepts of accuracy, precision, and recall, providing clear explanations and practical examples to illustrate their importance in the field of Data Science.

Before diving into Accuracy, Precision, and Recall, I want to use this first article to introduce the main classification metrics and compare them with one another:

  1. Accuracy vs. Precision vs. Recall: Accuracy measures the proportion of correct predictions among the total number of predictions. It’s a widely used metric but may not be suitable when the classes are imbalanced or when false positives and false negatives have different costs. Precision measures the proportion of true positive predictions among all positive predictions, while Recall measures the proportion of true positive predictions among all actual positive instances. Precision is crucial when false positives are costly, and Recall is vital when false negatives are costly. These metrics are more informative than accuracy in many situations.
  2. F1 Score vs. Accuracy, Precision, and Recall: The F1 Score is the harmonic mean of Precision and Recall, providing a single metric that balances the two. It is particularly useful when you want to consider both Precision and Recall equally important or when dealing with imbalanced datasets. The F1 Score ranges between 0 and 1, with 1 being the best possible score. However, it may not be suitable for all situations, as it treats Precision and Recall as equally important, which might not always be the case.
  3. F0.5 and F2 Scores vs. F1 Score: The F0.5 and F2 Scores are generalizations of the F1 Score, allowing for different weights to be assigned to Precision and Recall. The F0.5 Score weighs Precision more heavily, while the F2 Score weighs Recall more heavily. These metrics can be useful when one of the two measures is more important than the other. However, choosing the appropriate weighting requires domain knowledge and a clear understanding of the problem context.
  4. ROC and AUC vs. Accuracy, Precision, and Recall: The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (Recall) against the False Positive Rate, providing a visual representation of the trade-offs between these two metrics. The AUC (Area Under the Curve) is a single value that summarizes the overall performance of the classifier across all possible thresholds. AUC is particularly useful when dealing with imbalanced datasets or when choosing the optimal classification threshold. However, AUC can be less informative when the costs of false positives and false negatives are very different, as it does not explicitly consider these costs.

In summary, there is no one-size-fits-all evaluation metric for classification problems. The choice of metric depends on the problem context, the dataset characteristics, and the specific costs associated with false positives and false negatives. Understanding the trade-offs between different evaluation metrics is essential for selecting the most appropriate one for a given problem.
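To make the accuracy pitfall concrete, here is a minimal sketch with made-up numbers (not from any dataset in this series): a classifier that always predicts the majority class scores 95% accuracy while catching none of the positives.

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5

# A degenerate classifier that always predicts the majority class
y_pred = [0] * 100

print("Accuracy:", accuracy_score(y_true, y_pred))                     # 0.95 -- looks great
print("Recall:", recall_score(y_true, y_pred))                         # 0.0 -- every positive is missed
print("Precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0 -- no positive predictions made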

2. Dataset selection and sourcing

In this article, we will be using the Breast Cancer Wisconsin (Diagnostic) Data Set, obtained from the UCI Machine Learning Repository. This dataset contains 569 instances, each representing a breast mass, with 30 features extracted from digitized images of fine needle aspirates of breast masses. The objective of the classification problem is to predict whether a breast mass is benign (B) or malignant (M), based on these features.

The reason for selecting this dataset is that it presents a real-world classification problem with direct implications for healthcare and medical diagnosis. By working with this dataset, we can gain a better understanding of how accuracy, precision, and recall can affect the performance of a classification model, as well as the potential consequences of misclassification in a real-life setting.

The Breast Cancer Wisconsin (Diagnostic) Data Set has a relatively balanced class distribution, with 357 benign cases and 212 malignant cases. This balance allows us to focus on understanding the differences between accuracy, precision, and recall without the added complexity of dealing with imbalanced data. As we explore these metrics, we will see how each one can provide valuable insights into the performance of our classification model, and how they can be used to guide our decision-making process in different scenarios.

from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset
data = load_breast_cancer()

# Get the input features and target variable
X = data.data
y = data.target
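
One detail worth noting: in scikit-learn's copy of this dataset the labels are already numeric, with 0 for malignant and 1 for benign. A quick sanity check confirms the class distribution described above:

import numpy as np

# In load_breast_cancer, label 0 = malignant and label 1 = benign
print(data.target_names)   # ['malignant' 'benign']
print(np.bincount(y))      # [212 357] -> 212 malignant, 357 benign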

3. Brief Introduction to Classification Algorithms

Classification algorithms are a fundamental subset of machine learning techniques that enable us to categorize data into distinct classes or labels. These algorithms are widely used across various industries, including healthcare, finance, marketing, and more, to make predictions and solve real-world problems.

In this section, we will briefly introduce some of the most commonly used classification algorithms, which will help provide context for our analysis and evaluation of accuracy, precision, and recall.

  1. Logistic Regression: Logistic regression is a simple yet effective algorithm that predicts the probability of an instance belonging to a specific class. It is particularly well-suited for binary classification problems, where there are two possible outcomes. The algorithm uses a logistic function (also known as the sigmoid function) to estimate the probabilities.
  2. Decision Trees: A decision tree is a graphical representation of the possible outcomes of a decision based on certain conditions. It consists of nodes representing conditions, branches representing decision rules, and leaf nodes representing the final outcome (class labels). Decision trees can be used for both classification and regression tasks.
  3. Support Vector Machines (SVM): SVM is a powerful classification algorithm that seeks to find the best hyperplane that separates data points of different classes. This hyperplane acts as a decision boundary that maximizes the margin between the classes, allowing for better generalization to new, unseen data.
  4. Random Forest: Random forest is an ensemble learning method that constructs multiple decision trees and combines their output to make a final prediction. By aggregating the results from multiple trees, random forests can often achieve better performance than a single decision tree, reducing the risk of overfitting and improving overall accuracy.
  5. K-Nearest Neighbors (KNN): KNN is a simple, instance-based learning algorithm that classifies a new instance based on the majority class of its k-nearest neighbors in the training data. The algorithm computes the distance between the new instance and all other instances in the training data, selecting the k-closest neighbors to determine the class label.
  6. Neural Networks: Neural networks are a family of algorithms inspired by the structure and function of biological neural networks. They consist of interconnected nodes (neurons) organized into layers. Neural networks can learn complex patterns and representations, making them well-suited for various classification tasks, including image and text classification.

These are just a few examples of the many classification algorithms available. In the following sections, we will focus on evaluating the performance of these algorithms using accuracy, precision, and recall, providing you with a better understanding of how to choose the right evaluation metric for your specific problem.

4. Creating Classification Models

You can instantiate each of these models with a single line of code using the scikit-learn Python package:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Build the models
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(),
    "KNN": KNeighborsClassifier()
}
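
The snippet above only instantiates the models. Before we can evaluate anything, the data must be split into training and testing sets and, since several of these models are sensitive to feature scale, standardized. Here is a minimal sketch of those steps (the 80/20 split and random_state=42 are my assumptions, not fixed requirements):

from sklearn.model_selection import train_test_split

# Split the breast cancer data into training and testing sets (assumed 80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)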

5. Calculating Evaluation Metrics

  1. Accuracy: Accuracy is the proportion of correct predictions (both true positives and true negatives) out of the total number of instances. Formula: Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)
  2. Precision: Precision is the proportion of true positives out of the total predicted positives. Formula: Precision = True Positives / (True Positives + False Positives)
  3. Recall (Sensitivity): Recall is the proportion of true positives out of the total actual positives. Formula: Recall = True Positives / (True Positives + False Negatives)
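
These formulas map directly onto the four cells of the confusion matrix. As a quick illustration (assuming y_test and y_pred come from any fitted model), you can recover the counts with scikit-learn's confusion_matrix and compute the metrics by hand:

from sklearn.metrics import confusion_matrix

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)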

Here is how you can calculate these metrics in Python. I have also added the F1 Score, but we will cover it in detail in the next article. Again, I will be using the scikit-learn package to compute these evaluation metrics:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Create a helper function that prints the evaluation metrics for a model
def print_metrics(model_name, y_test, y_pred):
    print("Model Used:", model_name)
    print("Accuracy Score:", accuracy_score(y_test, y_pred))
    print("Precision Score:", precision_score(y_test, y_pred))
    print("Recall Score:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
    print()
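
With this helper in place, evaluating all five models is a short loop. A sketch, using the train/test split from Section 4:

# Fit each model on the training data and report its metrics on the test set
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print_metrics(model_name, y_test, y_pred)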

6. Importance of Choosing the Right Evaluation Metric

When working on a classification problem, especially in real-world scenarios, it is crucial to choose the right evaluation metric. A model’s performance can vary depending on the metric used, and certain metrics may be more relevant to a specific problem than others. In this section, we will discuss how to interpret the results of Accuracy, Precision, Recall, and F1 Score, and determine which metrics are the most important for the breast cancer classification problem.

In the context of breast cancer classification, Recall is particularly important because it measures the proportion of true positive predictions among all actual positive instances. A high Recall indicates that the classifier can identify most of the positive cases (malignant tumors), which is crucial for early diagnosis and treatment. Precision is also important, as it measures the proportion of true positive predictions among all positive predictions. A high Precision indicates that the classifier can accurately identify malignant tumors, minimizing the number of false alarms (false positives) that could cause unnecessary stress and additional medical procedures for patients.

Accuracy, while a commonly used metric, might not be the best metric for this problem due to the possible imbalance between the number of benign and malignant cases. Additionally, accuracy does not account for the different costs associated with false positives and false negatives. The F1 Score is a balanced metric that combines both Precision and Recall, which can be helpful when both metrics are considered equally important.

7. Model Comparison and Selection

Let’s analyze the performance of the five classifiers according to the evaluation metrics.

  1. Logistic Regression:
    Accuracy: 0.974
    Precision: 0.972
    Recall: 0.986
    F1 Score: 0.979
  2. Decision Tree:
    Accuracy: 0.939
    Precision: 0.944
    Recall: 0.958
    F1 Score: 0.951
  3. SVM (Support Vector Machine):
    Accuracy: 0.974
    Precision: 0.972
    Recall: 0.986
    F1 Score: 0.979
  4. Random Forest:
    Accuracy: 0.965
    Precision: 0.959
    Recall: 0.986
    F1 Score: 0.972
  5. KNN (K-Nearest Neighbor):
    Accuracy: 0.947
    Precision: 0.958
    Recall: 0.958
    F1 Score: 0.958

Based on the results, Logistic Regression and SVM have the highest F1 Scores (0.979) and tie with Random Forest for the highest Recall (0.986), while edging it out on Precision. Given the importance of Recall in the breast cancer classification problem, these two models are the top candidates for further consideration.

To choose between Logistic Regression and SVM, you can further investigate other factors such as model interpretability, training time, and prediction time. Logistic Regression tends to be more interpretable and faster to train, while SVM may provide better performance in some cases but can be more computationally expensive.

In conclusion, based on the evaluation metrics and the specific requirements of the breast cancer classification problem, Logistic Regression and SVM appear to be the best-performing models. It is essential to consider the trade-offs between different evaluation metrics, model interpretability, and computational efficiency when selecting the most appropriate model for a specific problem.

8. Summary and Conclusion

In this article, we explored the importance of understanding different evaluation metrics and their application to classification problems, specifically in the context of breast cancer classification. We discussed the significance of choosing the right evaluation metric based on the problem at hand and demonstrated the process of evaluating and comparing classification models using Accuracy, Precision, Recall, and F1 Score.

Our analysis showed that Logistic Regression and SVM models achieved the highest Recall and F1 Score, making them the top candidates for the breast cancer classification problem. Given the importance of Recall in this problem, these two models were considered the most appropriate choices. We also highlighted the trade-offs between different evaluation metrics, model interpretability, and computational efficiency when selecting the most suitable model for a specific problem.

In conclusion, understanding and selecting the appropriate evaluation metrics for a classification problem is crucial for obtaining reliable and meaningful results. By carefully analyzing and comparing different models using relevant metrics, practitioners can make informed decisions about which model is best suited for a specific task. In the case of breast cancer classification, Logistic Regression and SVM models emerged as the top performers, offering a good balance between high Recall, Precision, and F1 Score. Ultimately, the choice of the best model depends on the specific requirements and constraints of the problem, such as interpretability, training time, and prediction time.

Precision is an essential evaluation metric in business scenarios where the cost of false positives is high, and it is crucial to minimize them. For example, in email spam filtering, we want to avoid classifying legitimate emails as spam, as this may lead to important information being lost or missed. Similarly, in fraud detection, precision is critical because falsely accusing a customer of fraudulent activities could lead to loss of business and negative customer experiences.

Accuracy, on the other hand, is a suitable metric when the costs associated with false positives and false negatives are relatively balanced, and the distribution of classes in the dataset is more or less even. For example, in a marketing campaign where the objective is to predict which customers are likely to respond positively to a new product, accuracy can be an appropriate metric, as long as there is no significant imbalance in the dataset. Another example could be the classification of news articles into different categories, where both false positives and false negatives have similar consequences, and the dataset has a relatively balanced distribution of classes.

It is crucial to consider the specific requirements and constraints of a given business scenario when selecting the most appropriate evaluation metric, as this can significantly impact the effectiveness and performance of the chosen classification model.

To demonstrate another example where Precision would be more important, I used the SMS spam classification dataset and ran the same classification models on it:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Load the dataset (tab-separated, with no header row)
data = pd.read_csv("gdrive/My Drive/datasets/SMSSpamCollection/SMSSpamCollection", sep="\t", header=None, names=["label", "text"])

# Preprocess the data
data["label"] = data["label"].map({"ham": 0, "spam": 1})

# Create feature vectors using the TF-IDF approach
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data["text"])
y = data["label"]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Evaluate the models
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(),
    "KNN": KNeighborsClassifier()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Model Used: {name}")
    print("Accuracy Score:", accuracy_score(y_test, y_pred))
    print("Precision Score:", precision_score(y_test, y_pred))
    print("Recall Score:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
    print("\n")

In this spam classification problem, we want a model that avoids flagging legitimate messages as spam (high precision) while still catching as many actual spam messages as possible (high recall). Among the results, the Random Forest model performs the best in terms of balancing precision and recall, while also achieving a high accuracy score.

The Random Forest model has a perfect precision score of 1.0, meaning it doesn’t misclassify any legitimate messages as spam. The recall score of 0.8859 means that the model identifies 88.59% of the spam messages correctly. The F1 score, which is the harmonic mean of precision and recall, is also the highest among all models at 0.9395, indicating that the Random Forest model achieves the best balance between precision and recall for this problem.

Thus, the Random Forest model is the most suitable choice for this spam classification problem, considering the importance of precision and recall in this scenario.

In the upcoming parts of this article series, we will delve deeper into other classification evaluation metrics, such as F1, F0.5, and F2 scores, as well as ROC AUC curves. These metrics also play a vital role in assessing the performance of classification models in various scenarios.

In Part 2, we will explore the F1, F0.5, and F2 scores, which combine Precision and Recall in different ways, allowing us to fine-tune our model evaluation based on the importance we assign to each of these metrics. These F-scores can be particularly helpful in situations where there is an imbalance between the cost of false positives and false negatives, or when dealing with imbalanced datasets.

In the final part of the series, we will discuss the importance of ROC AUC curves, which provide a comprehensive view of a model’s performance across different classification thresholds. This metric is particularly useful in cases where the optimal decision threshold is uncertain or when we need to compare the overall performance of different models, irrespective of specific thresholds.

By understanding the different evaluation metrics and their applicability to various problem scenarios, we can make well-informed decisions when selecting the most appropriate models for our classification tasks.


Prateek Gaurav

Sr. DS Manager @ LGE | Ex - Amazon Data Scientist | Data Science Mentor | Boston University Graduate www.letsdatascience.com