Mastering Classification Metrics: A Beginner's Guide [Part 3: Importance of ROC-AUC Curves]

Prateek Gaurav
10 min read · Mar 30, 2023

--

Chapter 3: “Evaluating Imbalanced Data: The Importance of ROC-AUC Curves (MCC, Balanced Accuracy & Cohen’s Kappa)”

Chapter 1: “Understanding Basic Classification Metrics: Accuracy, Precision, and Recall”
Chapter 2: “Balancing Precision and Recall: F1, F0.5, and F2 Scores Explained”
Colab File: Colab File on GitHub
Dataset: Bank Marketing Data

1. Introduction

In the previous two parts of this article series, we explored various classification metrics and their significance in machine-learning problems. In Part 1, we covered the basics of classification metrics, including accuracy, precision, and recall. We demonstrated their application using a spam classification dataset and a breast cancer dataset. In Part 2, we delved deeper into F-Scores (F1, F0.5, and F2), understanding their trade-offs and when to use them in different scenarios. We used a credit card fraud dataset to illustrate the importance of F-Scores in imbalanced classification problems.

In this final part of the series, we will shift our focus to another essential evaluation metric: the Receiver Operating Characteristic (ROC) curve and its Area Under the Curve (AUC). The ROC-AUC is particularly useful for evaluating classification models in imbalanced datasets, as it helps us understand the trade-offs between true positive rates and false positive rates. In this article, we will cover:

  1. The intuition behind the ROC curve and AUC, and why they are essential in classification problems.
  2. How to calculate and interpret the ROC-AUC metric using a real-world dataset.
  3. Comparing various classification models using ROC-AUC scores.

For this article, we will be using the “Bank Marketing” dataset, which contains information about a direct marketing campaign of a Portuguese banking institution. The goal of this dataset is to predict whether the client will subscribe to a term deposit, making it an ideal case study for demonstrating the use of ROC-AUC curves in an imbalanced classification problem.

Stay tuned as we unravel the concepts and applications of the ROC-AUC metric and learn how it can help you make more informed decisions when selecting classification models for imbalanced datasets.

2. Dataset Selection and Sourcing

In this article, we will be using the “Bank Marketing” dataset to demonstrate the importance of ROC-AUC curves in evaluating classification models for imbalanced datasets. The dataset contains information about a direct marketing campaign conducted by a Portuguese banking institution. The goal is to predict whether a client will subscribe to a term deposit, making it a binary classification problem.

The dataset includes various features related to the clients and the marketing campaign, such as age, job, marital status, education, loan status, and contact duration, among others. The target variable, ‘y’, indicates whether the client subscribed to the term deposit (1 for ‘yes’ and 0 for ‘no’). The dataset is imbalanced, with a higher number of clients not subscribing to the term deposit, making it a suitable choice to illustrate the value of ROC-AUC curves in such scenarios.

The “Bank Marketing” dataset can be sourced from the UCI Machine Learning Repository at the following link:

https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

To use this dataset in your project, download the ‘bank-additional-full.csv’ file, which contains the complete dataset with all features and instances.

In the next section, we will discuss the necessary preprocessing steps to clean and prepare the data for model building and evaluation.

3. Data Preprocessing

Before building and evaluating our classification models, it is essential to preprocess the dataset to ensure that it is clean and prepared for model training. In this section, we will cover the following preprocessing steps:

  1. Handling categorical variables
  2. Handling missing values
  3. Scaling numerical features
  4. Splitting the dataset into training and testing sets

Let’s start by loading the dataset and inspecting its structure:

# Import pandas for data handling
import pandas as pd

# Load the dataset (note: the CSV from the UCI repository is semicolon-separated)
data = pd.read_csv('gdrive/My Drive/datasets/Mastering Classification Metrics Medium/Bank Marketing Dataset/bank-additional-full.csv', sep=';')
data
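
Since the whole point of this article is evaluating models on an imbalanced target, a quick way to confirm the imbalance right after loading is to look at the class distribution of 'y'. Here is a minimal sketch using the dataframe loaded above:

# Check the class distribution of the target variable to confirm the imbalance
print(data['y'].value_counts())
print(data['y'].value_counts(normalize=True))  # proportions of 'no' vs 'yes'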

1. Handling categorical variables

The dataset contains several categorical variables that need to be converted into a numerical format for machine learning algorithms. We can use one-hot encoding to transform these categorical features into binary columns.

# One-hot encode categorical variables
data = pd.get_dummies(data, columns=['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome'], drop_first=True)

2. Handling missing values

After encoding the categorical variables, we should check for any missing values in the dataset and decide how to handle them. In this dataset, missing information is represented by the string ‘unknown’ rather than by NaN, so a standard null check will not flag it; after one-hot encoding, ‘unknown’ simply becomes its own binary column. If we wanted to treat these entries as missing, we could drop those instances or impute them with an appropriate method.

# Check for missing values
missing_values = data.isnull().sum()
print("Missing values:", missing_values)

# isnull() reports zero missing values: the 'unknown' entries are ordinary strings,
# not NaN, so no imputation or row dropping is needed here.
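
If we wanted to quantify how much 'unknown' information there is, a quick check like the following could be run on the raw dataframe before one-hot encoding (a sketch; it reloads the file from the same path used above, since the working dataframe has already been encoded):

# Count 'unknown' entries per column in the raw data (before get_dummies),
# because one-hot encoding turns 'unknown' into its own binary column.
raw = pd.read_csv('gdrive/My Drive/datasets/Mastering Classification Metrics Medium/Bank Marketing Dataset/bank-additional-full.csv', sep=';')
print((raw == 'unknown').sum().sort_values(ascending=False))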

3. Scaling numerical features

To ensure that all numerical features are on a comparable scale and contribute equally to the model, we should rescale them, for example with Min-Max scaling (normalization) or Standard Scaling (standardization). Here we use scikit-learn's StandardScaler.

# Scale numerical features
from sklearn.preprocessing import StandardScaler

numerical_features = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
scaler = StandardScaler()
# Note: for simplicity the scaler is fit on the full dataset here; strictly,
# it should be fit on the training split only to avoid information leakage.
data[numerical_features] = scaler.fit_transform(data[numerical_features])

4. Splitting the dataset into training and testing sets

Finally, we need to split the dataset into training and testing sets to evaluate our models’ performance on unseen data.

# Import the train/test splitting utility
from sklearn.model_selection import train_test_split

# Split the dataset into features (X) and target (y), mapping the target to 1/0
X = data.drop('y', axis=1)
y = data['y'].map({'yes': 1, 'no': 0})

# Split into training and testing sets (stratified to preserve the class ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

With the data preprocessed, we can now move on to the model-building and evaluation stage, where we will apply various classification algorithms and assess their performance using ROC-AUC curves.

4. Model Building and Evaluation

In this section, we will build and evaluate different classification models using preprocessed data. We will employ the following algorithms:

  1. Logistic Regression
  2. Decision Tree
  3. Support Vector Machines (SVM)
  4. Random Forest
  5. K-Nearest Neighbors (KNN)

For each model, we will calculate the ROC-AUC score to assess its performance on the imbalanced dataset.

# Import the classifiers, the ROC-AUC metric, and a progress bar
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
from tqdm import tqdm

# Define the models
models = {
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(random_state=42, probability=True),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier()
}

# Train and evaluate the models
for name, model in tqdm(models.items(), desc="Training Models"):
    model.fit(X_train, y_train)
    # Probability of the positive class is needed for ROC-AUC
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    auc_score = roc_auc_score(y_test, y_pred_proba)
    print(f"Model: {name}\nROC-AUC Score: {auc_score}\n")

After training and evaluating each model, we will have their respective ROC-AUC scores. This metric will help us compare the performance of different models on the imbalanced dataset and understand which classifier works best for this particular problem.

In the next section, we will dive deeper into ROC-AUC curves and understand how they help assess the performance of classification models on imbalanced datasets.

5. Understanding ROC-AUC Curve

In this section, we will discuss the ROC-AUC curve and how it can be useful in evaluating classification models, particularly in imbalanced datasets.

The ROC (Receiver Operating Characteristic) curve is a graphical representation of a model’s ability to differentiate between classes. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification threshold levels. The AUC (Area Under the Curve) represents the degree to which the model can distinguish between the classes, with a value of 1 indicating perfect classification and a value of 0.5 indicating random chance.

In the context of our bank marketing dataset, the ROC-AUC curve helps us understand the trade-off between identifying clients who will subscribe to a term deposit (true positives) and incorrectly classifying clients who won’t subscribe as positive (false positives).

The plot itself is known as the ROC curve, and the AUC (Area Under the Curve) is a metric derived from the ROC curve.

ROC curve: The ROC (Receiver Operating Characteristic) curve is a graphical representation that plots the True Positive Rate (TPR, also known as sensitivity or recall) against the False Positive Rate (FPR, also known as 1-specificity) at various classification threshold levels. It helps visualize the trade-off between correctly classifying positive instances (true positives) and incorrectly classifying negative instances as positive (false positives).
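
In formula terms, with TP, FP, TN, and FN denoting true positives, false positives, true negatives, and false negatives: TPR = TP / (TP + FN) and FPR = FP / (FP + TN).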

AUC: The AUC (Area Under the Curve) is a metric that measures the area under the ROC curve. It represents the model’s ability to distinguish between the classes, with a value of 1 indicating perfect classification and a value of 0.5 indicating random chance. The AUC can be used to compare different models and is particularly useful when dealing with imbalanced datasets.
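
To make this concrete, here is a minimal sketch of how the ROC curves and AUC values for the models trained in the previous section can be plotted with scikit-learn and matplotlib (it assumes the fitted `models` dictionary, `X_test`, and `y_test` defined above):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

plt.figure(figsize=(8, 6))
for name, model in models.items():
    # Predicted probability of the positive class (subscribed = 1)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, y_pred_proba):.3f})")

# The diagonal corresponds to random guessing (AUC = 0.5)
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curves on the Bank Marketing Test Set')
plt.legend()
plt.show()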

[Figures: ROC curves for Logistic Regression, Decision Tree, SVM, Random Forest, and KNN]

Based on the results obtained, we can see the following ROC-AUC scores for each model:

  • Logistic Regression: 0.9424
  • Decision Tree: 0.7413
  • SVM: 0.9148
  • Random Forest: 0.9473
  • KNN: 0.8767

The Random Forest model has the highest ROC-AUC score, indicating the best performance among the models in terms of distinguishing between clients who will and won’t subscribe to a term deposit. On the other hand, the Decision Tree model has the lowest ROC-AUC score, indicating a weaker performance in comparison.

6. Summary and Conclusion

In this article, we have focused on the importance of the ROC-AUC curve for evaluating classification models, particularly when dealing with imbalanced datasets. We have demonstrated this using the bank marketing dataset, which involved predicting whether a client would subscribe to a term deposit.

We began by discussing what was covered in the previous articles of this series, where we examined precision, recall, accuracy, and F-scores. Then, we introduced the bank marketing dataset and discussed preprocessing steps. Next, we built and evaluated five different classification models, comparing their ROC-AUC scores. Finally, we explained the ROC-AUC curve and its significance in model evaluation.

To summarize, the ROC-AUC curve is an essential metric when dealing with imbalanced datasets, as it provides a comprehensive view of the model’s performance in classifying instances from both classes. In our example, the Random Forest model achieved the highest ROC-AUC score, suggesting the best performance in distinguishing between clients who will and won’t subscribe to a term deposit.

We have now covered the most commonly used evaluation metrics. However, there are a few additional metrics we might consider (a short code sketch computing all three follows this list):

  1. Matthews Correlation Coefficient (MCC): The MCC is a measure of the quality of binary classification that takes into account true and false positives and negatives. It ranges from -1 (inverse prediction) to +1 (perfect prediction), with 0 indicating random chance. The MCC is particularly useful for imbalanced datasets.
  2. Balanced Accuracy: Balanced accuracy calculates the average of the recall (sensitivity) obtained on each class and is useful for imbalanced datasets. It ranges from 0 to 1, with 1 indicating perfect classification and 0.5 indicating random chance.
  3. Cohen’s Kappa: Cohen’s Kappa is a statistic that measures inter-rater agreement for categorical items. It is used to assess the performance of a classification model compared to a random baseline. The Kappa statistic ranges from -1 to +1, with +1 indicating perfect agreement, 0 indicating no better agreement than chance, and negative values indicating agreement worse than random chance.
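
As a rough sketch of how these three metrics can be computed with scikit-learn (using the fitted `models`, `X_test`, and `y_test` from Section 4; note that they require hard class predictions rather than probabilities):

from sklearn.metrics import matthews_corrcoef, balanced_accuracy_score, cohen_kappa_score

for name, model in models.items():
    y_pred = model.predict(X_test)  # hard class predictions at the default threshold
    mcc = matthews_corrcoef(y_test, y_pred)
    bal_acc = balanced_accuracy_score(y_test, y_pred)
    kappa = cohen_kappa_score(y_test, y_pred)
    print(f"{name}: MCC = {mcc:.3f}, Balanced Accuracy = {bal_acc:.3f}, Cohen's Kappa = {kappa:.3f}")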

These additional metrics can provide further insight into our classification model’s performance, especially when dealing with imbalanced datasets or when a more nuanced understanding of the model’s behavior is required. Please check the Colab file, where I have also applied these three metrics to our Bank Marketing dataset; the output is shown below:

[Figure: ROC-AUC Score, MCC, Balanced Accuracy, and Cohen’s Kappa metrics on the Bank Marketing dataset]

By considering these additional metrics, we can confirm that the Random Forest model is indeed the best-performing model among the ones tested, as it has the highest scores for all the evaluation metrics, including ROC-AUC. This reinforces our previous interpretation based on just the ROC-AUC scores.

Most Important Takeaways From This Series of Articles

  1. Understanding the metrics: Different metrics like accuracy, precision, recall, F-scores, and ROC-AUC are crucial for evaluating the performance of classification models. Each metric has its unique characteristics, making it more suitable for specific scenarios.
  2. Importance of context: The choice of the appropriate metric largely depends on the context of the problem you’re trying to solve. For example, precision and recall are essential when dealing with imbalanced datasets or when the cost of false positives and false negatives are significantly different.
  3. F-scores for balancing precision and recall: F-scores, such as F1, F0.5, and F2, provide a single metric that balances the trade-off between precision and recall. The choice of which F-score to use depends on the importance you place on precision vs. recall in your specific application.
  4. ROC-AUC for evaluating classifier performance: The ROC-AUC curve is a powerful visualization tool for evaluating classifier performance across various decision thresholds. The AUC (Area Under the Curve) value provides a single scalar value representing the model’s performance, with higher values indicating better classification.
  5. Additional metrics for specific scenarios: Metrics like Matthews Correlation Coefficient, Balanced Accuracy, and Cohen’s Kappa can be used in specific scenarios or when dealing with imbalanced datasets to provide a more nuanced understanding of the model’s behavior.

--

Prateek Gaurav

Sr. DS Manager @ LGE | Ex - Amazon Data Scientist | Data Science Mentor | Boston University Graduate www.letsdatascience.com