Handling Imbalanced Datasets in Fraud Detection with Python: A Comprehensive Guide to Logistic Regression

Step-by-Step Approach for Accurate and Effective Fraud Detection

Summer He
8 min read · Mar 28, 2023
Image by Author via Python

Introduction

Fraud detection is a crucial task for many industries, including banking, insurance, and e-commerce. As fraudulent activities become increasingly sophisticated, traditional rule-based methods may no longer be sufficient to detect fraudulent transactions. This is where machine learning techniques, such as logistic regression, can provide a more accurate and efficient solution. In this comprehensive guide, we will walk through the implementation of logistic regression for fraud detection using the popular Sklearn library in Python. We will take an end-to-end approach, using an open-source dataset from Kaggle and demonstrating the steps involved in building a logistic regression model, from data preprocessing to evaluation.

By the end of this article, you will:

  1. Understand how to implement logistic regression for fraud detection using Sklearn in Python.
  2. Know the necessary preprocessing steps, such as data cleaning, normalization, and feature scaling, required before building the model.
  3. Be able to handle imbalanced datasets and evaluate model performance using relevant metrics such as accuracy, precision, recall, and the confusion matrix.

Table of Contents

  1. What is Logistic Regression
  2. Problem Statement
  3. Import Data and Python Packages
  4. Exploratory Data Analysis
  5. Logistic Regression and Confusion matrix
  6. Overfitting and Learning Curve
  7. Conclusion

What is Logistic Regression

Logistic regression is a machine learning algorithm used for binary classification tasks, where the goal is to predict the probability of an observation belonging to one of the two classes (e.g., Yes or No, 1 or 0). It is a type of regression analysis that estimates the probability of an event occurring based on one or more independent variables.

Logistic regression uses a logistic function (also known as a sigmoid function) to map the output of a linear equation to a probability between 0 and 1, allowing it to predict the likelihood of a binary outcome from the input values. The algorithm learns the coefficients of the independent variables that maximize the likelihood of the observed data and then uses these coefficients to make predictions on new data.
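
To make the mapping concrete, here is a minimal sketch of the sigmoid function applied to a linear combination of inputs; the coefficient and feature values are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    # Map any real-valued number to the (0, 1) range
    return 1 / (1 + np.exp(-z))

# Illustrative linear combination: z = b0 + b1*x1 + b2*x2
b0, b1, b2 = -1.5, 0.8, 2.1   # hypothetical learned coefficients
x1, x2 = 0.5, 1.0             # hypothetical feature values
z = b0 + b1 * x1 + b2 * x2

print(sigmoid(z))             # probability of the positive class, roughly 0.73 here
```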

Logistic regression is widely used in various fields, including finance, marketing, and healthcare, for tasks such as predicting customer churn, detecting fraud, and diagnosing diseases.

Problem Statement

The problem at hand is to predict fraud using credit card data. In this article, we will provide an end-to-end example of how to implement logistic regression for fraud detection using the publicly available credit card dataset from Kaggle.

Step-1: Import Data and Python Packages

Let’s first import all the required packages in Python.

Then, we load the fraud dataset using the pandas read_csv function.
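
A minimal sketch of these two steps might look like the following; the file name creditcard.csv and its location are assumptions based on the standard Kaggle credit card fraud dataset:

```python
import numpy as np
import pandas as pd

# Load the credit card fraud dataset
# (the file name "creditcard.csv" and its path are assumptions)
df = pd.read_csv("creditcard.csv")

print(df.shape)   # the standard Kaggle dataset has 31 columns: Time, V1-V28, Amount, Class
print(df.head())
```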

Step-2: Exploratory Data Analysis

After loading the dataset, it is important to gain a deep understanding of the data before building any model. This process is known as exploratory data analysis (EDA) and typically includes the following steps:

  1. Data Inspection: This involves checking the dimensions of the dataset, the types of variables, and the presence of any missing values or outliers.
  2. Data Visualization: This step includes visualizing the data using various plots such as histograms, scatter plots, box plots, and correlation matrices. Visualization helps in understanding the distribution of the data, identifying any patterns or trends, and detecting outliers.
  3. Data Cleaning: This step involves handling missing values, outliers, and data inconsistencies. Depending on the nature of the data, this may involve imputation, deletion, or correction of the data.
  4. Feature Engineering: This involves creating new features from the existing ones or selecting relevant features for the analysis. Feature engineering aims to extract the most useful information from the data that can be used to improve the performance of the model.
  5. Data Transformation: This step involves transforming the data to meet the assumptions of the model. This may include scaling, normalization, or encoding categorical variables.

By conducting an EDA, we can gain insights into the data and prepare it for the modeling phase, which includes selecting an appropriate algorithm, training the model, and evaluating its performance.

As an illustration, let us visualize boxplots of the numerical features, grouped by the target class, using syntax along the lines shown below.
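
A sketch like the following would produce a similar figure; the choice of seaborn and of the V1-V4 columns is an assumption:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("creditcard.csv")  # path is an assumption

# Boxplots of the first few anonymized features, grouped by the target Class
features = ["V1", "V2", "V3", "V4"]
fig, axes = plt.subplots(1, len(features), figsize=(16, 4))
for ax, col in zip(axes, features):
    sns.boxplot(data=df, x="Class", y=col, ax=ax)
    ax.set_title(f"{col} by Class")
plt.tight_layout()
plt.show()
```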

Image by Author via Python

Based on the information above, we can observe that Class (1) and Class (0) have significantly different distributions among the V1, V2, V3, and V4 features. This finding is a good indicator for us to consider in the model-building process, as it suggests that these features may have a significant impact on predicting the outcome variable, which in this case is the presence of fraud. By taking this information into account, we can make more informed decisions about which features to include in our model and how to preprocess the data before training the model.

After conducting exploratory analysis on the independent variables, we can also visualize the target class to gain insights into its distribution. The bar plot shows that the dataset is predominantly composed of non-fraudulent transactions, indicating an imbalanced class problem. The code snippet output below displays the baseline accuracy of 0.9982: if we simply classified every record as Class (0), we would still achieve an accuracy of 99.82%. This class imbalance can make accuracy misleading and reduce the effectiveness of our models.
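
A minimal sketch of how the class distribution and the baseline accuracy could be computed; the plotting style is an assumption:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("creditcard.csv")  # path is an assumption

# Bar plot of the target class distribution
df["Class"].value_counts().plot(kind="bar")
plt.xlabel("Class")
plt.ylabel("Number of transactions")
plt.show()

# Baseline accuracy: always predict the majority class (non-fraud)
baseline_accuracy = df["Class"].value_counts(normalize=True).max()
print(f"Baseline accuracy: {baseline_accuracy:.4f}")   # about 0.998 on this dataset
```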

Image by Author via Python

To address the issue of imbalanced classes in our fraud detection analysis, we need to consider not only the accuracy of our model but also its recall, which measures how often the model correctly predicts fraud when it actually occurs.

One way to improve recall is through random sampling techniques such as oversampling and undersampling. Oversampling involves duplicating samples from the minority class, while undersampling involves deleting samples from the majority class. However, both approaches have their drawbacks, with oversampling potentially leading to overfitting and undersampling resulting in the loss of important information.

Step-3: Logistic Regression and Confusion matrix

In this step, our goal is to compare the accuracy and recall of the classifier under different sampling strategies: no sampling, oversampling, and undersampling. To achieve this, we start by splitting the dataset into training and testing sets using the train_test_split() function. We also use StandardScaler to standardize the feature values to a common scale, which helps improve the performance of the model.
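
A minimal sketch of this step; the 70/30 split ratio, the stratify option, and the random_state value are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("creditcard.csv")  # path is an assumption

X = df.drop(columns=["Class"])
y = df["Class"]

# Split into training and testing sets (70/30 split and random_state are assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```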

Image by Author via Python

Then, we apply the undersampling and oversampling approaches to create two new training datasets for modeling. The idea is that applying a modest amount of oversampling to the minority class reduces the model's bias against those examples, while a fair amount of undersampling on the majority class reduces the bias toward the majority class examples.
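
One common way to do this is with the RandomOverSampler and RandomUnderSampler classes from the imbalanced-learn library; whether the original code used this library is an assumption. The sketch below continues from the training split created above:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority (fraud) class by duplicating its samples
over_sampler = RandomOverSampler(random_state=42)
X_train_over, y_train_over = over_sampler.fit_resample(X_train, y_train)

# Undersample the majority (non-fraud) class by dropping some of its samples
under_sampler = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = under_sampler.fit_resample(X_train, y_train)

# By default both samplers balance the classes 1:1;
# the sampling_strategy parameter can be used for a more modest ratio
print(Counter(y_train_over))
print(Counter(y_train_under))
```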

Image by Author via Python

Now, we will proceed to train our models. We will not be passing any parameters to LogisticRegression() and assume default parameters. However, it’s essential to know some of the crucial parameters:

  • penalty: Default = 'l2' — specifies the norm used for the penalty term
  • C: Default = 1.0 — inverse of regularization strength (smaller values mean stronger regularization)
  • solver: Default = 'lbfgs' — the optimization algorithm

Here’s the code implementation for training and testing the logistic regression models.
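
A minimal sketch of how the two models could be trained and evaluated, continuing from the scaled split and the undersampled data created above; the helper function train_and_evaluate is hypothetical:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def train_and_evaluate(X_tr, y_tr, X_te, y_te, label):
    # Hypothetical helper: fit a default LogisticRegression and report metrics
    # (increase max_iter if a convergence warning appears)
    model = LogisticRegression()
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    print(f"--- {label} ---")
    print("Accuracy:", accuracy_score(y_te, y_pred))
    print("Confusion matrix:\n", confusion_matrix(y_te, y_pred))
    print(classification_report(y_te, y_pred))
    return model

# Model 1: no sampling
train_and_evaluate(X_train, y_train, X_test, y_test, "No sampling")

# Model 2: undersampled training data, evaluated on the untouched test set
train_and_evaluate(X_train_under, y_train_under, X_test, y_test, "Undersampling")
```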

Model 1 — no sampling, with default parameters in LogisticRegression

We report the overall accuracy, confusion matrix, precision, and recall below:

Image by Author via Python

Although the overall accuracy is 99.91%, which seems excellent, the recall for Class (1) is only 62%, meaning we can identify only 62% of actual fraud cases.

Model 2 — undersampling, with the same default parameters

We report the overall accuracy, confusion matrix, precision, and recall below:

Image by Author via Python

Although the overall accuracy dropped to 90.86%, the recall for Class (1) increased significantly to 94%; in this case, we can identify 94% of all fraud cases.

The oversampling part will be very similar to the undersampling one, and I won’t show it here as an example, but feel free to take it as homework for practice.

Up to this point, we have demonstrated that undersampling and oversampling can be effective in addressing the imbalanced dataset issue in classification models. In reality, imbalanced data is a common occurrence, and it is crucial to always identify and address this issue before constructing any model.

Step-4: Overfitting and Learning Curve

As a final step, it is important to check the learning curve to ensure that our undersampling model is not overfitting. This helps us to determine whether the model is learning from the data effectively or if it is just memorizing the training data.
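
Scikit-learn's learning_curve function can generate the data for such a plot. Below is a minimal sketch, continuing from the undersampled training data above; the cv value and train_sizes grid are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Learning curve on the undersampled training data (cv and train_sizes are assumptions)
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(),
    X_train_under, y_train_under,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
)

plt.plot(train_sizes, train_scores.mean(axis=1), "o-", label="Training score")
plt.plot(train_sizes, val_scores.mean(axis=1), "o-", label="Cross-validation score")
plt.xlabel("Training set size")
plt.ylabel("Score")
plt.legend()
plt.show()
```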

The learning curve for the undersampled default logistic regression model:

Image by Author via Python

Based on the learning curve plot, it appears that the training score and cross-validation score are converging, indicating that our model is not overfitting the training data. Therefore, we can conclude that the undersampling approach has effectively addressed the imbalanced dataset problem. To further improve the performance of our model, we could explore hyperparameter tuning by adjusting the value of C or trying out different solvers available in LogisticRegression().
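
As a starting point, a small grid search over C and the solver could look like the sketch below; the grid values, the recall scoring choice, and the cv setting are assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical search grid over regularization strength and solver
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "solver": ["lbfgs", "liblinear"],
}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="recall",  # recall matters most for fraud detection
    cv=5,
)
grid.fit(X_train_under, y_train_under)

print(grid.best_params_, grid.best_score_)
```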

Conclusion

We trust that our tutorial has provided you with a comprehensive understanding of how to implement logistic regression with Sklearn in Python. You have gained knowledge on what logistic regression is, how to construct models, interpret results, and some essential theoretical concepts. Additionally, we have covered important topics such as imbalanced datasets, accuracy, Recall, and confusion matrix. We hope that this tutorial has been helpful, and we encourage you to continue exploring the vast applications of logistic regression in the field of data science.

Key takeaways from this article

  1. Sklearn is a popular library in Python for implementing logistic regression models.
  2. Preprocessing steps such as data cleaning, normalization, and feature scaling are crucial before building the model.
  3. Imbalanced datasets can affect the model’s accuracy, and techniques such as oversampling and undersampling can help overcome this issue.
  4. Evaluating the model using metrics such as accuracy, precision, recall, and the confusion matrix is essential in understanding the model’s performance.

P.S.

I have compiled a collection of my past visualizations which follow a consistent format that includes a randomly generated dataset and corresponding syntax to create the different types of charts. Please feel free to suggest any visualization topics that you’d like me to prioritize on my list. If you found this article interesting, kindly consider following me on Medium. Enjoy reading and coding!

Visualizations with Python (12 stories)
