Machine Learning for Credit Card Fraud Detection & Dealing with Class Imbalance

Shaun Robert Commee
8 min read · Jun 4, 2024


Introduction

Fraud detection is a common use case for Machine Learning. Given the complexity involved and, in some cases, the need for an immediate response, it is not possible for a human to perform this task at such breakneck speed. Companies therefore opt to use these models to classify transactions quickly and reduce the impact of fraudulent transactions on their customer base.

For this project, I had access to 284,807 credit card transactions with 31 variables of information, including whether each payment was considered fraudulent or not. You can gain access to this dataset here: Credit Card Fraud Detection (kaggle.com). We were not provided with the meaning of each variable, only the raw values.

DataFrame Information

Target & Feature variables

We will first establish which variable we want the target variable to be. In this case, it will be the ‘Class’ variable, and we will allocate the remainder as our feature variables. We are saying here that each of these feature variables has some impact on the target variable.

Split between Target and Feature Variables
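The original split was shown as a screenshot; a minimal sketch of the idea in pandas might look like the following (the tiny DataFrame here is a stand-in for the Kaggle file, which would normally be loaded with `pd.read_csv("creditcard.csv")`):

```python
import pandas as pd

# Toy stand-in for the credit card dataset: anonymised features plus 'Class'
df = pd.DataFrame({
    "V1":     [0.5, -1.2, 0.3, 2.1],
    "V2":     [1.1, 0.4, -0.7, 0.9],
    "Amount": [149.62, 2.69, 378.66, 123.50],
    "Class":  [0, 0, 1, 0],   # 1 = fraud, 0 = non-fraud
})

y = df["Class"]                  # target variable
X = df.drop(columns=["Class"])   # everything else becomes a feature variable
```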

Scaling

For the majority of Machine Learning models to operate efficiently, it is best practice to apply scaling to the variables. MinMaxScaling takes the range of values within each column and rescales them to lie between 0 and 1: the maximum value becomes 1, the minimum becomes 0, and the remaining values fall proportionately between the two. This positions each value relative to the range of its variable. Why does scaling the variables increase efficiency? Without scaling, features with much larger values are given more attention than those with smaller values (imagine a value in the 100,000s compared to one that lies between 1 and 10). This can create a bias towards those features that harms model performance. This is particularly the case for ML models that rely on distance measures, such as SVM/SVR and K-Nearest Neighbour.

Scaling Implementation — notice how all values lie between 0 and 1
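A short sketch of that scaling step with scikit-learn's `MinMaxScaler` (the two-column array below is illustrative, chosen so one feature is in the 100,000s and the other between 1 and 10):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 100000.0],
              [5.0, 250000.0],
              [10.0, 400000.0]])

scaler = MinMaxScaler()              # rescales each column to the range [0, 1]
X_scaled = scaler.fit_transform(X)   # column max -> 1, column min -> 0
```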

Class Imbalance

A typical concern for fraud detection is class imbalance. This refers to a situation where the outputs of the target variable are heavily weighted towards one outcome (in our case, non-fraud), which we can demonstrate for our dataset with the graph below. Why is this an issue? The Machine Learning model will naturally focus its attention on predicting the majority class: as the model adjusts its weights with respect to the error, that error is largely determined by the non-fraudulent cases, so the weights are tuned in favour of predicting those correctly. This ultimately leads to poor model performance on the fraudulent cases in the dataset.

Class Imbalance

So what can we do about this? There are multiple methods we can employ to combat this issue, but the most common involve undersampling and oversampling. With undersampling, we reduce the number of non-fraud rows in the dataset so that they match the number of fraudulent cases. A more technical approach is SMOTE (Synthetic Minority Over-Sampling Technique), which creates synthetic data points based on the characteristics of the existing fraudulent transactions.

SMOTE

This process takes each minority-class data point and considers a neighbourhood formed by that point's k nearest minority-class neighbours. It then randomly selects one of these neighbours, takes the vector between the two points, and multiplies it by a randomly generated number between 0 and 1 to place a synthetic data point somewhere along the line connecting them. Below we can see an example of this: we have set the number of neighbours to 5, and the synthetic data point is generated along the vector. Once we have implemented SMOTE, we have a balanced dataset prepared for use in Machine Learning models.

SMOTE process demonstrated graphically
Code in Python & Graph showing balance in target variable
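In practice this is usually done with `SMOTE` from the imbalanced-learn library, as in the screenshot above; to make the mechanism concrete, here is a minimal NumPy sketch of the interpolation step just described (the function name and toy data are mine, not the library's):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(minority, k=5):
    """Generate one synthetic point per minority sample (minimal SMOTE sketch)."""
    synthetic = []
    for x in minority:
        # distances from x to every minority point (index 0 of argsort is x itself)
        d = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]        # the k nearest neighbours
        nb = minority[rng.choice(neighbours)]      # pick one neighbour at random
        gap = rng.random()                         # random number in [0, 1)
        synthetic.append(x + gap * (nb - x))       # point along the line x -> nb
    return np.array(synthetic)

minority = rng.normal(size=(20, 2))   # toy stand-in for the fraud-class points
new_points = smote_sample(minority)   # 20 synthetic fraud-like points
```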

Model Deployment & Data Splitting

We will first need to establish our training and test data. The training data is the portion used to train the model to learn the relationships that determine whether a transaction is fraudulent or not. We hold back a section of the overall data to test the model's accuracy on unseen data and check whether it is suitable for deployment. We can alter the size of the test set by changing the test_size parameter. The most common splits are 70/30, 75/25 and 80/20.

Splitting of the data-set
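A sketch of that split with scikit-learn's `train_test_split` (the toy arrays are illustrative; `stratify=y` is an optional extra that keeps the class ratio equal in both halves):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)        # 20 toy samples, 2 features each
y = np.array([0] * 10 + [1] * 10)       # balanced toy labels

# 80/20 split controlled by the test_size parameter
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```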

For this project we will use K-Nearest Neighbour, Random Forest and Logistic Regression and compare their performance. So, first, let's implement Logistic Regression.

Logistic Regression

A logistic regression model estimates the probability of an event given a set of variables. It begins by learning a linear relationship between the feature variables and the target. This process produces coefficients that highlight how much impact each feature variable has on the target: a one-unit increase in a feature changes the log-odds of the outcome by the coefficient's value (taking V5 for example, a one-unit increase in V5 increases the log-odds of fraud by 23.19). We multiply each transaction's feature values by these weights and sum them to get an overall output, then apply the sigmoid function to map this value to between 0 and 1. We can then use this value to classify: in our case, a value that exceeds the threshold of 0.5 will be considered fraud, and anything below it non-fraud.

Logistic Regression curve on the left and coefficient values on the right
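The weighted-sum-then-sigmoid step can be sketched in a few lines of NumPy; the coefficients, bias and feature values below are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned coefficients and one transaction's feature values
weights = np.array([0.8, -1.5, 2.0])
features = np.array([0.2, 0.5, 0.9])
bias = -0.3

z = np.dot(weights, features) + bias   # linear combination of the features
p = sigmoid(z)                         # probability between 0 and 1
label = int(p > 0.5)                   # 1 = fraud if above the 0.5 threshold
```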

Code Demonstration

Below we have a Python demonstration of how to implement Logistic Regression and generate a model that produces predictions.

Logistic Regression code
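The original code appears as a screenshot; a minimal self-contained sketch along the same lines, with `make_classification` standing in for the balanced fraud dataset, might be:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the SMOTE-balanced transaction data
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)   # applies the 0.5 threshold internally

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```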

Prediction Evaluation

Below we have a confusion matrix showing the volume of errors made on the data. We can see there are instances where the model predicted that fraud wasn't committed when the transaction was in fact fraudulent; this occurred in 6089 instances across the dataset. From the classification report we can establish that we still achieved 95% accuracy in allocating cases correctly. However, the error rate was around four times higher for fraud instances than for non-fraud.

K-Nearest Neighbour

Another commonly used Machine Learning method is K-Nearest Neighbour. It uses a methodology similar to SMOTE but with a subtle difference: here, we look at the k closest neighbours to the tested data point and determine the output by a majority vote within that neighbourhood. From the figure below we can see that a new data point is entered (represented by green) and, depending on the number of neighbours we choose, we classify it according to the class with the highest representation. In this case, with the number of neighbours set to 4, we would classify the new data point as blue; with the number of neighbours set to 9, we would classify it as red (5 red vs 4 blue).

K-Nearest Neighbour

Code Demonstration

We can see below that we have set the number of neighbours to 5, giving 99.9% accuracy on the test set. Reviewing the confusion matrix, we achieved 100% accuracy in identifying fraudulent transactions, but there were 140 instances in which we predicted fraud and the transaction turned out to be non-fraudulent.
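The screenshot's KNN step could be sketched as follows, again with synthetic data standing in for the balanced fraud dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the balanced transaction data
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=5)   # majority vote of the 5 nearest points
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)             # accuracy on unseen data
```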

Random Forest

This Machine Learning technique employs multiple Decision Trees and uses a voting classifier across all of them to determine the classification output. For more on Decision Trees, you can review another article I have written here: Decision Trees & Random Forest | by Shaun Robert Commee | May, 2024 | Medium

Code Implementation

We can see below that there is again a 99.9% accuracy and fewer incorrect outputs overall; however, there are 4 instances that we incorrectly classified as non-fraudulent but which turned out to be fraudulent.

Random Forest also has a useful component that enables us to inspect the importance of each feature in the decision-making process. We can therefore see which variables require the most attention, and here we observe that V10 and V14 have the largest impact on whether the output is fraud or not.

Feature Importance
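A sketch of the Random Forest fit and the feature-importance readout via scikit-learn's `feature_importances_` attribute (synthetic data again, with the "V" column names added only to mirror the dataset):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with 5 features, labelled V1..V5 to mirror the dataset
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
cols = [f"V{i}" for i in range(1, 6)]

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Importances sum to 1; higher values mean more influence on the split decisions
importances = pd.Series(rf.feature_importances_, index=cols)
print(importances.sort_values(ascending=False))
```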

Conclusion

We have observed that to produce effective models for fraud detection we must first ensure a balance between the outputs of the target variable. This is essential so that Machine Learning models are not trained in a biased manner towards non-fraudulent outcomes. We established that we can do this in a number of ways: undersampling or oversampling. With undersampling, we reduce the number of non-fraudulent outputs to match the fraudulent ones. In our case, we implemented oversampling in the form of SMOTE, which creates synthetic data points based on the characteristics of those we already have available.

We established that the Machine Learning model that worked most effectively was K-Nearest Neighbour, with a 100% success rate in identifying the fraudulent transactions. Despite Random Forest's lower overall error, there were 4 instances in which fraudulent transactions went through unidentified.
