Credit Card Fraud Detection. An Exercise In Class Imbalance.

Zain Khan
Published in Analytics Vidhya · 7 min read · Oct 7, 2020

‘Criminals thrive on the indulgence of society’s understanding’ — Henri Ducard (Ra’s Al Ghul)

That’s the energy I channeled for this data science project. To detect fraud, you must first understand why it came to be.

Well, not really. That was quite excessive. Over the top, really.

(*clears throat*) There are many other ways to detect fraud, which I will dive into in this article. I chose fraud detection because I wanted a deeper understanding of class imbalances in datasets. A class imbalance is when one class in our data (in this case, actual fraudulent transactions) is heavily underrepresented.

This makes intuitive sense. Fraud happens often, but in the bigger picture it makes up only a tiny percentage of overall transactions. So how do we create classification models that can detect fraudulent transactions?

Dataset:

For this project, I’ll be using an anonymised dataset of credit card transactions made in September 2013. We have 30 features in the dataset, 28 of which were created using Principal Component Analysis to protect personal information. That means we won’t be able to infer much about those variables themselves. The other two variables are ‘Time’ and ‘Amount.’

A quick peek at the head of our dataframe.

All in all, we have 284,807 rows of data available to us and no null values. That saves us some time with data cleaning and means we can jump right in.
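Loading the data and sanity-checking it looks roughly like the sketch below (the filename creditcard.csv is the standard Kaggle name for this dataset and an assumption on my part):

```python
import pandas as pd

# Assumed filename: the Kaggle credit card fraud dataset ships as 'creditcard.csv'
df = pd.read_csv('creditcard.csv')

print(df.shape)                  # (284807, 31): 30 features plus the 'Class' target
print(df.isnull().sum().sum())   # 0: no null values, so no imputation needed
df.head()                        # quick peek at the first few rows
```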

EDA:

Since we don’t have much information about what our variables are and how they relate intuitively to credit card transactions, we need to see if data visualisation can help us understand certain relationships and trends.

Correlation heat map:
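The heat map itself is essentially a one-liner with seaborn (a minimal sketch of the plotting code, not the exact styling of the figure):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations across all columns, drawn as a heat map
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), cmap='coolwarm', center=0)
plt.title('Correlation heat map')
plt.show()
```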

No real findings here apart from a few stronger correlations involving Time and Amount.

Analysing individual features:

I wanted to use this section to find if there were any obvious individual columns that could help with future feature engineering.

Testing V3, Time and Amount for all fraudulent transactions. I can’t draw any obvious conclusions from this dataframe.

Let’s look at individual data points.

Testing V5, Time and Amount for all fraudulent transactions. It’s interesting that there were two transactions with an Amount under 1 (rows 274,382 and 280,143 respectively).

Visualising box plots to see the difference in class distributions for various features.
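A minimal sketch of how such per-class box plots can be drawn with seaborn (the feature subset and styling here are illustrative assumptions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Box plots of a few PCA components, split by Class (0 = normal, 1 = fraud)
features = ['V10', 'V14', 'V3', 'V5']  # illustrative subset
fig, axes = plt.subplots(1, len(features), figsize=(16, 4))
for ax, col in zip(axes, features):
    sns.boxplot(x='Class', y=col, data=df, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```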

V10 and V14 have some visible distribution spreads and outliers for fraudulent transactions (Class 1), while the rest are less obvious.

Removing Outliers:

I had an internal debate about whether or not to remove outliers. After an incredible amount of reading on class imbalance problems, I came to the conclusion that we don’t want information loss that could affect the accuracy of our models going forward. Changing the thresholds of the box plots above from the standard 25th and 75th percentile range can alter the way we analyse ‘extreme’ cases, which could be key in discovering fraudulent transactions.

Modelling:

I set the target variable, Class, as y and the rest of the columns as our model inputs, X. Then I split the data into train and test sets, holding out 25% for testing and using the remaining data to train our models. Finally, I standardised the features because I believe it’s good practice (research into the consequences of standardising PCA variables did not reveal any glaring issues for machine learning models, so I went ahead with this decision).
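In code, that split and scaling looks something like the sketch below (stratify and random_state are my own additions for reproducibility, not necessarily what was used originally):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Target and model inputs
y = df['Class']
X = df.drop(columns='Class')

# 25% held out for testing; stratify so the tiny fraud class shows up in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Standardise the features, fitting the scaler on the training set only to avoid leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```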

Next, I ran the following models on our data:

1- LogisticRegression

2- DecisionTreeClassifier

3- KNeighborsClassifier

4- RandomForest

The baseline score for our data is:

0    0.998273   (Normal transactions)
1    0.001727   (Fraudulent transactions)
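For reference, that baseline is just the class proportions of the target column:

```python
# Baseline: proportion of each class in the target
df['Class'].value_counts(normalize=True)
# 0    0.998273
# 1    0.001727
```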

It should be clear at this point that we will not be assessing our models on accuracy scores, but will instead focus on recall (sensitivity).

Recall, Sensitivity and Cost-Sensitive Classification:

A quick note.

When working with imbalanced data it is often good procedure to discount accuracy and focus instead on recall (sensitivity).

Why?

Well, let’s say our model reaches 99.8% accuracy which, as a score, sounds fantastic. However, that’s essentially our baseline: a model that predicts ‘not fraud’ every single time would already classify 99.8273% of transactions correctly. This is why we focus on recall (sensitivity).

Recall is the number of correctly classified positive cases divided by the total number of actual positives in that class. In terms of this project, recall tells us what share of fraudulent transactions we correctly identify as fraudulent. That’s the goal.
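A toy example makes the difference between accuracy and recall obvious (made-up numbers, not project results):

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy data: 1 fraudulent case hidden among 1,000 transactions
y_true = [0] * 999 + [1]
y_pred = [0] * 1000            # a model that never predicts fraud

print(accuracy_score(y_true, y_pred))   # 0.999 -> looks great
print(recall_score(y_true, y_pred))     # 0.0   -> catches no fraud at all
```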

Another key area of importance when working with classification models and imbalanced datasets is understanding the cost of correctly or incorrectly classifying a positive class (fraud).

Would we rather our model err on the side of flagging too many transactions as fraudulent, or too few? At this point, I can assume that a good percentage of us have experienced that phone call from our bank asking us if a recent transaction is fraudulent or not. That is because banks are much more lenient with False Positives than False Negatives. Intuitively, it makes perfect sense, doesn’t it? It would cost the bank a lot more in safeguarding user accounts if their model skewed towards False Negatives rather than False Positives. It is a simple and fairly cheap process for a bank to call up a customer and validate a transaction, while it is quite costly to reverse a transaction on a credit card and potentially send the customer a new card altogether.

Handling Imbalanced Data:

There are many ways to handle imbalanced datasets. Below are a few examples of strategies that I had considered:

1- SMOTE (Synthetic Minority Oversampling Technique)

2- Oversampling the minority class

3- Undersampling the majority class

There are many advantages and disadvantages to each sampling technique above. One of SMOTE’s biggest issues is that it does not take neighbouring examples from other classes into consideration, so synthetic minority samples can end up in regions that overlap with the majority class. Oversampling can lead to overfitting since it simply replicates the minority class, and undersampling can throw away vital information.
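For completeness, this is roughly how those three options would be wired in with the imbalanced-learn package; it’s a sketch of the alternatives rather than what I ended up doing, and resampling should only ever touch the training split:

```python
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# 1) SMOTE: synthesise new minority samples by interpolating between neighbours
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)

# 2) Plain oversampling: duplicate minority rows until the classes are balanced
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

# 3) Undersampling: drop majority rows until the classes are balanced
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
```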

Instead of working through the pros and cons of each, I decided to try something I had previously attempted with the Ames Housing Dataset: setting the ‘class_weight’ parameter to ‘balanced’ for the models that support it. This weights each class inversely proportional to its frequency, so errors on the rare fraud class are penalised far more heavily during training.
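In scikit-learn terms, that amounts to something like this (a sketch; note that KNeighborsClassifier does not expose a class_weight parameter, so it stays as-is):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# 'balanced' re-weights each class inversely to its frequency in the training data
logreg = LogisticRegression(class_weight='balanced', max_iter=1000)  # max_iter raised so the solver converges (my addition)
tree   = DecisionTreeClassifier(class_weight='balanced')
forest = RandomForestClassifier(class_weight='balanced')

# KNeighborsClassifier has no class_weight parameter
knn = KNeighborsClassifier()
```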

Results:

I created a function I have been using for the longest time:
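The original helper isn’t reproduced here, but a minimal sketch of that kind of function, scoring on recall via GridSearchCV and reporting the confusion matrix, might look like this:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix

def fit_and_score(model, params, X_train, X_test, y_train, y_test):
    """Grid search a model on recall, then report how it does on the test set."""
    grid = GridSearchCV(model, params, scoring='recall', cv=5, n_jobs=-1)
    grid.fit(X_train, y_train)
    preds = grid.best_estimator_.predict(X_test)
    print(grid.best_params_)
    print(confusion_matrix(y_test, preds))
    print(classification_report(y_test, preds))
    return grid.best_estimator_
```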

Add parameters for GridSearch:
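Again a sketch; these grid values are hypothetical, not the exact ones I used:

```python
# Hypothetical parameter grids for each model -- illustrative values only
param_grids = {
    'logreg': {'C': [0.01, 0.1, 1, 10]},
    'tree':   {'max_depth': [3, 5, 10, None]},
    'knn':    {'n_neighbors': [3, 5, 11]},
    'forest': {'n_estimators': [100, 200], 'max_depth': [5, 10, None]},
}
```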

Logistic Regression:

Logistic Regression always amazes me with its flexibility and performance on classification problems. As expected, it also yielded some fairly decent results.

A great result for minimising False Negatives, but a large number of False Positives. Is that you, Barclays?

Decision Tree Classifier:

DecisionTree worked slightly better than my LogisticRegression when dealing with False Positives but also increased my False Negative results by ~300%.

Lower recall score than Logistic Regression

KNeighbors Classifier:

A model that seems to only focus on accuracy rather than recall. It predicted 0 fraudulent transactions correctly.

Random Forest:

Random Forest seems to have the cleanest results with high precision and, more importantly, recall scores. However, the recall score does not compete with the Logistic Regression.

23 too many False Negatives. And 12 more than our best model.

After this first iteration of results, I wanted to explore Logistic Regression further and optimise it to the best of my ability. That led me to expand the parameters for our GridSearch and see whether that would make any significant difference to the results.
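An expanded grid for that second pass might look something like this (illustrative values only, reusing the fit_and_score sketch from earlier):

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical expanded grid -- not the exact values from the project
expanded_params = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],   # liblinear supports both l1 and l2 penalties
}

best_logreg = fit_and_score(
    LogisticRegression(class_weight='balanced', max_iter=1000),
    expanded_params, X_train, X_test, y_train, y_test)
```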

Logistic Regression (after further optimisation):

Only 9 False Negatives now across the whole training set. I still have a large number of False Positives, but without a true understanding of what it costs a bank to assess a flagged transaction (correctly or incorrectly), my first impression is that the newly optimised Logistic Regression model is the way forward.

Conclusion:

Fraud detection is an area of data science you can go as deep into as you like in pursuit of near-perfect detection. I could have used other models, explored SMOTE, other sampling techniques and further feature engineering to improve results and recall scores.

The aim of this project was to illustrate the ways in which one can tackle imbalanced datasets and keep an eye on ‘real world’ consequences of shifting focus from accuracy to recall.

To put it quite simply, next time I get a call from any of my banks, I will make sure to ask them more questions about why they called me and assumed a certain transaction was fraud. I invite you to do the same. Just send them this article, I’m sure that won’t piss them off.. right?

— — —

GitHub | LinkedIn
