Machine Learning Case Study: Credit Card Fraud Detection

In this article, I demonstrate how to work through a machine learning problem end to end.

Puja P. Pathak
CodeX
7 min read · Feb 12, 2022


Photo credits — Stephen Phillips on Unsplash

It would not be wrong to say that payment cards like credit and debit cards have become a lifeline for almost all of us. Credit cards are an attractive spending option because you spend the money now but technically pay for it later. With the Buy Now Pay Later phenomenon gaining traction, the number of customers adopting this method of spending is increasing day by day.

Plus, these spending options come with attractive cash backs and credit points, which you can convert to cash or use to buy things. Moreover, the competition in this segment has become intense, with new FinTech players introducing many agile and hassle-free spending options.

But with the ease of spending comes responsibility, and a whole lot of scenarios where your payment cards become susceptible to fraud. When money on a payment card is stolen by unfair or dishonest means, it is called fraud. Common ways of committing this fraud include skimming machines, tampering with genuine cards, creating counterfeit cards, stealing cards, marketing fraud and so on.

With the help of machine learning models, fraud triggers can be identified and trackers can be built. These trackers can catch such frauds and help mitigate them in real time.

This is an interesting machine learning case study in which I will demonstrate how a real-world problem like credit card fraud can be detected and what its triggers are. By identifying such triggers, banks and financial institutions can put checks in the right places, which may help them reduce the number of fraud occurrences.

So let’s dive straight into the data set. The dataset was taken from Kaggle. All the details and code are available on my GitHub profile via the link given at the end of this article.

We need to develop a machine learning model to detect fraudulent transactions based on the historical transactional data of customers with a pool of merchants.

Flow of Analysis

1. Import the required libraries
2. Read and understand the data
3. Data clean-up and preparation
4. Exploratory Data Analysis
5. Model building
6. Hyperparameters Tuning

1. Import the required libraries
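Below is a minimal sketch of the libraries this analysis relies on; the exact imports in my notebook may differ slightly, and plotting or utility imports can be added as needed.

```python
# Core libraries assumed for this analysis: data handling, plotting,
# scikit-learn for modelling, imbalanced-learn for resampling, and XGBoost.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler, PowerTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score

from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from xgboost import XGBClassifier
```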

2. Read and understand the data

The data is provided as train and test sets separately.

The same columns are present in both the train and test sets.

Our target variable is is_fraud.
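A minimal sketch of this step, assuming the file names from the Kaggle download (fraudTrain.csv and fraudTest.csv); adjust the paths to your local copy.

```python
import pandas as pd

# Read the separately provided train and test sets (file names assumed)
train = pd.read_csv("fraudTrain.csv")
test = pd.read_csv("fraudTest.csv")

print(train.shape, test.shape)                 # size of each set
print(train.columns.equals(test.columns))      # True: same columns in both
print(train["is_fraud"].value_counts())        # distribution of the target
```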

3. Data Cleaning and Preparation

On analysing the train and test sets, it was found that both data sets are imbalanced. So I concatenated them into a master data set, which I call ‘data’.

The data set is clean, meaning there are no missing values. I dropped some unwanted columns such as ‘Unnamed: 0’, ‘cc_num’, ‘street’, ‘zip’, ‘trans_num’ and ‘unix_time’. I also added certain columns derived from the existing data, such as the distance between the customer location and the merchant location, calculated from the customer and merchant latitude and longitude co-ordinates. In addition, I formed buckets for various numeric columns, like an age_group column from the calculated age and population groups from city_pop, and extracted hour, day, month and year information from the trans_date_trans_time column.
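The sketch below illustrates these steps under a few assumptions: the haversine helper, the distance_km column name and the bucket edges are illustrative choices, not necessarily the exact ones used in the original notebook.

```python
import numpy as np
import pandas as pd

# Combine the two provided sets into one master data set
data = pd.concat([train, test], axis=0, ignore_index=True)

# Drop identifier-like columns that do not help the model
data = data.drop(columns=["Unnamed: 0", "cc_num", "street", "zip",
                          "trans_num", "unix_time"])

# Haversine distance (km) between customer and merchant coordinates
def haversine(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

data["distance_km"] = haversine(data["lat"], data["long"],
                                data["merch_lat"], data["merch_long"])

# Age at transaction time and date/time features
data["trans_date_trans_time"] = pd.to_datetime(data["trans_date_trans_time"])
data["dob"] = pd.to_datetime(data["dob"])
data["age"] = (data["trans_date_trans_time"] - data["dob"]).dt.days // 365
data["hour"] = data["trans_date_trans_time"].dt.hour
data["day"] = data["trans_date_trans_time"].dt.day_name()
data["month"] = data["trans_date_trans_time"].dt.month
data["year"] = data["trans_date_trans_time"].dt.year

# Buckets for age and city population (bucket edges are illustrative)
data["age_group"] = pd.cut(data["age"], bins=[0, 25, 40, 60, 120],
                           labels=["<25", "25-40", "40-60", "60+"])
data["pop_group"] = pd.cut(data["city_pop"],
                           bins=[0, 10_000, 100_000, 1_000_000, 10_000_000],
                           labels=["small", "medium", "large", "metro"])
```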

4. Exploratory Data Analysis

is_fraud variable class distribution

As we can see, the dataset is imbalanced, with 0.52% of transactions being fraudulent (represented by 1) and 99.48% transactions being non-fraudulent (represented by 0).

I performed univariate, bivariate and multivariate analysis on the data. Following are some of the graphs from the EDA.

Univariate Analysis
Bivariate Analysis
Multivariate Analysis

Checking skewness in the data: in univariate analysis, it was found that the ‘amt’ column is highly skewed. So I used the power transform method to normalize it.

amt column in first image
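A small sketch of this normalisation using scikit-learn’s PowerTransformer (Yeo-Johnson by default); it is shown here on the combined data set for simplicity, though in a strict workflow the transformer would be fit on the training portion only.

```python
from sklearn.preprocessing import PowerTransformer

# Normalise the skewed 'amt' column with a power transform (Yeo-Johnson)
pt = PowerTransformer()
data["amt"] = pt.fit_transform(data[["amt"]])

print(data["amt"].skew())  # skewness should now be much closer to 0
```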

5. Model building

A train-test split is performed in the ratio 70:30 to create the train and test sets.

Dummy variables are created for categorical columns.

MinMaxScaler is used to perform scaling.

I used a random forest to select the most important features; 19 features were selected.
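A sketch of the split, encoding, scaling and feature-selection steps; the random_state values and the way the top 19 features are picked (by impurity-based importance) are assumptions for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Dummy-encode categorical columns and separate the target
X = pd.get_dummies(data.drop(columns=["is_fraud"]), drop_first=True)
y = data["is_fraud"]

# 70:30 train-test split, stratified so both sets keep the ~0.52% fraud rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42, stratify=y)

# Min-max scaling, fit on the training set only
scaler = MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

# Rank features by random forest importance and keep the top 19
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(19).index
X_train, X_test = X_train[top_features], X_test[top_features]
```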

Now, since the data is imbalanced, we need to use certain sampling techniques in order to balance classes 0 and 1 in the target variable is_fraud.

We can handle imbalanced classes by increasing the minority class or decreasing the majority class. This can be done using the following techniques:

  1. Random Under-Sampling
  2. Random Over-Sampling
  3. SMOTE — Synthetic Minority Oversampling Technique
  4. ADASYN — Adaptive Synthetic Sampling Method
  5. SMOTETomek — Over-sampling followed by under-sampling

Undersampling reduces the number of data points in order to achieve class balance, which results in loss of information, so it is usually not preferred. I used Random Over-Sampling, SMOTE (Synthetic Minority Oversampling) and ADASYN (Adaptive Synthetic Sampling).
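A sketch of the three oversampling techniques using the imbalanced-learn package; each sampler is applied to the training set only, so the test set keeps its natural class distribution.

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

# Three oversampling techniques, each applied to the training set only
samplers = {
    "random_oversampling": RandomOverSampler(random_state=42),
    "smote": SMOTE(random_state=42),
    "adasyn": ADASYN(random_state=42),
}

resampled = {}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    resampled[name] = (X_res, y_res)
    print(name, y_res.value_counts().to_dict())  # both classes now roughly equal
```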

I selected 4 models, starting with simple Logistic Regression, then Decision Trees, and finally ensemble techniques: XGBoost and Random Forest.

First I built the base models for all 4 of the above-mentioned types. Then I built each of the 4 models using the various oversampling techniques.

Random Oversampling
SMOTE
ADASYN

4 base models + (3 sampling techniques X 4 machine learning models) = 16 models
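A sketch of how these 16 models can be produced and compared: the 4 base models are trained on the original (imbalanced) training set, and the same 4 model types are then trained on each of the 3 resampled sets, recording accuracy and recall for each. The hyperparameter values here are library defaults, not the tuned settings.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from xgboost import XGBClassifier

# 4 model families; default hyperparameters for the base comparison
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "xgboost": XGBClassifier(eval_metric="logloss", random_state=42),
    "random_forest": RandomForestClassifier(random_state=42, n_jobs=-1),
}

# 'base' is the original imbalanced training set; 'resampled' holds the
# random-oversampled, SMOTE and ADASYN versions built in the previous step
training_sets = {"base": (X_train, y_train), **resampled}

results = []
for set_name, (X_tr, y_tr) in training_sets.items():
    for model_name, model in models.items():
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_test)
        results.append({
            "sampling": set_name,
            "model": model_name,
            "accuracy": accuracy_score(y_test, y_pred),
            "recall": recall_score(y_test, y_pred),  # recall for class 1 (fraud)
        })

print(pd.DataFrame(results).sort_values("recall", ascending=False))
```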

The code for all the models is available in the Python file on my GitHub profile.

The evaluation parameters given by all the 16 models are summarized as follows:

Since we need to focus on identifying the fraudulent cases, our focus is on identifying the cases with class 1. So the models that give us the best combination of accuracy and recall are to be selected. If we compare all the executed models as in the image above, we get a decent combination of accuracy and recall for the following models:

  • Logistic Regression SMOTE model
  • Decision Tree SMOTE model
  • XGBoost ADASYN model
  • Random Forest ADASYN model

Hence, we fine-tune these models and extract the 10 most important features, which are the best predictors of fraud.

6. Hyperparameters Tuning

I used the RFE (Recursive Feature Elimination) technique with the Logistic Regression SMOTE model. Hyperparameter tuning with cross-validation was used for the Decision Tree SMOTE, XGBoost ADASYN and Random Forest ADASYN models.

Now, there are two options for this: GridSearchCV and RandomizedSearchCV.

GridSearchCV is quite time-consuming, as it goes through every combination in the given hyperparameter grid. RandomizedSearchCV, on the other hand, randomly samples hyperparameter combinations. So I used RandomizedSearchCV.
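A sketch of the tuning step; the parameter ranges, n_iter and cv values below are illustrative rather than the exact grids from the original notebook.

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# SMOTE and ADASYN training sets built earlier
X_smote, y_smote = resampled["smote"]
X_adasyn, y_adasyn = resampled["adasyn"]

# RFE with the Logistic Regression SMOTE model: keep the 10 best features
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X_smote, y_smote)
print("RFE-selected features:", list(X_smote.columns[rfe.support_]))

# RandomizedSearchCV for the Random Forest ADASYN model (illustrative grid)
param_distributions = {
    "n_estimators": [100, 200, 400],
    "max_depth": [5, 10, 20, None],
    "min_samples_leaf": [1, 5, 10],
    "max_features": ["sqrt", "log2"],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_distributions=param_distributions,
    n_iter=10,
    scoring="recall",   # recall on the fraud class drives model selection
    cv=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_adasyn, y_adasyn)
print(search.best_params_, search.best_score_)
```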

BEST Models :

Evaluation Metrics for XGBOOST ADASYN model
ROC and feature importance given by tuned XGBOOST ADASYN model
Evaluation Metrics for Random Forest ADASYN model
ROC and feature importance given by tuned Random Forest ADASYN model

As we can see, XGBOOST ADASYN model and Random Forest ADASYN model give us the best results.

Based on the important features given by each of the two best models above, the following points can be concluded:

  • If the transaction amount seems unusual, the transaction should be checked for authenticity
  • All transactions performed during odd hours of the day, such as after 10 pm or 11 pm, need checking
  • The maximum number of fraudulent transactions take place under the category gas_transport
  • The categories shopping_net, shopping_pos, grocery_pos, misc_net and misc_pos also carry a high likelihood of fraudulent transactions
  • Considering all the above factors, frauds are mostly likely to be performed by male customers

These form the important triggers for fraud identification, and using these triggers, techniques can be built to mitigate frauds in real time.

This completes the analysis. I hope the article was informative and easy to understand, and that you enjoyed the colourful graphs included in the analysis.

Do feel free to comment and give your feedback.

You can connect with me on LinkedIn: https://www.linkedin.com/in/pathakpuja/

Please visit my GitHub profile for the python codes. The code mentioned in the article, as well as the graphs, can be found here:
