Credit Card Fraud Detection with Machine Learning in Python

Using kNN, Logistic Regression, Random Forest, Decision Tree, SVM, XGBoost, and LightGBM

Ogunbajo Adeyinka
6 min read · Jul 25, 2021
Photo by Nahel Abdul Hadi on Unsplash

“A collection of operations performed to prevent money or property from being acquired under false pretences is known as fraud detection.”

Fraud may be committed in a variety of ways and across a wide range of sectors. To make a judgment, the majority of detection techniques integrate several fraud detection datasets to provide a linked picture of both legitimate and invalid payment data.

In 2017, 16.7 million people were victims of unauthorized card activity. According to the Federal Trade Commission (FTC), the number of credit card fraud claims in 2017 was 40 per cent greater than the previous year. Around 13,000 instances were recorded in California and 8,000 in Florida, the two states with the highest per capita rates of this sort of crime. By 2020/21, the stakes were expected to rise to over $30 billion.

Credit Card Fraud Detection using Machine Learning is a method in which a Data Science team investigates data and develops a model to uncover and prevent fraudulent transactions. This is accomplished by combining all relevant aspects of cardholder transactions, such as Date, User Zone, Product Category, Amount, Provider, Client Behavioural Patterns, and so on. The data is then fed into a trained model that finds patterns and rules so that it can classify whether a transaction is fraudulent or legitimate.

Drawbacks of Conventional Fraud Detection:

  • It takes a long time to complete.
  • Multiple means of verification are required, which is cumbersome for the user.
  • Only detects fraud that is evident.
  • Decision-making rules for identifying fraud schemes must be established manually.

Why Machine Learning-based Fraud Detection:

  • Detecting fraud automatically
  • Real-time streaming
  • Less time is needed for verification methods
  • Identifying hidden correlations in data

Basis

For this project, let’s assume you’ve been employed by a financial institution (e.g. a commercial bank) to help customers be certain that they will not be charged for things they did not buy, by detecting possible fraudulent activity. You’re provided with a dataset of people’s transactions along with information on whether or not each is fraudulent, and you’re expected to distinguish between them. Today, that will be our quest.

Project Strategy

The project steps are shown below. In this article, I will share the short Python scripts used to achieve some of the results from my analysis.

Credit Fraud Detection Project Flowchart (Image developed by Ogunbajo Adeyinka using Microsoft PowerPoint)

Importing Libraries

  • We will be using Python libraries including Pandas (data cleaning/manipulation), NumPy (working with arrays), Matplotlib (data visualization), and StandardScaler from scikit-learn (data normalization).
  • Before getting started, let’s import the libraries needed for the project using the Python script below:
Python Script to Import the Python Libraries needed (Code by Ogunbajo Adeyinka)
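A minimal sketch of those imports, assuming the conventional aliases (np, pd, plt):

# Core libraries for data handling, visualization, and preprocessing.
import numpy as np                                     # working with arrays
import pandas as pd                                    # data cleaning/manipulation
import matplotlib.pyplot as plt                        # data visualization
from sklearn.preprocessing import StandardScaler       # data normalization
from sklearn.model_selection import train_test_split   # data splitting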

Exploratory Data Analysis and Data Processing:

  • As I mentioned in my previous article, Exploratory Data Analysis (EDA for short) helps you find missing values, correlated features, and the different trends that exist in the dataset.
  • Below is a Python script to import the dataset into the Jupyter notebook environment:
Python Script to Import the Credit Card dataset needed for this Project (Code by Ogunbajo Adeyinka)
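A minimal sketch, assuming the dataset (the widely used Kaggle credit card fraud CSV) is saved locally as creditcard.csv:

# Load the transactions into a pandas DataFrame and preview the first rows.
df = pd.read_csv('creditcard.csv')
df.head()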
  • We have an overview of what our data looks like below:
Pandas DataFrame Showing a Credit Card sample dataset (Screenshot from Jupyter Notebook written by Ogunbajo Adeyinka)

Now we have our data in a suitable environment.

  • It comprises features V1 through V28, which are the principal components derived using PCA. We are going to disregard the ‘Time’ feature, which is of no value for building the models. The remaining features are ‘Amount’, which provides the total amount of money being transacted, and ‘Class’, which indicates whether or not the transaction is a fraud case.

Now, we will explore:

  • Number of Cases.
  • Number of non-fraudulent Cases.
  • Number of Fraudulent Cases.
  • Percentage of Fraudulent Cases.

Below is a Python script to explore the dataset in the Jupyter notebook environment:

Python Script to Explore the Number/Nature of Cases in the Credit Card sample dataset (Code by Ogunbajo Adeyinka)
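A minimal sketch of those counts, using the ‘Class’ column (1 = fraud, 0 = non-fraud):

# Count total, non-fraudulent, and fraudulent cases,
# plus the fraud share of the whole sample.
total_cases = len(df)
fraud_count = len(df[df['Class'] == 1])
nonfraud_count = len(df[df['Class'] == 0])
fraud_percentage = round(fraud_count / total_cases * 100, 2)

print('Total cases:', total_cases)
print('Non-fraudulent cases:', nonfraud_count)
print('Fraudulent cases:', fraud_count)
print('Percentage of fraudulent cases:', fraud_percentage, '%')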
  • We have the output below:
Pandas DataFrame Showing the Number/Nature of Cases in the Credit Card sample dataset (Screenshot from Jupyter Notebook written by Ogunbajo Adeyinka)
  • From the output above, we can see that we have 284,807 samples, with 492 cases being fraudulent, which is about 0.17 per cent of the whole sample. The data is heavily imbalanced, and we therefore need to exercise caution when modelling and evaluating.

Let’s get a statistical view of both fraud and non-fraud transaction amounts using the ‘describe’ method in pandas.

Below is a Python script to statistically describe the dataset in the Jupyter notebook environment:

Python Script to describe the Credit Card sample dataset Statistically (Code by Ogunbajo Adeyinka)
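A minimal sketch, splitting on ‘Class’ and describing the ‘Amount’ column of each group:

# Statistical summary of transaction amounts, per class.
fraud = df[df['Class'] == 1]
nonfraud = df[df['Class'] == 0]

print(fraud['Amount'].describe())
print(nonfraud['Amount'].describe())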
  • We have the output below:
Pandas DataFrame Showing the Statistical description of the features of the Credit Card sample dataset (Screenshot from Jupyter Notebook written by Ogunbajo Adeyinka)

Data Normalization

Looking through the statistical description of the features of the credit card dataset, it is seen that the values in the ‘Amount’ variable vary enormously compared to the rest of the variables. To reduce this wide range of values, we can normalize it using the ‘StandardScaler’ method in scikit-learn.

Below is a Python script to reduce this wide range of values in the sample dataset:

Python Script to normalize the wide range of values of the Credit Card sample dataset (Code by Ogunbajo Adeyinka)
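A minimal sketch, standardizing ‘Amount’ in place to zero mean and unit variance:

# StandardScaler expects a 2D array, hence the reshape;
# ravel() flattens the scaled result back into a 1D column.
scaler = StandardScaler()
df['Amount'] = scaler.fit_transform(df['Amount'].values.reshape(-1, 1)).ravel()
df['Amount'].head()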
  • We have the output below:
Pandas DataFrame Showing the normalized wide range of values of the Credit Card sample dataset (Screenshot from Jupyter Notebook written by Ogunbajo Adeyinka)

Feature Extraction and Data Splitting

Here, we will split the data into a training set and a testing set, which are then used for modelling and evaluation. We can split the data easily using the ‘train_test_split’ function from scikit-learn.

Below is a Python script to split the dataset in the Jupyter notebook environment:

Python Script to split the Credit Card sample dataset (Code by Ogunbajo Adeyinka)
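A minimal sketch; the 80/20 split ratio and the fixed random seed are assumptions, since the article does not state them:

# Features: drop the disregarded 'Time' column and the 'Class' target.
X = df.drop(columns=['Time', 'Class'])
y = df['Class']

# Assumed 80/20 train/test split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)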
  • We have the output below:
Pandas DataFrame Showing the split Credit Card sample dataset (Screenshot from Jupyter Notebook written by Ogunbajo Adeyinka)

Building Model

We will be building seven (7) different types of classification models, namely Decision Tree, K-Nearest Neighbors (KNN), Logistic Regression, Support Vector Machine (SVM), Random Forest, XGBoost, and LightGBM.

Now, let’s get to work by implementing these models in Python using the script below:

Python Script to build the models for the project (Code by Ogunbajo Adeyinka)
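A minimal sketch of the seven classifiers; the hyperparameters (library defaults, plus a raised iteration cap for Logistic Regression) are assumptions, as the article does not state them:

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Fit each model on the training set and collect its test-set predictions.
models = {
    'KNN': KNeighborsClassifier(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'SVM': SVC(),
    'XGBoost': XGBClassifier(),
    'LightGBM': LGBMClassifier(),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)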

Evaluation of Model

Our main objective in this process is to find the best model for our given case. The evaluation metric we are going to use is the F1 score, which is better suited to our imbalanced data than plain accuracy.
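A minimal sketch, scoring the predictions collected above with scikit-learn’s f1_score:

from sklearn.metrics import f1_score

# F1 score of each model on the held-out test set, best first.
scores = pd.DataFrame(
    [(name, f1_score(y_test, pred)) for name, pred in predictions.items()],
    columns=['Model', 'F1 score'],
).sort_values('F1 score', ascending=False)
print(scores)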

Pandas DataFrame Showing the evaluation of each model using f1_score metrics(Screenshot from Jupyter Notebook written by Ogunbajo Adeyinka)

Final Thoughts

We will observe from our output in the pandas DataFrame above that we have:

  • The KNN Model as the best performing model, with an F1 score of approximately 86%.
  • The XGBoost Model as the best performing model after the KNN Model, with approximately 85%.
  • The Decision Tree Model as the best performing model after the XGBoost Model, with approximately 81%.
  • The LightGBM Model as the best performing model after the Decision Tree Model, with approximately 80%.
  • The SVM Model as the best performing model after the LightGBM Model, with approximately 77%.
  • The Random Forest Model as the best performing model after the SVM Model, with approximately 76%.
  • The Logistic Regression Model as the best performing model after the Random Forest Model, with approximately 74%.

Thanks for taking the time to read this article. You can read more articles by going to my profile (more will be available soon).

Remarks

All the references used are hyperlinked within the article. To see the complete Python code written on Jupyter Notebook and GitHub, and to visit my social media pages, kindly use the links below:


Ogunbajo Adeyinka

Artificial Intelligence 🤖 | Data Science 🔬 📈 | Product Management 🎨