Credit Card Fraud Detection

Sandip Palit
Oct 15, 2022

Credit Card Fraud Detection is the process of identifying purchase attempts that are fraudulent and rejecting them rather than processing the order.

It enables credit card companies to recognize fraudulent credit card transactions so that customers are not charged for items they did not purchase.

In this project, we achieved an accuracy of 99.75%. Follow this blog to learn how to develop the Credit Card Fraud Detection script from scratch.

Exploring the Dataset

For this project, we used the Credit Card Fraud Detection dataset from Kaggle. The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. The feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. The feature ‘Class’ is the response variable, and it takes value 1 in case of fraud and 0 otherwise.

For more information, kindly navigate to this link.

Developing the Script

Initially, we will import the necessary libraries for this script (a sample import block follows the list below).
numpy: The fundamental package for scientific computing with Python.
pandas: A fast, powerful, flexible data analysis and manipulation tool.
matplotlib: For creating static, animated, and interactive visualizations.
sklearn: Simple and efficient tools for predictive data analysis.
tabulate: Pretty-print tabular data in Python.
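Assuming this stack, the import block at the top of the script might look like the following (the exact set of imports is an assumption based on the models used later in the script):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import accuracy_score
from tabulate import tabulate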

We read the creditcard.csv file and load it into the pandas dataframe. We can use df.head() to view the data.
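A minimal sketch of this step, assuming creditcard.csv sits in the working directory:

df = pd.read_csv('creditcard.csv')   # load the Kaggle dataset into a DataFrame
print(df.head())                     # preview the first five rows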

We used df.describe() to generate descriptive statistics, which include those that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

We used df.groupby(['Class']).size() to get the counts for all the distinct values of the Class column. Here, Class 0 means Normal Transaction and Class 1 means Fraud Transaction.

Class
0 284315
1 492
dtype: int64

We divided the whole Amount column range into several bins, and then displayed the number of transactions in each bin for both Normal and Fraud transactions. From the plot shown below, we can conclude that the amount of fraud transactions mainly lies between $0 and $2000.
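A rough sketch of how this binning and plotting can be done (the bin edges and plot styling here are assumptions, not necessarily the original notebook's values):

bins = [0, 500, 1000, 1500, 2000, 2500, 3000, df['Amount'].max()]      # assumed bin edges
amount_bins = pd.cut(df['Amount'], bins=bins, include_lowest=True)     # bin each transaction amount

# count transactions per bin, separately for Normal (0) and Fraud (1)
counts = df.groupby(['Class', amount_bins]).size().unstack(level=0)
counts.plot(kind='bar', logy=True)   # log scale, since the classes are heavily imbalanced
plt.xlabel('Transaction Amount bin')
plt.ylabel('Number of transactions')
plt.show()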

We stored the names of the input columns in the columnNames variable, dropping the last column since it is the output column. The contamination variable contains the estimated proportion of outliers in the dataset.
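In code, this could look roughly as follows; taking contamination as the observed fraud ratio is an assumption about how it was estimated:

columnNames = list(df.columns[:-1])                   # every column except the last one ('Class')
contamination = (df['Class'] == 1).sum() / len(df)    # observed fraud ratio, ~0.00172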

Now we are moving into the core part of the script, i.e., the Model Prediction part. For this project, we will use two ML models:
~ Isolation Forest
~ Local Outlier Factor

Both of the above-mentioned ML models have the capability to handle imbalanced data. The basic idea behind both of these models is to separate the Outliers from the normal data. Now we will look into each of the Models in detail.

Isolation Forest

The Isolation Forest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

Logic: Isolating an anomalous observation is easier than isolating a normal one, because only a few random splits are needed to separate it, so anomalies end up with noticeably shorter paths (fewer branches) in the trees. Since the score assigned to an observation grows with the number of branches needed to isolate it, anomalous observations receive much lower scores than normal ones.

Some of the important hyper-parameters are listed below:
n_estimators: The number of base estimators in the ensemble.
max_samples: The number of samples to draw from X to train each base estimator.
contamination: The amount of contamination of the data set, i.e. the proportion of outliers in the data set.
random_state: Controls the pseudo-randomness of the selection of the feature and split values for each branching step and each tree in the forest.

We initialized the IsolationForest() object with the necessary hyper-parameters and stored it in cIF. Then we called fit_predict() on the input columns to fit the model and predict a label for each transaction, which we stored in the Class_IF column. Here, Class 1 means Normal Transaction and Class -1 means Fraud Transaction.

We used abs(df['Class_IF']-1)//2 to convert Class 1 to Class 0 and Class -1 to Class 1, so that it matches the convention of the output column.
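Putting these Isolation Forest steps together, a sketch along these lines (the hyper-parameter values are illustrative, not the exact ones used in the notebook):

cIF = IsolationForest(n_estimators=100,
                      max_samples=len(df),
                      contamination=contamination,
                      random_state=42)               # illustrative hyper-parameter values

df['Class_IF'] = cIF.fit_predict(df[columnNames])    # 1 = normal, -1 = outlier
df['Class_IF'] = abs(df['Class_IF'] - 1) // 2        # map 1 -> 0 (normal), -1 -> 1 (fraud)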

We used accuracy_score() to find the accuracy of the Isolation Forest algorithm, which is 99.748%. Then we used groupby() to verify the counts for all the distinct values of the Class_IF column.

Accuracy (IF):  0.9974825056968403
Class_IF
0 284314
1 493
dtype: int64

Local Outlier Factor

Local Outlier Factor measures the local deviation of the density of a given sample with respect to its neighbors. Here, the Euclidean distance is used to calculate the distances between samples.

Logic: The local density of anomalous observations will be much lower than that of normal observations.

Some of the important hyper-parameters are listed below:
n_neighbors: Number of neighbors to use by default for kneighbors queries.
leaf_size: Leaf size passed to BallTree or KDTree. The optimal value depends on the nature of the problem.
contamination: The amount of contamination of the data set, i.e. the proportion of outliers in the data set.

We initialized the LocalOutlierFactor() object with the necessary hyper-parameters and stored it in cLOF. Then we called fit_predict() on the input columns to fit the model and predict a label for each transaction, which we stored in the Class_LOF column. Here, Class 1 means Normal Transaction and Class -1 means Fraud Transaction.

We used abs(df['Class_LOF']-1)//2 to convert Class 1 to Class 0 and Class -1 to Class 1, so that it matches the convention of the output column.
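The Local Outlier Factor steps follow the same pattern (again, the hyper-parameter values are illustrative):

cLOF = LocalOutlierFactor(n_neighbors=20,
                          leaf_size=30,
                          contamination=contamination)   # illustrative hyper-parameter values

df['Class_LOF'] = cLOF.fit_predict(df[columnNames])      # 1 = normal, -1 = outlier
df['Class_LOF'] = abs(df['Class_LOF'] - 1) // 2          # map to 0 (normal) / 1 (fraud)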

We used accuracy_score() to find the accuracy of the Local Outlier Factor algorithm, which is 99.659%. Then we used groupby() to verify the counts for all the distinct values of the Class_LOF column.

Accuracy (LOF):  0.9965906736842844
Class_LOF
0 284314
1 493
dtype: int64

Finally, we compare the accuracy scores of both models in a table structure, using tabulate().
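A small sketch of this comparison step:

acc_if = accuracy_score(df['Class'], df['Class_IF'])     # Isolation Forest accuracy
acc_lof = accuracy_score(df['Class'], df['Class_LOF'])   # Local Outlier Factor accuracy

table = [['Isolation Forest', acc_if],
         ['Local Outlier Factor', acc_lof]]
print(tabulate(table, headers=['Algorithm', 'Accuracy'], tablefmt='fancy_grid'))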

╒══════════════════════╤════════════╕
│ Algorithm │ Accuracy │
╞══════════════════════╪════════════╡
│ Isolation Forest │ 0.997483 │
├──────────────────────┼────────────┤
│ Local Outlier Factor │ 0.996591 │
╘══════════════════════╧════════════╛

To look into the entire notebook, kindly navigate to this link.

If you have any doubts, feel free to ping me on LinkedIn.

Best Wishes & Happy Learning!!
