Machine Learning Techniques for Credit Card Fraud Detection

Ryan Kemmer
Published in Analytics Vidhya · Mar 9, 2021

Online fraud is at an all-time high. In our increasingly digital world, it has never been easier for fraudsters to exploit people over the internet. To make matters worse, many consumers have had passwords, credit card numbers, and other sensitive information leaked on the dark web. As a result, accounts are taken over and fraudulent lines of credit are opened every day, and many of these crimes go completely unnoticed.

Credit card fraud is among the most common online scams. Credit card numbers, PINs, and security codes can easily be stolen and used to make fraudulent transactions, resulting in huge financial losses for merchants and consumers. However, credit card companies are ultimately responsible for paying back any losses to their customers. Thus, it is extremely important for credit card companies and other financial institutions to be able to detect fraud before it occurs.

Machine learning has become an increasingly accessible and reliable method to detect fraudulent transactions. Using a historical dataset, a machine learning model can be trained to learn patterns behind fraudulent behavior. A model can then be applied to filter out fraudulent transactions and stop them from occurring in real time.

This post will examine four commonly used machine learning methods for fraud detection. These include:

  • Random Forest
  • CatBoost
  • Deep Neural Network (DNN)
  • Isolation Forest

We will dive into the basics of how to create these models in Python and compare how they perform against one another. Let's get started!

Setup

To begin, we need to import some Python libraries that we will use for data manipulation, modeling, and evaluation.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.metrics import roc_curve, auc
from catboost import CatBoostClassifier
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
import matplotlib as mpl
import matplotlib.pyplot as plt
#configure plot size and colors
mpl.rcParams['figure.figsize'] = (10, 10)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

To evaluate model performance, we will use the Receiver Operating Characteristic (ROC) curve, computed with sklearn's roc_curve and auc functions. This will help us understand how our models perform in terms of predicting true positives and false positives. Generally, institutions want a high true positive rate at a low, fixed false positive rate. Thus, ROC curves are a meaningful way to measure performance for a fraud classifier.

Data Preparation

To evaluate different machine learning methods for fraud detection, we will use the “Credit Card Fraud Detection” dataset that is publicly available on Kaggle (https://www.kaggle.com/mlg-ulb/creditcardfraud). This dataset contains credit card transactions made by European cardholders in 2013. Many of the features in this dataset have been transformed using PCA, because the original data contains sensitive personally identifiable information (PII).

Each transaction is truth-marked as either Fraud or Not Fraud in the column “Class”. Taking a look at class percentages, it is clear that only a tiny fraction of the transactions are fraudulent (0.17%). This makes training a classifier challenging due to the large class imbalance.

#load data
df = pd.read_csv('creditcard.csv')
#drop NULL values
df = df.dropna()
#drop Time column (contains limited useful information)
df = df.drop('Time', axis = 1)
#group data by Class
groups = df.groupby('Class')
fraud = (groups.get_group(1).shape[0] / df.shape[0]) * 100
non_fraud = (groups.get_group(0).shape[0] / df.shape[0]) * 100
#print class percentages
print('Percent Fraud: ' + str(fraud) + '%')
print('Percent Not Fraud: ' + str(non_fraud) + '%')

Next, we will create a train set and a holdout set from our data so we can quickly evaluate how our models perform on brand new, unseen data.

df_size = df.shape[0]
test_size = int(df_size * .3)
train_size = df_size - test_size
train_df = df.head(train_size)
test_df = df.tail(test_size)
X_train = train_df.drop('Class', axis = 1)
Y_train = train_df['Class']
X_test = test_df.drop('Class', axis = 1)
Y_test = test_df['Class']

Finally, we will apply a standard scaler to all of our features so that each has a mean of 0 and a standard deviation of 1. This will help our models learn more efficiently.

We fit our StandardScaler on the train set only, to prevent information from the test set from leaking into the transformation.

#scale each feature using statistics learned from the train set only
for feat in X_train.columns.values:
    ss = StandardScaler()
    X_train[feat] = ss.fit_transform(X_train[feat].values.reshape(-1,1))
    X_test[feat] = ss.transform(X_test[feat].values.reshape(-1,1))

Now that our data is ready to go, it's time to start building some models!

Method 1: Random Forest

The first method we will use to train a fraud classifier is random forest. A random forest is a popular supervised machine learning algorithm that can be used for both classification and regression tasks. The model works by sampling the training dataset, building multiple decision trees, and then having the outputs of those decision trees determine a prediction. Random forests easily handle large, high-dimensional datasets and cope well with categorical features. However, they can suffer from the large class imbalance in our data.

First, let's initialize a basic random forest model and train it on our training data. Then, we will retrieve the predicted probabilities for each data point in our test set.

#create Random Forest Model
rf = RandomForestClassifier()
#fit to training data
rf.fit(X_train, Y_train)
#get class probabilities
probabilities = rf.predict_proba(X_test)
y_pred_rf = probabilities[:,1]

Next, let's calculate some basic performance metrics: false positive rate (FPR), true positive rate (TPR), and the area under the ROC curve (AUC).

fpr_rf, tpr_rf, thresholds_rf = roc_curve(Y_test, y_pred_rf)
auc_rf = auc(fpr_rf, tpr_rf)

Finally, let's plot the performance of our model using an ROC curve. This will help us understand the relationship between the true positives and false positives our model produces on the test set. If our model performs well, we should see a high true positive rate at a low false positive rate.

plt.plot(100*fpr_rf, 100*tpr_rf, label='Random Forest (area = {:.3f})'.format(auc_rf), linewidth=2, color=colors[0])
plt.xlabel('False positives [%]')
plt.ylabel('True positives [%]')
plt.xlim([0,30])
plt.ylim([60,100])
plt.grid(True)
ax = plt.gca()
ax.set_aspect('equal')
plt.title('Random Forest Model Performance')
plt.legend(loc='best')

Looks pretty good! Random Forest appears to be a good machine learning method for fraud detection. It has a consistently high true positive rate (TPR) across varying false positive rates (FPR).

Let's take a look at some more models and see how they compare.

Method 2: CatBoost

The next method we will try is CatBoost, an open source library for gradient boosting on decision trees. The CatBoost algorithm works by building decision trees sequentially, minimizing loss with each new tree that is built. The algorithm is known for producing great results without a lot of parameter tuning. Furthermore, CatBoost is designed to work well with imbalanced data, which makes it a great fit for fraud detection.

Let's initialize a default CatBoost model and fit it to our training data. Then, let's get prediction scores for our test set.

#create CatBoost Model
clf = CatBoostClassifier()
#fit to our data
clf.fit(X_train, Y_train)
#generate raw prediction scores (suitable for ROC analysis)
y_pred = clf.predict(X_test, prediction_type='RawFormulaVal')

Now, let's do some evaluation and see how our model performed.
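
As a minimal sketch, we can reuse the same ROC recipe from the random forest section; note that roc_curve accepts raw scores, not just probabilities, and the fpr_cb, tpr_cb, and auc_cb names below are just one possible convention.

#calculate ROC metrics for CatBoost from the raw prediction scores
fpr_cb, tpr_cb, thresholds_cb = roc_curve(Y_test, y_pred)
auc_cb = auc(fpr_cb, tpr_cb)
print('CatBoost AUC: {:.3f}'.format(auc_cb))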

At first glance, it looks like CatBoost has already topped the random forest, delivering an AUC value of .978. This is a significant lift compared to our random forest, which had an AUC value of only .928.

Method 3: Deep Neural Network (DNN)

The next method we will try is a deep neural network. A neural network is an incredibly powerful machine learning method inspired by how neurons work in the brain. Neural networks continue to be applied to many machine learning problems, such as image recognition, speech detection, and self-driving cars. These models are powerful because they learn complex relationships between input and output variables that are difficult for other models to identify. However, a downside of neural networks is that they can require a lot of fine-tuning to produce good results.

Unfortunately, I do not have the time (or expertise, quite frankly) to really dive into exactly how to build and train a powerful neural network. However, I have provided a baseline neural network that I built in Keras and have found to be successful in fraud detection. This neural network includes 3 Dense layers to learn features from our data and 2 Dropout layers to prevent overfitting.

#Design and compile model
DNN = Sequential()
DNN.add(Input(shape=(X_train.shape[1],)))
DNN.add(Dense(100, activation='relu'))
DNN.add(Dropout(0.5))
DNN.add(Dense(100, activation='relu'))
DNN.add(Dropout(0.5))
DNN.add(Dense(10, activation='relu'))
DNN.add(Dense(1, activation='sigmoid'))
DNN.compile(loss='binary_crossentropy', optimizer='adam', metrics=[keras.metrics.AUC(name='auc')])
#fit model
DNN.fit(X_train, Y_train, epochs=10)
#generate class probabilities
y_pred_DNN = DNN.predict(X_test).ravel()

After fitting our model and generating our prediction probabilities, let's see how well our DNN performed.
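
The same ROC recipe applies here as a minimal sketch (again, the fpr_dnn, tpr_dnn, and auc_dnn names are just a convention):

#calculate ROC metrics for the DNN
fpr_dnn, tpr_dnn, thresholds_dnn = roc_curve(Y_test, y_pred_DNN)
auc_dnn = auc(fpr_dnn, tpr_dnn)
print('DNN AUC: {:.3f}'.format(auc_dnn))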

Outstanding! It seems that both CatBoost and our DNN are our best performing models so far.

Method 4: Isolation Forest

The final method we will try is an Isolation Forest (or “iForest”), which takes a very different approach to fraud detection than our previous methods. So far we have only looked at supervised learning methods, where models are trained on truth-marked data. Isolation Forest, in contrast, is an unsupervised learning method: it does not require any truth-marking to make predictions and learns only from patterns it finds in the training data.

In the real world, companies do not always have truth-marked fraud data available. For example, if a company is deploying a solution for the first time and does not have many examples of fraud to use, it is not possible to train a supervised classifier. Furthermore, companies may not be able to share truth-marked fraud data due to security reasons.

Isolation Forest is a tree-based algorithm used for anomaly detection. The algorithm works by using decision trees to isolate outliers from the data. In theory, our fraud population should mostly consist of data points that look abnormal compared to the rest of our transactions. Thus, this is a natural solution to try when we don't have any labels.

Let's implement our iForest in Python. Notice that we do not need “Y_train”, as our model does not require any truth-marking to be trained.

#create iforest model
iforest = IsolationForest()
#fit to data
iforest.fit(X_train)
#generate anomaly scores (negated so that higher scores indicate likely fraud)
y_pred_iforest = - iforest.decision_function(X_test)

Finally, let's see how it performed.
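
One more pass through the same ROC recipe gives us metrics for the iForest (the fpr_if, tpr_if, and auc_if names are again just a convention):

#calculate ROC metrics for the Isolation Forest
fpr_if, tpr_if, thresholds_if = roc_curve(Y_test, y_pred_iforest)
auc_if = auc(fpr_if, tpr_if)
print('Isolation Forest AUC: {:.3f}'.format(auc_if))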

This method performed surprisingly well at identifying the fraud population. However, it produces more false positives than the other methods we tried. In situations where labeled data is unavailable, it could still be incredibly useful.

Conclusion

Below is a comparison of all of the machine learning methods we explored for credit card fraud detection.
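
As a minimal sketch, we can overlay the ROC curves computed above on a single chart, reusing the same plotting style as before:

#plot all four ROC curves on one chart for comparison
plt.plot(100*fpr_rf, 100*tpr_rf, label='Random Forest (area = {:.3f})'.format(auc_rf), linewidth=2, color=colors[0])
plt.plot(100*fpr_cb, 100*tpr_cb, label='CatBoost (area = {:.3f})'.format(auc_cb), linewidth=2, color=colors[1])
plt.plot(100*fpr_dnn, 100*tpr_dnn, label='DNN (area = {:.3f})'.format(auc_dnn), linewidth=2, color=colors[2])
plt.plot(100*fpr_if, 100*tpr_if, label='Isolation Forest (area = {:.3f})'.format(auc_if), linewidth=2, color=colors[3])
plt.xlabel('False positives [%]')
plt.ylabel('True positives [%]')
plt.xlim([0,30])
plt.ylim([60,100])
plt.grid(True)
plt.title('Model Comparison')
plt.legend(loc='best')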

Deep Neural Networks and CatBoost seemed to work the best at detecting fraud in our test dataset. However, multiple other methods could be useful depending on the context of the fraud problem. As more advanced machine learning methods continue to be developed, it will be interesting to see which methods work best at identifying fraud.

Github repo with all code/visuals here: https://github.com/ryankemmer/CreditCardFraudDetection
