Credit Card Fraud Detection

Kamlesh Solanki
Published in Analytics Vidhya
9 min read · Mar 18, 2021

In this article we are going to solve the credit card fraud detection problem using various machine learning algorithms, compare them all, and find out which one works best for this problem.

The article is divided into the following parts:

  1. Problem understanding
  2. Data review
  3. Data processing
  4. Feature Selection
  5. Model Building
  6. Under-sampling
  7. Oversampling
  8. Summary

So let’s get into it.

1. Problem understanding

Let's first understand what credit card fraud is.

Credit card fraud is the criminal use of someone else's personal credentials, as well as their credit standing, to borrow money or use a credit card to purchase goods or services with no intention of repaying the debt.

So it is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

2. Data review

The dataset contains transactions made with credit cards by European cardholders. You can find the entire dataset on Kaggle's Credit Card Fraud Detection competition page here. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions.

There are 31 columns in the dataset. Features V1, V2, …, V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. There is also a column named Class whose value is 0 if the transaction is legitimate (not fraud) and 1 if it is fraud.
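To get a feel for this imbalance, we can load the data and count the classes. This is a minimal sketch, assuming the Kaggle file was downloaded as creditcard.csv:

import pandas as pd

# load the Kaggle dataset (file name creditcard.csv is assumed)
data = pd.read_csv('creditcard.csv')
print(data.shape)                    # (284807, 31)
print(data['Class'].value_counts())  # 284315 genuine (0) vs 492 fraud (1)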

3. Data processing

First of all, let's visualize the data.

Data

Looking at the data, the values of the Amount feature lie in the range 0 to 25,691.16, which is obviously a very wide range. So we rescale it by applying the standardization technique (zero mean, unit variance).

So what we are doing is

from sklearn.preprocessing import StandardScaler

data['scaled_Amount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1,1))
data = data.drop(['Amount'], axis=1)

So by using the StandardScaler class from sklearn's preprocessing module, we apply this transformation to the Amount column of the dataset.

What is StandardScaler?

StandardScaler rescales a feature to have zero mean and a standard deviation of one. It only shifts and scales the values; it does not change the shape of the distribution.

Note that we use reshape(-1,1) to convert the 1-D array into the 2-D array that StandardScaler expects.

scaled amount

You can see the output in the image above.
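As a quick sanity check of the claim about zero mean and unit standard deviation, we can inspect the new column (a minimal sketch using the scaled_Amount column created above):

# standardization: z = (x - mean) / std
print(data['scaled_Amount'].mean())  # very close to 0
print(data['scaled_Amount'].std())   # very close to 1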

4. Feature Selection

Time

Let's plot a graph comparing legitimate and fraudulent transactions with respect to the Time variable.

data["Time_Hr"] = data["Time"]/3600 # convert to hours
print(data["Time_Hr"].tail(5))
fig, (ax1, ax2) = plt.subplots(2, 1, sharex = True, figsize=(10,6))
ax1.hist(data.Time_Hr[data.Class==0],bins=48,color='g',alpha=0.5)
ax1.set_title('Genuine')
ax2.hist(data.Time_Hr[data.Class==1],bins=48,color='r',alpha=0.5)
ax2.set_title('Fraud')
plt.xlabel('Time (hrs)')
plt.ylabel('# Transactions')
Time variable

Looking at the graph, the Time variable does not give us much information to tell whether a transaction is fraudulent. Since it has hardly any predictive power, we can simply drop it from the dataset.

Now let's check the Amount variable.

fig, (ax3,ax4) = plt.subplots(2,1, figsize = (10,6), sharex = True)
ax3.hist(data.Amount[data.Class==0],bins=50,color='g',alpha=0.5)
ax3.set_yscale('log')  # log scale to see the tails
ax3.set_title('Genuine')
ax3.set_ylabel('# transactions')
ax4.hist(data.Amount[data.Class==1],bins=50,color='r',alpha=0.5)
ax4.set_yscale('log')  # log scale to see the tails
ax4.set_title('Fraud')
ax4.set_xlabel('Amount ($)')
ax4.set_ylabel('# transactions')
Amount variable

You can clearly see a big difference between the two graphs, i.e. Genuine and Fraud. The main point is that genuine (legitimate) transactions include amounts above 10k, whereas no fraudulent transaction exceeds 10k.

So we can say the Amount variable has predictive power.

Similarly, we need to check all features V1 to V28; if a variable shows no difference between legitimate and fraudulent transactions, we can drop it. You can see the analysis of every variable in my github repo; a sketch of the idea is shown below.
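A minimal sketch of that per-feature check, plotting the class-conditional histogram of each PCA component (the layout choices here are arbitrary):

# compare the class-conditional distributions of each PCA feature
v_features = ['V' + str(i) for i in range(1, 29)]
fig, axes = plt.subplots(len(v_features), 1, figsize=(8, 4*len(v_features)))
for ax, col in zip(axes, v_features):
    ax.hist(data[col][data.Class == 0], bins=50, color='g', alpha=0.5, label='Genuine')
    ax.hist(data[col][data.Class == 1], bins=50, color='r', alpha=0.5, label='Fraud')
    ax.set_title(col)
    ax.legend()
plt.tight_layout()
plt.show()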

Now that we have selected all the trainable features, we can split the data into train and test sets.

Split Data

We are going to split the data into an 80% training set and a 20% test set.

from sklearn.model_selection import train_test_split

def split_data(df, drop_list):
    # drop the features we decided to discard
    df = df.drop(drop_list, axis=1)
    X = df.drop(['Class'], axis=1)  # features
    Y = df['Class']                 # target
    xData = X.values
    yData = Y.values
    xTrain, xTest, yTrain, yTest = train_test_split(
        xData, yData, test_size=0.2, random_state=42)
    return xTrain, xTest, yTrain, yTest

drop_list = ['V13','V15','V22','V26','V25','V23']
x_Train, x_Test, y_Train, y_Test = split_data(data, drop_list)

We also create a helper function print_scores which takes y_Test, y_pred and y_pred_prob (prediction probabilities) and prints all the metrics for us, i.e. precision, recall, F1 score, the area under the precision-recall curve (AUC score), the ROC AUC score, and the kappa score (Cohen's kappa).

from sklearn.metrics import (precision_recall_curve, precision_score, recall_score,
                             f1_score, auc, roc_auc_score, cohen_kappa_score)

# print_scores function for printing scores
def print_scores(y_test, y_pred, y_pred_prob):
    precision, recall, _ = precision_recall_curve(y_test, y_pred_prob[:,1])
    print('precision_score : ', precision_score(y_test, y_pred))
    print('recall_score : ', recall_score(y_test, y_pred))
    print('f1 score : ', f1_score(y_test, y_pred))
    print('AUC score : ', auc(recall, precision))
    print('ROC_AUC score : ', roc_auc_score(y_test, y_pred_prob[:,1]))
    print('kappa : ', cohen_kappa_score(y_test, y_pred))

5. Model Building

Specifically, I have built 7 different models.

1. Naïve Bayes

  • It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
  • Specifically, it applies Bayes' theorem directly from probability theory.
  • E.g. we calculate the probability that a transaction is fraudulent given its features, and likewise the probability that it is legitimate; the final prediction is whichever of the two is larger (see the small sketch after this list).
  • For more information you can visit here.
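Before fitting the model, here is a tiny numeric sketch of that idea; the likelihood values below are made up purely for illustration (only the fraud prior comes from the dataset):

# Bayes' theorem: P(class | x) is proportional to P(x | class) * P(class)
p_fraud = 492 / 284807        # prior probability of fraud, from the dataset
p_genuine = 1 - p_fraud       # prior probability of a genuine transaction

# hypothetical likelihoods of the observed features under each class
p_x_given_fraud = 0.02
p_x_given_genuine = 0.0001

score_fraud = p_x_given_fraud * p_fraud
score_genuine = p_x_given_genuine * p_genuine

# the prediction is the class with the larger (unnormalized) posterior
prediction = 1 if score_fraud > score_genuine else 0
print(prediction)  # 0 for these made-up numbers

Now the actual model with sklearn's GaussianNB: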
from sklearn.naive_bayes import GaussianNB
NBclf=GaussianNB()
NBclf.fit(x_Train,y_Train)
NB_pred,NB_pred_prob=NBclf.predict(x_Test),NBclf.predict_proba(x_Test)
print_scores(y_Test,NB_pred,NB_pred_prob)

2. Logistic Regression

  • Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
  • In simple terms, it uses the logistic (sigmoid) function to carry out the prediction; a small sketch follows this list.
  • You can learn more about Logistic Regression over here.
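A minimal sketch of the logistic (sigmoid) function itself, just to show how a linear score is squashed into a probability (the scores below are arbitrary):

import numpy as np

def sigmoid(z):
    # maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, 0.0, 2.5])  # arbitrary linear scores w·x + b
print(sigmoid(z))               # approximately [0.018, 0.5, 0.924]

And the sklearn model itself: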
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C = 0.01, penalty = 'l2',max_iter=1000)
lr.fit(x_Train, y_Train)
lr_pred,lr_prob=lr.predict(x_Test),lr.predict_proba(x_Test)
print_scores(y_Test,lr_pred,lr_prob)

3. Linear Discriminant Analysis

  • Linear discriminant analysis, normal discriminant analysis, or discriminant function analysis is a generalization of Fisher’s linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda_clf=LinearDiscriminantAnalysis()
lda_clf.fit(x_Train,y_Train)
lda_pred, lda_prob = lda_clf.predict(x_Test),lda_clf.predict_proba(x_Test)
lda_precision,lda_recall,_ = precision_recall_curve(y_Test,lda_prob[:,1])
print_scores(y_Test,lda_pred,lda_prob)

4. Decision Tree

Decision Tree
  • I hope the image above gives you an idea of what a decision tree is: it splits the data with a sequence of simple feature thresholds until it reaches a prediction at a leaf.
from sklearn.tree import DecisionTreeClassifier
Dtree = DecisionTreeClassifier()
Dtree.fit(x_Train,y_Train)
DT_preds,DT_probs = Dtree.predict(x_Test),Dtree.predict_proba(x_Test)
print_scores(y_Test,DT_preds,DT_probs)

5. Random Forest

A Random Forest (Random Forest classifier) is an ensemble of many such decision trees: each tree is trained on a random subset of the data and features, and their predictions are combined.

from sklearn.ensemble import RandomForestClassifier
RF_clf = RandomForestClassifier()
RF_clf.fit(x_Train, y_Train)
RF_pred,RF_prob = RF_clf.predict(x_Test),RF_clf.predict_proba(x_Test)
print_scores(y_Test,RF_pred,RF_prob)
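To see that the fitted forest really is a collection of decision trees, you can inspect its estimators_ attribute (a quick check on the model fitted above):

# each element of estimators_ is an individual fitted DecisionTreeClassifier
print(len(RF_clf.estimators_))            # 100 trees with recent scikit-learn defaults
print(RF_clf.estimators_[0].get_depth())  # depth of the first tree in the forest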

6. Support Vector Machine

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space(N — the number of features) that distinctly classifies the data points.

Its concept (the kernel trick) is to implicitly map the N-dimensional input into a higher-dimensional space, find a separating hyperplane there, and project it back into the original space. That gives us the decision boundary.

from sklearn.svm import SVC

SVMclf = SVC(probability=True)
SVMclf.fit(x_Train, y_Train)
SVM_pred, SVM_prob = SVMclf.predict(x_Test), SVMclf.predict_proba(x_Test)
print_scores(y_Test, SVM_pred, SVM_prob)
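Note that SVC training scales poorly with the number of samples, so fitting it on the full training set can take very long. One possible workaround, shown here only as a sketch (the 20,000-row subsample size is an arbitrary choice, not necessarily what was done originally), is to train it on a random subsample:

import numpy as np

# fit the SVM on a random subsample to keep training time manageable
rng = np.random.default_rng(42)
idx = rng.choice(len(x_Train), size=20000, replace=False)
SVM_small = SVC(probability=True)
SVM_small.fit(x_Train[idx], y_Train[idx])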

7. Deep neural network

I have built a 5-layer neural network, which gave really great results.

But you can fine-tune it to get even better performance.

I am using 4 dense layers and 1 dropout layer with a rate of 0.5, the Adam optimizer, and binary_crossentropy loss, because this is a binary classification problem.

import keras
from keras import layers
from keras.models import load_model
# Deep Neural Network

model = keras.Sequential([
layers.Dense(input_dim = 23,units= 23, activation = 'relu'),
layers.Dense(units = 20,activation = 'relu'),
layers.Dropout(0.5),
layers.Dense(units = 16,activation = 'relu'),
layers.Dense(units =1, activation = 'sigmoid')])
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
model.fit(x_Train, y_Train, batch_size = 16, epochs = 5)
Dnn_prob = model.predict(x_Test).ravel()  # sigmoid outputs in [0, 1]
Dnn_preds = Dnn_prob.round()              # threshold the probabilities at 0.5
DNN_precision, DNN_recall, _ = precision_recall_curve(y_Test, Dnn_prob)
print('precision_score :', precision_score(y_Test, Dnn_preds))
print('recall_score : ', recall_score(y_Test, Dnn_preds))
print('f1_score : ', f1_score(y_Test, Dnn_preds))
print('AUC_score : ', auc(DNN_recall, DNN_precision))
print('ROC_AUC_score : ', roc_auc_score(y_Test, Dnn_prob))
print('kappa : ', cohen_kappa_score(y_Test, Dnn_preds))

By fitting all the models on the data, we get the results below.

Performance

Looking at the performance table above, you can definitely say that Random Forest is winning the race.

Let's see the precision-recall curve for our winner, Random Forest.

RF_precision, RF_recall, _ = precision_recall_curve(y_Test, RF_prob[:,1])
plt.figure(figsize=(8,6))
plt.title('Precision Recall Curve of Random Forest Classifier')
plt.plot(RF_recall, RF_precision, label='Random Forest', color='violet')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend(loc='lower left')
plt.show()
Precision-Recall graph

Similarly, you can plot the curve for any classifier; I am just showing the best one among all.

You can get entire code from here.

But there is a problem in our dataset

Pie chart of data

Here you can see the data is highly skewed towards the negative class, i.e. genuine (legitimate) transactions.

Let's see what we can do to overcome this problem.

Since the data is so heavily biased towards legitimate transactions, we need to do something to balance the class distribution.

There are two techniques we can apply: undersampling and oversampling.

Let's look at undersampling first.

6. Under-Sampling

Undersampling consists of reducing the data by eliminating examples belonging to the majority class, with the objective of equalizing the number of examples of each class.

So we are going to remove some of the data from Legitimate transactions in order to balance the data.

Let's see the code for undersampling. Note that we randomly choose which genuine transactions to keep and delete the rest so that the two classes have the same number of examples. We do this at the cost of some useful information. 😢

df=data
fraud_ind = np.array(df[df.Class == 1].index)
gen_ind = df[df.Class == 0].index
n_fraud = len(df[df.Class == 1])
# random selection from genuine class
random_gen_ind = np.random.choice(gen_ind, n_fraud, replace = False)
random_gen_ind = np.array(random_gen_ind)
# merge two class indices: random genuine + original fraud
under_sample_ind = np.concatenate([fraud_ind,random_gen_ind])
# Under sample dataset
undersample_df = df.iloc[under_sample_ind,:]
y_undersample = undersample_df['Class'].values #target
X_undersample = undersample_df.drop(['Class'],axis=1).values #features
print("# transactions in undersampled data: ", len(undersample_df))
print("% genuine transactions: ",len(undersample_df[undersample_df.Class == 0])/len(undersample_df))
print("% fraud transactions: ", sum(y_undersample)/len(undersample_df))

By applying undersampling we are left with just 984 observations.
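The models are then retrained on the undersampled data in the same way as before; here is a minimal sketch with Random Forest, assuming the X_undersample and y_undersample arrays created above:

# re-split and re-fit on the balanced, undersampled data
xTr_u, xTe_u, yTr_u, yTe_u = train_test_split(
    X_undersample, y_undersample, test_size=0.2, random_state=42)
RF_under = RandomForestClassifier()
RF_under.fit(xTr_u, yTr_u)
print_scores(yTe_u, RF_under.predict(xTe_u), RF_under.predict_proba(xTe_u))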

Okay let’s see the performance of algorithms now.

Undersampling performance

Overall we are getting some good results with undersampling. But now there is strong competition between Random Forest and the DNN. Still, Random Forest performs best among all.

Now let's see its precision-recall graph.

Precision-Recall graph of random forest after applying Undersampling

You can see it is doing a lot better now.

Let's also see how it performs with oversampling.

7. Oversampling

For oversampling we use SMOTE (Synthetic Minority Oversampling Technique). SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.

Here we add synthetic samples to the fraud class until the classes are balanced.

from imblearn.over_sampling import SMOTE

drop_list = ['V13','V15','V22','V26','V25','V23']
Y = data['Class']
X = data.drop(['Class'] + drop_list, axis = 1)
X_resample, y_resample = SMOTE().fit_resample(X, Y)
print(X_resample.shape)

That's it, our data is oversampled now. The total number of samples after oversampling is 568,630, and the class distribution is now 50-50.
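As with undersampling, the models are then retrained on the resampled data; here is a minimal sketch with Random Forest, assuming the X_resample and y_resample arrays above:

# re-split and re-fit on the SMOTE-balanced data
xTr_o, xTe_o, yTr_o, yTe_o = train_test_split(
    X_resample, y_resample, test_size=0.2, random_state=42)
RF_over = RandomForestClassifier()
RF_over.fit(xTr_o, yTr_o)
print_scores(yTe_o, RF_over.predict(xTe_o), RF_over.predict_proba(xTe_o))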

Oversampling performance

The performance above is definitely satisfying because all the algorithms are doing significantly well, but the notable thing is that Random Forest scores close to 100% on almost every measure.

Another important thing to note is that for oversampling I am not using the SVM, because the data size has increased to 568,630 samples, and training an SVM on that many samples would keep running almost endlessly.

Now let’s see precision-recall curve.

Precision-Recall curve of random forest after applying oversampling

It shows a nearly perfect classifier for our problem.

The concluding statement is that, after applying all the techniques, we get the best results with oversampling, and the best algorithm for this problem, based on the results above, is Random Forest.

8. Summary

We have created 7 models to solve this problem, analyzed their performance, and found that Random Forest is by far the best according to the results we got. That's the journey of credit card fraud detection. 🍻

You can get entire code at my github repository here.
