Madhan Balasubramanian
Apr 15, 2018

CREDIT CARD FRAUD ANALYSIS USING THE RANDOM FOREST ALGORITHM

Introduction

A large number of transactions are made every day; most are legitimate and a few are fraudulent. To catch the fraudulent ones, the banking sector has long applied algorithms that flag suspicious transactions, the most common approach being anomaly detection (outlier detection).

The data set (source: http://mlg.ulb.ac.be) consists of 284,807 transactions made by European cardholders in September 2013, of which 492 are identified as fraudulent. Due to privacy concerns, PCA (Principal Component Analysis, a dimensionality-reduction technique that preserves most of the variation in the data) was applied to obtain 28 principal components (V1 to V28); only the variables Time and Amount are left untransformed.
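To make the PCA idea concrete, here is a minimal sketch on synthetic data (not the credit card set): two highly correlated variables collapse onto essentially one principal component that retains almost all of the variance.

```r
# Toy PCA: two correlated variables are summarised by one component
set.seed(42)
x <- rnorm(100)
d <- data.frame(a = x, b = x + rnorm(100, sd = 0.1)) # b is a noisy copy of a
p <- prcomp(d, scale. = TRUE) # centre and scale, then rotate
summary(p) # PC1 carries ~99% of the variance; PC2 is almost pure noise
```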

Objective

Our objective is to build a model that predicts which transactions are fraudulent with high accuracy.

Plan

We have an imbalanced data set: our response variable “Class” has a highly skewed distribution. Such imbalance biases a model toward the majority class, a problem frequently faced in classification.

To build a good model with higher accuracy, we should first balance the data set; all of this will be done in R.

This report applies a classification algorithm, the random forest model, to detect fraudulent transactions in R (programming language).

Processing

Part 1: Data Exploration

All components are already PCA-transformed, so no data cleaning is needed here. In real-world data sets, however, data cleaning (munging) is usually a major task.

As far as programming is concerned, we start by loading the libraries needed for our task.

library(unbalanced) # contains the SMOTE method (over-samples by using bootstrapping and k-nearest neighbours to synthetically create additional observations) to generate synthetic examples of the minority class in an unbalanced-class data set

library(caret) #for classification and regression training

library(dplyr) #for manipulating data set

library(ROCR) #for visualizing the performance of scoring classifiers: ROC graphs, sensitivity/specificity curves, lift charts, and precision/recall plots

library(randomForest) #for Classification and regression

library(rpart.plot) # for plotting rpart trees

# Now that we have loaded the necessary libraries, let us load the data

credit <- read.csv("C:/Users/Maddy/Desktop/Creditcard/creditcard.csv")

# Our data is approximately 144 MB, so loading will take a few seconds rather than completing immediately
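As a side note, base read.csv is noticeably slow on files this size. Assuming the data.table package is installed (it is not used in the original analysis), its fread() typically loads the same CSV several times faster:

```r
# Faster alternative for large CSVs (requires the data.table package)
library(data.table)
credit <- as.data.frame(fread("C:/Users/Maddy/Desktop/Creditcard/creditcard.csv"))
```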

Explanation for variables:

1) Time: seconds elapsed between each transaction and the first transaction in the data set

2) V1…V28: principal components obtained through PCA (Principal Component Analysis, a dimensionality-reduction technique)

3) Amount: transaction amount

4) Class: response variable (1 = fraud, 0 = legitimate)

# Since we have loaded the data, we should check whether it is balanced or not

table(credit$Class)/nrow(credit)

#           0           1
# 0.998272514 0.001727486

#The output tells us the data is imbalanced.

# To balance the data, we can use the ubSMOTE() function from the unbalanced library

?ubSMOTE #synthetic minority over-sampling technique

balance <- ubSMOTE(X = credit[,-31], Y = as.factor(credit$Class),

perc.over=200, perc.under=800, verbose=TRUE)

balancedf <- cbind(balance$X, Class = balance$Y)

table(balancedf$Class)/nrow(balancedf)

#         0         1
# 0.8421053 0.1578947

Now the balancedf data frame holds balanced data.
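The 0.842/0.158 split follows directly from the ubSMOTE parameters: with 492 fraud cases, perc.over = 200 creates 2 synthetic minority cases per original (984 new, 1,476 total), and perc.under = 800 keeps 8 majority cases per synthetic case (7,872). A quick arithmetic check:

```r
# Reproduce the class proportions implied by ubSMOTE's parameters
minority <- 492
synthetic <- minority * 200 / 100        # perc.over: 984 new minority cases
minority_total <- minority + synthetic   # 1476
majority_total <- synthetic * 800 / 100  # perc.under: 7872 retained majority cases
majority_total / (majority_total + minority_total) # ~0.8421
minority_total / (majority_total + minority_total) # ~0.1579
```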

Part 2: Data Exploration — Visualization

We should always plot the data to understand it more clearly.

pdf("credit.pdf")

for (i in seq(from = 1, to = 30, by = 4))
{
  cols <- i:min(i + 3, 30) # guard: there are only 30 predictor columns, so clamp the last group
  show(
    featurePlot(
      x = balancedf[, cols],
      y = balancedf$Class, plot = "density",
      adjust = 1.5, pch = "|", layout = c(2, 2), auto.key = TRUE
    )
  )
}

dev.off()

The output plots for comparison are shown below.

As we can see from the plots, columns V15, V20, V22, V24, V25 and V26 have little impact on our target.

Hence, we can exclude these columns and split the data 80/20 into training and testing sets.

newdata <- balancedf[, -c(16, 21, 23, 25, 26, 27)] # V15, V20, V22, V24, V25, V26 (Time is column 1, so Vk sits in column k+1)

set.seed(123) # for a reproducible split

sample <- sample(2, nrow(newdata),
                 replace = T,
                 prob = c(0.8, 0.2))

train <- newdata[sample==1,]

test <-newdata[sample==2,]

Now that we have split the data, we are ready to apply the random forest algorithm to our balanced data.

balancedf.rf <- randomForest(Class~.,train,

ntree=300,

importance=T, do.trace=T)

balancedf.rf$confusion

#       0    1 class.error
# 0  6204   10 0.001609269
# 1   117 1086 0.097256858
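As a sanity check, the class.error column is simply each row's misclassified count divided by that row's total:

```r
# Recompute class.error from the printed confusion matrix
err_legal <- 10 / (6204 + 10)    # fraction of legitimate transactions flagged as fraud
err_fraud <- 117 / (117 + 1086)  # fraction of fraud cases missed
round(c(err_legal, err_fraud), 9)
```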

plot(balancedf.rf)

Random Forest Plot with 300 trees.

The plot indicates that beyond about 50 decision trees there is no significant reduction in the error rate.

AUC (Area Under the Curve) graph:

We use a ROC (Receiver Operating Characteristic) curve to measure the quality of a classification prediction.

The ROC curve is formed by plotting the true-positive rate (sensitivity) against the false-positive rate (1 − specificity).

The larger the area under the ROC curve, the higher the accuracy.
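The AUC also has a handy interpretation: it is the probability that a randomly chosen fraud case receives a higher score than a randomly chosen legitimate case. A minimal hand-rolled check on made-up scores (not from our model):

```r
# AUC as the fraction of correctly ordered (positive, negative) score pairs
auc_pairs <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")) # ties count half
}
auc_pairs(c(0, 0, 0, 1, 1), c(0.1, 0.75, 0.35, 0.8, 0.7)) # 5 of 6 pairs ordered correctly
```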

If we are focused on squeezing out more accuracy, we should try several suitable algorithms for the problem; stacked generalization is one approach to achieve this.

“Stacked Generalization or stacking is an ensemble technique that uses a new model to learn how to best combine the predictions from two or more models trained on your data set.”
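A minimal stacking sketch on synthetic data (all data and variable names here are illustrative, not part of the original analysis): two base learners score a holdout set, and a logistic regression learns how to weight their predictions.

```r
# Stacking sketch: combine a logistic regression and a decision tree
library(rpart)
set.seed(1)
n <- 400
x1 <- rnorm(n); x2 <- rnorm(n)
y <- factor(as.integer(x1 + x2 + rnorm(n) > 0))
df <- data.frame(x1, x2, y)
base_idx <- 1:200   # train base models here
meta_idx <- 201:400 # train the combiner on their predictions here

m_glm <- glm(y ~ x1 + x2, data = df[base_idx, ], family = binomial)
m_tree <- rpart(y ~ x1 + x2, data = df[base_idx, ])

p_glm <- predict(m_glm, df[meta_idx, ], type = "response")
p_tree <- predict(m_tree, df[meta_idx, ])[, "1"]

meta <- glm(y ~ p_glm + p_tree,
            data = data.frame(p_glm, p_tree, y = df$y[meta_idx]),
            family = binomial)
summary(meta)$coefficients # weights the meta-learner assigns to each base model
```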

ROSE (Random Over-Sampling Examples): here we use this package's roc.curve() function to evaluate a decision-tree (rpart) baseline.

> library(ROSE) # provides roc.curve(); not loaded earlier

> fit1 <- rpart(Class~., data = train)

> pred.fit1 <- predict(fit1, newdata = test)

> head(pred.fit1)

               0          1
106201 0.9737249 0.02627512
93369  0.9737249 0.02627512
10566  0.9737249 0.02627512
45402  0.9737249 0.02627512
723    0.9737249 0.02627512
180962 0.9737249 0.02627512

> roc.curve(test$Class, pred.fit1[,2], plotit = T)

Area under the curve (AUC): 0.924

ROSE AUC curve

Results:

· Fraudulent transactions are rare relative to legitimate transactions.

· The random-forest model built on the credit card data set showed 92.66% accuracy in predicting fraudulent transactions.

· The rpart model evaluated with ROSE gives a similar, slightly lower figure (AUC = 0.924).

A neural-network model could be built to achieve even higher accuracy on this type of problem.

Credits: James Chen, @Krishna Shah, the Medium community and the Kaggle community.

P.S.: I am a newbie and a constant learner, exploring machine learning further.
