# Financial Transaction Fraud Detection

## Logistic Regression, TensorFlow Keras, or XGBoost

- Australia lost **$574 Million** to fraudulent transactions in 2018, of which $487.5 Million occurred through card-not-present channels.
- In the same year, **$24.26 Billion** was lost to payment card fraud worldwide.
- In FRAUD THE FACTS 2019, the UK government reported that “unauthorised financial fraud losses across payment cards, remote banking and cheques totalled **£844.8 million** in 2018, an **increase of 16 per cent 😱** compared to 2017”.

The figures above underline the substantial importance of fraud detection capabilities in the banking and financial sector. Nonetheless, it is unfortunate that financial institutions are reluctant to switch to more advanced technologies such as machine-learning and deep-learning engines due to restrictions posed by regulatory parties, and banks have stuck to their original (and, fair to say, well-proven) rule-based systems.

Hopefully 🤞, thanks to recent advances in computing power and data availability, we’ll soon witness a change in this trend.

Here, I’m going to compare the performance of multiple tools, which are also mathematically different, in detecting fraudulent transactions. For that, I have used the dataset provided by the Machine Learning Group — ULB as part of the Credit Card Fraud Detection data on Kaggle.

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions.

The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. This extreme imbalance is the root of the complexity in this topic, and further down I explain how to overcome it.

Notes on the current article:

- For the first time, I used Google Colab to write Python code. However, I can’t highlight any wow in the experience compared to Jupyter Notebook. The main reason I used Colab was TensorFlow.
- **Logistic Regression**, **tf.Keras**, and **XGBoost** algorithms are used to predict fraudulent transactions, and **results are compared in terms of precision and recall**.

Complete Python code is available on my Github.

Let’s jump into it 😏

# 1. Import data from Google Drive

Load the `csv` file from my Google Drive and save it into a Pandas data frame.

```python
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
```

In the process of importing the `csv` file from Google Drive, you need to enter the file `ID`. Look at the 3rd method introduced here if you’d like to learn the details.

```python
# you need to enter your file id
id = '1grwIZR_LdcdyirULSoJ_VFhtuPpv00AB'

downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('Filename.csv')

import pandas as pd
data = pd.read_csv('Filename.csv')  # Dataset is now stored in a Pandas Dataframe
```

Our data has **284,807 records in 31 columns**, of which 30 columns encompass independent variables which theoretically explain the changes in our dependent variable. In our scenario, the dependent variable is a binary column showing whether a transaction was fraudulent or genuine.

Based on the data description (image below), there’s a variety of ranges in our data set. As a result, Data Normalisation is required to change the values of numeric columns in the dataset to a common scale.

# 2. Scaling the data frame

I separated the dependent variable from the independent variables. Remember that normalisation is only done on the independent variables.

```python
X_data = data.iloc[:, 0:30]
y_data = data.iloc[:, -1]
```

… and then used **Standard Scaler** to normalise the independent variables.

```python
from sklearn import preprocessing

standard_scaler = preprocessing.StandardScaler()
X_standard_scaled_df = standard_scaler.fit_transform(X_data)
```

# 3. Feature Extraction

In ML algorithms you are always dealing with the bias-variance trade-off, and it’s a constant challenge to overcome overfitting. I tried a couple of methods and found that **Principal Component Analysis** produced the best output in terms of dealing with the trade-off.

Towards Data Science has provided a brief and effective explanation of **PCA**. If you are at the outset of becoming a data scientist, I suggest you have a look… good stuff!

```python
from sklearn.decomposition import PCA

# Make an instance of the Model
pca = PCA(10)

# fit and transform data frame in one jump
pca_selected = pca.fit_transform(X_standard_scaled_df)
```

The result of PCA is interesting! There are 10 features extracted by the function. I convert the result into a Pandas data frame and have a look at the first 5 rows.
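That conversion can be sketched as follows; this is a minimal, self-contained example using random toy data in place of the real scaled feature matrix, and the column names `PC1`…`PC10` are my own choice, not from the original code:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real scaled feature matrix (the real one has 30 columns)
rng = np.random.default_rng(42)
X_standard_scaled_df = StandardScaler().fit_transform(rng.normal(size=(100, 30)))

pca = PCA(10)
pca_selected = pca.fit_transform(X_standard_scaled_df)

# Wrap the 10 extracted components in a DataFrame and inspect the first 5 rows
pca_df = pd.DataFrame(pca_selected, columns=[f"PC{i+1}" for i in range(10)])
print(pca_df.head())
print(pca_df.shape)  # one column per extracted component
```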

# 4. Train and Test Data Split

As mentioned before, the trickiest point about fraud data sets is the extremely imbalanced distribution of positive and negative instances. For instance, let’s have a look at our current data set using the `.value_counts()` function and then illustrate it.

Results show that there are only 492 instances of fraudulent transactions out of the total 284,807 records, which is only 0.1727% of all the samples.
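A toy sketch of that `.value_counts()` check; the series below is a hypothetical stand-in for the real `Class` column, just to show the shape of the output:

```python
import pandas as pd

# Hypothetical stand-in for the 'Class' column of the real data set
classes = pd.Series([0] * 9995 + [1] * 5, name="Class")

counts = classes.value_counts()
print(counts)                         # absolute counts per class, majority first
print(counts / len(classes) * 100)    # share of each class in percent
```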

*There it goes splitting an extremely imbalanced data set…*

Since I need to make sure that the handful of positive-class records is distributed proportionally in both training and test data, I take the following steps:

1. Separate all the positive records from the negative ones

```python
data_class_0 = ready_data[ready_data['Class'] == 0]
data_class_1 = ready_data[ready_data['Class'] == 1]
```

2. Split each class into a train and a test set, 67% and 33% respectively

```python
# Since the number of fraud transactions is too small compared to non-fraud,
# I make sure that they are distributed proportionally in both train and test set
from sklearn.model_selection import train_test_split

X_0 = data_class_0.iloc[:, 0:-1]  # independent columns
y_0 = data_class_0.iloc[:, -1]    # target column i.e. Class
X_1 = data_class_1.iloc[:, 0:-1]  # independent columns
y_1 = data_class_1.iloc[:, -1]    # target column i.e. Class

X_train_0, X_test_0, y_train_0, y_test_0 = train_test_split(X_0, y_0, test_size=0.33, random_state=42)
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X_1, y_1, test_size=0.33, random_state=42)
```

3. Join them back to get one train and one test set, each encompassing both positive and negative classes proportionally

```python
X_train = pd.concat([X_train_0, X_train_1])
y_train = pd.concat([y_train_0, y_train_1])
X_test = pd.concat([X_test_0, X_test_1])
y_test = pd.concat([y_test_0, y_test_1])
```

At this point, if we look at the data set, we have the following:

Considering the imbalanced data set, the next step is to balance our training set `X_train`.

# 5. Balance training data set

Before getting into what I did, you might like to have a look at the concepts of **Over-sampling** and **Under-sampling** in general. Here is a good explanation by “Machine Learning Mastery”.

I used the **SMOTE()** function to balance the current data set. In a nutshell, SMOTE over-samples the minority class with synthetic examples until a desired ratio of minority to majority samples is reached after resampling. Check here for the full documentation. To do so, we simply enter:

```python
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
```

Now, if we compare the dataset before and after **SMOTE**, here we see the magic.
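The comparison can be printed with Python’s `collections.Counter`; below is a minimal self-contained sketch in which the two toy label lists stand in for the real `y_train` and `y_res`:

```python
from collections import Counter

# Hypothetical stand-ins for the labels before and after SMOTE resampling
y_train = [0] * 190491 + [1] * 329
y_res = [0] * 190491 + [1] * 190491

print('Original dataset shape %s' % Counter(y_train))
print('Resampled dataset shape %s' % Counter(y_res))
```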

Output:

```
Original dataset shape Counter({0: 190491, 1: 329})
Resampled dataset shape Counter({0: 190491, 1: 190491})
```

At this stage, we have (…finally 😫) got the data set ready for modelling.

# 6. Fraud Detection Model

For this purpose, I used 3 algorithms and compared their results:

## 6.1. Logistic Regression

As discussed above, logistic regression is the most widely accepted and used of these algorithms in the real-life banking industry.

```python
from sklearn.linear_model import LogisticRegression

logisticRegr = LogisticRegression()
logit_model = logisticRegr.fit(X_train, y_train)
logit_predict = logisticRegr.predict(X_test)
```

Here is the accuracy output of Logistic Regression:

In:

```python
from sklearn.metrics import classification_report

print(classification_report(y_test, logit_predict))
```

Out:

## 6.2. Deep Learning Neural Network — TensorFlow.Keras

The second algorithm is an Artificial Neural Network, for which I used TensorFlow Keras.

Input and hidden layers

- Looking at the code on my Github, you’ll see that I tried building the neural network twice. Once I set units = 10 for the hidden layer, and the second time I set it to 32 (which is the more common practice). The network with 32 units in the hidden layer resulted in higher accuracy.
- I also played with the number of epochs, which is another influential hyper-parameter for managing overfitting. I found that 10 epochs ended up with an overfitted model, while 5 was reasonably acceptable.
- The other hyper-parameter is the Activation Function. At the most basic level, an activation function decides whether a **neuron** should be fired or not. It accepts the weighted sum of the inputs plus a bias as its input. *Step function*, *Sigmoid*, *ReLU*, *Tanh*, and *Softmax* are examples of activation functions. MissingLink has provided a good summary of their story in 7 Types of Neural Network Activation Functions.
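As a side note, the activation functions listed above are simple enough to sketch in NumPy; these are my own toy implementations for illustration, not the ones Keras uses internally:

```python
import numpy as np

def step(x):
    # 1 if the input is non-negative, else 0
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    # squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # passes positives through, zeroes out negatives
    return np.maximum(0.0, x)

def tanh(x):
    # squashes any real value into (-1, 1)
    return np.tanh(x)

def softmax(x):
    # turns a vector of scores into probabilities that sum to 1
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))
print(softmax(z))
```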

Output layer

The output will be a single binary unit, for which I set the Sigmoid (👆) activation function.

```python
from tensorflow import keras

# Initialising the ANN
classifier = keras.Sequential()

# Adding the input layer and the first hidden layer
classifier.add(keras.layers.Dense(units=32, kernel_initializer='uniform', activation='relu', input_dim=10))

# Adding the output layer
classifier.add(keras.layers.Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))

# Compiling the ANN
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# And finally, fitting the ANN to the Training set
model = classifier.fit(X_train.values, y_train.values, batch_size=128, epochs=5)
```

```python
# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)
```

Output:

```
93987/93987 [==============================] - 3s 31us/sample - loss: 0.0041 - accuracy: 0.9992
```

Let’s see how our model performed:

`print(classification_report(y_test, y_pred))`

A couple of extras:

**Confusion Matrix**

**ROC AUC Diagram**
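Both extras can be produced with scikit-learn. Here is a minimal sketch with hypothetical labels and scores standing in for `y_test` and the model’s predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.8, 0.3, 0.9, 0.6, 0.2, 0.4])
y_pred = (y_score > 0.5).astype(int)

print(confusion_matrix(y_true, y_pred))  # rows: actual class, cols: predicted class
print(roc_auc_score(y_true, y_score))    # AUC is computed from the raw scores
```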

## 6.3. XGBoost

XGBoost stands for e**X**treme **G**radient **B**oosting and builds on the gradient boosting features of the scikit-learn and R implementations, with new additions like regularisation. Again, have a look at MachineLearningMastery for further explanations.

XGBoost is,

> an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data.

> an implementation of gradient boosted decision trees designed for speed and performance.

There are two important hyper-parameters you need to keep an eye on: `learning_rate` and `n_estimators`. They help you fight overfitting and improve accuracy. My learning was that a 1% learning rate produced a better “recall” than 10%. And, to be honest, I believe 10,000 `n_estimators` was overkill, but I did it anyway 😅.

```python
from xgboost import XGBClassifier

# Learning rate = 0.01
XGB_classifier = XGBClassifier(n_estimators=10000, learning_rate=.01, maximize=True)
XGB_classifier.fit(X_train, y_train, eval_metric='aucpr')
```

Output:

Next is using the model to predict the dependent variable based on the independent test data…

`XGB_classifier_predict_smote = XGB_classifier.predict(X_test)`

… and compare the prediction with actual dependent test set.

`classification_report(y_test, XGB_classifier_predict_smote)`

Finally goes the result 👇

# Comparing results and *conclusion*

“What is the best accuracy measure in our scenario?”

**hmmm… 🤔 any idea?!**

Why did I even pose this question? Because, remember… banking is a sensitive industry and there are huge losses every year.

To answer this question, I would like to refer you back to our friends who build the Confusion Matrix… **the four (in)famous TP, FP, TN, and FN**.

If you think about which one of the four is most important to us, you’ll know what the best accuracy measure is. Basically, since we are dealing with fraudulent transactions, it’s critically important that we do not flag a fraudulent transaction as genuine. The opposite (flagging a non-fraud as fraud) is also costly, but not as critical.

Therefore, **the priority would be minimising the number of False Negatives**, meaning *minimising the number of “fraudulent transactions which are marked as genuine”*. Accordingly, we know that we have to push **Recall** as high as possible.
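To make that concrete, here is a tiny worked example of precision and recall computed from the four confusion-matrix counts; the numbers are hypothetical, chosen only to illustrate the formulas:

```python
# Hypothetical confusion-matrix counts for a fraud model
TP, FP, TN, FN = 140, 20, 93800, 27

precision = TP / (TP + FP)  # of all transactions flagged as fraud, how many were real frauds
recall = TP / (TP + FN)     # of all real frauds, how many we actually caught

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
```

Every fraud counted in FN is a fraudulent transaction marked as genuine, which is exactly the cost we want to minimise; that is why recall, not precision, is the headline metric here.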

This Idiot’s Guide to Precision, Recall and Confusion Matrix helped me big time. You can also have a look if you feel like you’re not following what I’m saying here.

Comparing the results of the three algorithms, we see that **Precision** hasn’t changed drastically, while there’s been a **substantial improvement** in **Recall** from Logistic Regression to tf.Keras and from tf.Keras to XGBoost.

… and the prize 🥇🏆 goes to XGBoost for its humble, mathematically powerful gradient-boosted engine.