Fraud Detection using Machine Learning and dealing with Imbalanced Data
Credit Card Fraud — Every business’s nightmare!
For years, we’ve been using credit cards everywhere, for everything - be it at the gas station, the grocery store or (not to forget the most popular one) online shopping.
And why not? They give us instant access to major purchasing power without the inconvenience of carrying large amounts of cash.
But, not to forget, this convenience also comes at a cost. Every so often we hear or read about credit card scams and wonder, “Could that possibly happen to me?”.
The answer is “YES!”
According to the Nilson Report, credit card fraud losses are projected to reach $35.67 billion by 2023.
And this billion-dollar figure doesn’t include expenses related to investigation costs, operations, call centres, chargeback management of fraudulent transactions and external recovery expenses borne by issuers and merchants. This alarming figure shows that businesses are suffering monumental losses due to credit card fraud. The share of total business revenue lost to credit card fraud increased 279% between 2013 and 2016.
In a nutshell, the explosion in credit card fraud is massive and it continues to be a major threat to all businesses - a threat that has potentially crippling financial implications. Although businesses have been investing enormously in credit card fraud protection tools and technology, there’s no surefire way to extirpate it completely. But there are things one can do to minimize it and this is the objective of this piece.
The dataset used is taken from Kaggle and contains credit card transactions made over a two-day period in September 2013 by European cardholders. It contains 284,807 transactions, of which 492 are frauds. The dataset is highly unbalanced: the positive class (frauds) accounts for just 0.172% of all transactions.
Each transaction has 30 features, all of which are numerical. For confidentiality reasons, the features V1, V2, ..., V28 are the result of a PCA (principal component analysis) transformation. The features ‘Amount’ and ‘Class’ have neither been transformed nor scaled. ‘Amount’ is the transaction amount, and ‘Class’ is the response variable that takes the value:
- 1 if Fraud
- 0 if Non Fraud
In Machine Learning, fraud detection is a classic example of a classification problem, where a class label is predicted for a given example of input data. In this dataset, the transactions are classified as ‘Fraud’ or ‘Non Fraud’.
One of the key challenges with a fraud detection dataset is the imbalanced data, as there is a severe skew in the class distribution. In our case, the majority of the transactions in the dataset are not fraudulent. This bias in the dataset can influence machine learning algorithms to ignore the minority class entirely and so it’s of utmost importance to first deal with the imbalanced data.
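The extent of the skew is easy to see by simply counting the labels. Since the Kaggle file isn’t bundled here, the sketch below builds a synthetic stand-in with the same shape and 0.172% positive rate described above; the generated values themselves are purely illustrative.

```python
# Synthetic stand-in for the credit card dataset: 284,807 rows,
# 30 numeric features, ~0.172% positive (fraud) class.
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=284_807,
    n_features=30,
    weights=[0.99828],   # leaves ~0.172% for the positive class
    flip_y=0,            # no label noise, keep the ratio exact
    random_state=42,
)
n_fraud = int(y.sum())
print(f"{n_fraud} frauds out of {len(y)} transactions "
      f"({n_fraud / len(y):.3%})")
```

A classifier trained naively on such data can score over 99.8% accuracy by predicting “non-fraud” for everything, which is exactly why the sampling techniques below matter.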
Dealing with Imbalanced Data
The most popular strategy for dealing with imbalanced data is to change the composition of the training set, which can be done using data sampling techniques. These techniques transform the training set to better balance the class distribution; machine learning algorithms can then be trained directly on the transformed dataset. In this piece, we’ll discuss the following techniques:
1. Undersampling (Random, Cluster Centroid, NearMiss)
2. Oversampling (Random, SMOTE)
3. Combined under- and oversampling (SMOTEENN, SMOTETomek)
All of the above techniques are applied by leveraging the imblearn library in Python.
1. Undersampling
Undersampling techniques delete examples that belong to the majority class from the training set in order to better balance the class distribution.
A.) Random Undersampling
This is the simplest undersampling technique that involves deleting randomly selected majority class examples from the training set. A caveat of this technique is that it removes examples without taking into consideration how important they might be in determining the decision boundary between the classes.
B.) Cluster Centroid Undersampling
Cluster Centroid creates a cluster of the majority class examples and replaces them with the centroid of that cluster. In short, we undersample the majority class by forming clusters and replacing them with cluster centroids.
C.) Near Miss Undersampling
It’s a collection of undersampling techniques that select examples based on the distance of majority class examples to minority class examples. There are three versions of the NearMiss technique: NearMiss-1, NearMiss-2 and NearMiss-3.
NearMiss-1 is the default version and the one used in the model. It selects examples from the majority class that have the smallest average distance to the 3 closest examples from the minority class.
2. Oversampling
Oversampling techniques duplicate examples in the minority class or add new copies of them to the training set. This is one of the most popular ways to deal with imbalanced data and is useful when we do not have a lot of data.
A.) Random Oversampling
This is the simplest oversampling technique that involves duplicating randomly selected minority class examples in the training set.
B.) SMOTE (Synthetic Minority Oversampling Technique)
SMOTE creates synthetic examples from the minority class to obtain a synthetically class-balanced training set. It uses a k-nearest-neighbours approach: for each minority class example, it selects its k nearest minority neighbours and creates synthetic samples along the line segments joining them in feature space.
3. Combined Under and Over Sampling
This technique uses both undersampling and oversampling methods together and can often result in better overall performance of a model.
A.) SMOTE and ENN (Edited Nearest Neighbours)
In this technique, SMOTE is used together with ENN. First, SMOTE is used to over-sample the data; then the ENN method removes instances of the majority class whose KNN prediction differs from their actual class. It’s often seen as a data cleaning method.
B.) SMOTE and Tomek Links
The SMOTETomek technique is also used as a data cleaning method. First, SMOTE is used to over-sample the data; then Tomek links are removed as an under-sampling step, but instead of removing only majority class examples, examples from both classes are removed.
To classify transactions as fraudulent or non-fraudulent, I used Logistic Regression and a Random Forest Classifier. The Logistic Regression results are:
When dealing with imbalanced data, the overall classification accuracy is often not an appropriate measure of performance. A trivial classifier that predicts every case as the majority class also achieves very high accuracy.
We, therefore, use metrics such as precision score, recall score and PR AUC (Precision-Recall Area under the Curve) to evaluate the performance of algorithms on imbalanced data.
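These metrics can all be computed with scikit-learn. The sketch below uses a synthetic imbalanced stand-in for the credit card data, so its numbers will differ from the results reported for the real dataset:

```python
# Precision, recall and PR AUC for a Random Forest on an imbalanced
# synthetic dataset (illustrative stand-in for the credit card data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30,
                           weights=[0.95], flip_y=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]   # scores used for the PR curve

print(f"precision: {precision_score(y_te, y_pred):.3f}")
print(f"recall:    {recall_score(y_te, y_pred):.3f}")
print(f"PR AUC:    {average_precision_score(y_te, y_prob):.3f}")
```

Note that PR AUC (computed here as average precision) has a baseline equal to the positive-class rate rather than 0.5, which is what makes it informative on skewed data.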
Based on the results shown above,
- The base model performed the best, with 87.77% precision, 65.83% recall and a PR AUC score of 0.768.
Let’s have a look at the Random Forest Classifier results:
- The base model here has 93% precision, 77.5% recall and a PR AUC score of 0.8526.
- With RandomOverSampler, there is a slight improvement in precision, a 2.5% improvement in recall and a slight improvement in the PR AUC score as well.
- SMOTE decreased the precision by approximately 5% but recall improved by approximately 7% with a PR AUC score of 0.86.
- With SMOTETomek the precision decreased by approximately 4.5% but recall improved by approximately 6% and the PR AUC score is 0.859.
For any classifier, there is always a trade-off between TPR (true positive rate) and TNR (true negative rate), and likewise between precision and recall. In some situations, we might know that we want to maximize either recall or precision at the expense of the other metric. Here, we could reduce the number of false negatives, but it comes at a price: an increase in false positives means that non-fraudulent transactions would be classified as fraudulent. That would increase the operational costs of cancelling the card, issuing a new one and mailing it to the client, and could potentially lead to losing clients.
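One practical way to act on this trade-off is to move the classifier’s decision threshold. The sketch below (on illustrative synthetic data, with a hypothetical threshold of 0.2) lowers the threshold to favour recall, at the likely cost of precision:

```python
# Lowering the decision threshold flags more transactions as fraud:
# recall can only go up, while precision typically goes down.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.95], flip_y=0,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]

for threshold in (0.5, 0.2):    # default cut-off, then a recall-friendly one
    pred = (prob >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_te, pred):.3f}, "
          f"recall={recall_score(y_te, pred):.3f}")
```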
- Imbalanced data has a severe skew in the class distribution, so it’s imperative to deal with it before modelling.
- Based on the results, Random Forest Classifier, along with different data sampling techniques, performed better than Logistic Regression.
- Each data sampling technique is unique, and its implementation depends entirely on the data and the situation. In our case, the RandomOverSampler, SMOTE and SMOTETomek techniques, along with the Random Forest Classifier, performed well.
- In machine learning, a trade-off between TPR (true positive rate) and TNR (true negative rate), and between precision and recall, is usual. Choosing the correct balance of precision and recall depends on the problem one is trying to solve.
- Rafael Pierre - https://towardsdatascience.com/detecting-financial-fraud-using-machine-learning-three-ways-of-winning-the-war-against-imbalanced-a03f8815cce9