Handling Imbalanced Datasets: Predicting Credit Card Fraud

Kelechi
Coinmonks
7 min read · Jul 26, 2018


Credit Cards (source)

As data scientists, we come across various types of datasets. One such type is the imbalanced dataset. Imbalanced datasets pose a particular problem for beginner data scientists, as most machine learning algorithms handle them poorly.

What is an imbalanced dataset?

An imbalanced dataset is one where the number of observations belonging to one group or class is significantly higher than those belonging to the other classes.

Imbalance (Source)

This occurs in cases such as credit card fraud detection, where there might be only 1,000 fraud cases in over a million transactions, representing a meager 0.1% of the dataset. The identification of rare diseases is another common case of dealing with imbalanced data.

Challenges of dealing with Imbalanced datasets

Machine learning algorithms are likely to produce faulty classifiers when they are trained on imbalanced datasets. These algorithms tend to show a bias for the majority class, treating the minority class as noise in the dataset. Many standard classifiers, such as Logistic Regression, Naive Bayes and Decision Trees, are therefore prone to misclassifying the minority class.

There is also the problem of vanity metrics when measuring the performance of algorithms on imbalanced datasets. If we have an imbalanced dataset containing 1% of a minority class and 99% of a majority class, an algorithm can simply predict every case as belonging to the majority class. Such an algorithm will yield an accuracy of 99%, which seems impressive, but is it really? The minority class is totally ignored, and this can prove expensive in some classification problems, such as credit card fraud detection, where misclassification can cost individuals and businesses a lot of money.

Methods of Handling Imbalanced Datasets

There are two major methods of handling imbalanced datasets; both are discussed below, with a short code sketch after the lists.

1. Oversampling: This method involves reducing or eliminating the imbalance in the dataset by replicating or creating new observations of the minority class. There are four common types of oversampling techniques:

  • Random Oversampling: In this case, new instances of the minority class are created by randomly replicating existing samples in order to increase the minority count in the dataset. This method, however, can lead to overfitting as it simply duplicates already existing instances of the minority class.
  • Cluster-based Oversampling: Here, the K-means algorithm is applied separately to the majority and minority instances to identify the clusters in the dataset. Each cluster is then oversampled so that all clusters have the same number of observations. Again, there is a risk of overfitting the model with this method.
  • Synthetic Oversampling: This method helps to avoid overfitting. A small subset of the minority class is chosen and synthetic examples based on this subset are created to balance the overall dataset. This adds new information to the dataset and increases the overall number of observations.
  • Modified Synthetic Oversampling: This works just like the synthetic oversampling method, but it also accounts for the noise and inherent distribution of the minority class.

2. Undersampling: In this method, the imbalance in the dataset is reduced by shrinking the majority class. One popular type is explained below:

  • Random Undersampling: In this case, existing instances of the majority class are randomly eliminated. This technique is not the best because it can eliminate information or data points that could be useful for the classification algorithm.
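
To make the two approaches concrete, here is a minimal sketch using the imbalanced-learn library on a synthetic dataset; the class weights, sample counts and random_state values are assumptions chosen purely for illustration.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Toy dataset with a roughly 99:1 class imbalance (values chosen for illustration)
X, y = make_classification(n_samples=10_000, n_features=10,
                           weights=[0.99, 0.01], random_state=42)
print("Original class counts:", Counter(y))

# Oversampling: replicate minority observations until the classes are balanced
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print("After random oversampling:", Counter(y_over))

# Undersampling: randomly drop majority observations until the classes are balanced
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("After random undersampling:", Counter(y_under))
```

Note that, in practice, resampling is applied only to the training split so that the test set still reflects the real-world class distribution.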

Measuring algorithm performance on an Imbalanced Dataset

Performance (Source)

Since we have established that accuracy is a poor measure of performance on imbalanced datasets, how then do we measure performance? To answer that, let us first define some terms:

False Positive (FP): This is used to describe positive predictions that are actually negative in real life. An example would be predicting that a credit card transaction is fraudulent when in truth it is not.

True Positive (TP): This is used to describe positive predictions that are actually positive in real life.

False Negative (FN): This is used to describe negative predictions that are actually positive in real life. An example would be predicting that a credit card transaction is not fraudulent when in truth it is.

True Negative (TN): This is used to describe negative predictions that are actually negative in real life.

Now that we have that out of the way, let us look at possible performance measures.

Some Measures and indicators of model performance include —

Precision: This is an indicator of the number of items correctly identified as positive out of total items identified as positive. Formula is given as: TP/(TP+FP)

Recall / Sensitivity / True Positive Rate (TPR): This is an indicator of the number of items correctly identified as positive out of total actual positives. Formula is given as: TP/(TP+FN)

Precision can be seen as “how useful the results are”, and recall as “how complete the results are”.

Specificity / True Negative Rate (TNR): This is an indicator of the number of items correctly identified as negative out of total actual negatives. Formula is given as: TN/(TN+FP)

F1 Score: This is a performance score that combines both precision and recall. It is a harmonic mean of these two variables. Formula is given as: 2*Precision*Recall/(Precision + Recall)
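
For instance, with a hypothetical model that produces TP = 80, FP = 20 and FN = 40: precision = 80/(80+20) = 0.80, recall = 80/(80+40) ≈ 0.67, and F1 = 2*0.80*0.67/(0.80+0.67) ≈ 0.73.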

Matthews Correlation Coefficient (MCC) Score: This is a performance score that takes into account true and false positives, as well as true and false negatives. This score gives a good evaluation for imbalanced datasets. Formula is given as: MCC = (TP*TN - FP*FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))

Area Under ROC Curve (ROC AUC): An ROC curve (receiver operating characteristic curve) is a graph that shows the performance of a classification algorithm at all classification thresholds. This curve plots the TPR against the False Positive Rate (FPR), where FPR = FP/(FP+TN).

Area Under Precision-Recall Curve (PR AUC): A Precision-Recall curve is a graph that shows performance by plotting precision against recall at different classification thresholds.

Confusion Matrix: This is a tabular representation of the TP, FP, FN and TN counts. A generalized confusion matrix is given below:

                     Predicted Positive    Predicted Negative
Actual Positive      TP                    FN
Actual Negative      FP                    TN

Generally, the best measures for imbalanced datasets are the Matthews Correlation Coefficient Score, the F1 Score and the Area Under the Precision-Recall Curve.
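
To make these metrics concrete, here is a minimal sketch of how they might be computed with scikit-learn; the labels, predictions and scores below are made up purely for illustration.

```python
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score,
                             average_precision_score)

# Hypothetical labels, hard predictions and predicted probabilities
# for a tiny, imbalanced test set (1 = fraud, the minority class)
y_true   = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred   = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.4]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
print("MCC:      ", matthews_corrcoef(y_true, y_pred))

# Threshold-independent measures use the predicted scores, not the hard labels
print("ROC AUC:  ", roc_auc_score(y_true, y_scores))
print("PR AUC:   ", average_precision_score(y_true, y_scores))
```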

Still, it is important that you understand the business implications of the model when choosing a performance evaluation metric. In real life, there are trade-offs between false positives and false negatives. For instance, if you are classifying credit card fraud, you may prefer a false positive to a false negative. That is, you would rather flag a transaction as fraudulent when it actually is not than miss a transaction that really is fraudulent. Bear in mind that excessive false positives can lead to a bad customer experience, so you may have to consider that as well. On the other hand, if you are building a movie recommendation model, you may prefer false negatives to false positives. That is, you would rather fail to recommend a movie a person would actually enjoy than recommend a movie they will not like.

Essentially, the business implications of these metrics need to be taken into account before settling on one.

Practical Example: Credit Card Fraud Prediction

This practical example utilizes an anonymized credit card transactions dataset.

The ratio of non-fraudulent transactions to fraudulent transactions was a whopping 99.83% to 0.17%.

Imbalance in the dataset
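
A quick way to check this split is shown below; as an assumption for illustration, the data is loaded from a creditcard.csv file into a pandas DataFrame with a Class column in which 1 marks a fraudulent transaction.

```python
import pandas as pd

# Assumption: transactions live in creditcard.csv with a "Class" column (1 = fraud)
df = pd.read_csv("creditcard.csv")

print(df["Class"].value_counts())                      # raw counts per class
print(df["Class"].value_counts(normalize=True) * 100)  # percentage split
```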

I proceeded to modelling, selecting Logistic Regression as my algorithm, and trained it on the imbalanced dataset. A sketch of this baseline step and my results are shown below:
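
This is roughly what the baseline step might look like; the stratified train/test split, max_iter setting and column names are assumptions for illustration, continuing from the DataFrame in the previous sketch.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score,
                             matthews_corrcoef, confusion_matrix)

# Features and target (assumes the "Class" column marks fraud)
X = df.drop(columns="Class")
y = df["Class"]

# Hold out a test set that keeps the original class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Baseline: train on the imbalanced data as-is
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("MCC:     ", matthews_corrcoef(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```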

Result for imbalanced data
confusion matrix

Notice that the accuracy is very high, despite the presence of false positives and false negatives. However, the F1 score and MCC score tell a better story about the actual model performance.


I then proceeded to balance the dataset, using SMOTE (the Synthetic Minority Oversampling Technique).

Applying the SMOTE algorithm
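
A minimal sketch of this step and the retraining described below, using the SMOTE implementation from the imbalanced-learn library and continuing from the previous sketch, might look like this:

```python
from imblearn.over_sampling import SMOTE

# Balance only the training split; the test split keeps the real-world imbalance
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

# Retrain on the balanced data and evaluate against the untouched test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train_bal, y_train_bal)
y_pred_bal = model.predict(X_test)

print("F1 score:", f1_score(y_test, y_pred_bal))
print("MCC:     ", matthews_corrcoef(y_test, y_pred_bal))
print(confusion_matrix(y_test, y_pred_bal))
```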

I then trained the algorithm with the balanced dataset, tested it on the original (imbalanced) test data and obtained these results:

SMOTE result

As we can see, this model is much better than the former one.


There you have it! I hope you enjoyed reading this half as much as I enjoyed writing it. If you did, please clap for the post and share it. You should also follow me and check out my other posts.
