Imbalanced classification in Fraud Detection

Alex Ravaglia
Data Reply IT | DataTech
10 min read · May 31, 2022
Fraud detection

Introduction

Working with an imbalanced dataset can be a problem for some classic machine learning approaches, yet there are situations where the natural distribution of the data between the classes is simply not equal. This is typical of fraud detection problems. Working with the following dataset from Kaggle, we can see that there are a lot of legitimate transactions and only 0.17% of the total data are frauds. In classification problems with an imbalanced data distribution, the interest usually lies in identifying the rare class, so the machine learning model’s performance should be measured mainly on its predictions for the minority class. We will see how to choose the right metrics for validating the model and what the Precision-Recall trade-off means in practice in this setting. Finally, we will look at how to deal with the skewed class distribution and compare the results obtained with different machine learning algorithms.
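As a quick check, a few lines of pandas are enough to measure how skewed the data is. This is just a sketch: it assumes the standard Kaggle credit-card fraud CSV (creditcard.csv) with a binary Class column, where 1 marks a fraud.

```python
# Minimal sketch: inspect the class imbalance of the Kaggle credit-card fraud
# dataset. Assumes a local "creditcard.csv" with a binary "Class" column.
import pandas as pd

df = pd.read_csv("creditcard.csv")
counts = df["Class"].value_counts()
fraud_ratio = counts[1] / len(df)

print(counts)                              # legitimate vs. fraudulent rows
print(f"Fraud ratio: {fraud_ratio:.4%}")   # roughly 0.17% for this dataset
```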

If you are looking for an analysis of fraud detection based on user behavior, take a look here.

Why model explainability matters

In fraud detection, not only the final prediction is important, but also the reason that leads the system to that conclusion. An automatic fraud classification system could be used by banks to detect suspicious situations. After an alarm, bank employees will probably analyze the reason that led the system to flag a hypothetical fraud, and an expert will take the final decision. If a model cannot explain its reasoning, it could never be used by the authorities or for legal purposes.

Of course, explainability is not always this critical, but having a model that can explain the reason behind a classification can be very valuable.

Which approaches?

Frauds can be detected with different approaches. Various studies propose solutions based on neural/deep networks or on classic machine learning algorithms. We are going to use the latter. Well.. ok, why?

Tools like neural networks are usually black-box models. It can be very hard to extract the rules, or the reason why a transaction has been classified in a certain way by the network, although this information can be very important in fraud detection, as mentioned before. A trained tree-based model like a Decision Tree, on the other hand, gives us an idea of the rules inferred by the system for each classification label (Fraud or Non-Fraud): you only have to follow the path from the root to the leaf of the classification tree to understand the classification criteria (see the sketch after the list below). Finally, neural networks are very powerful, but sometimes such tools are not necessary: a classic machine learning approach can be enough to obtain good results. We are going to train and test the following methodologies:

  • Decision Tree
  • Random Forest
  • XGBoost
  • Logistic Regression
  • SVM
  • k-NN
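
As a sketch of the explainability idea above, scikit-learn can print a trained tree as nested if/else rules. Here X_train and y_train are assumed to be an already prepared training set (a pandas DataFrame of features and its labels).

```python
# Sketch: read back the rules learned by a Decision Tree with export_text.
# X_train is assumed to be a pandas DataFrame, y_train the fraud labels.
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)

# Prints the tree as nested conditions, i.e. the path from the root to each
# leaf that leads to a Fraud / Non-Fraud decision.
print(export_text(tree, feature_names=list(X_train.columns)))
```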

Skewed data distributions

A classifier may or may not be robust to skewness. In an imbalanced dataset, the minority class can act like an outlier for the statistical model, and outliers adversely affect a model’s performance. Some models are robust enough to handle this, but in general it limits which models can be used. The challenge of working with imbalanced datasets is that most machine learning techniques will largely ignore the minority class, which leads to poor performance on it, even though that is typically the result we care about most. A good practice is therefore to balance the data, i.e. to remove the skewness. Training on an imbalanced dataset risks overfitting the dominating class while neglecting the minority classes that are low in amount.

Balancing data distribution

This problem can be addressed by transforming the dataset into a new one with an equal number of elements per class. There are two possible solutions:

  • Undersampling the majority class.
  • Oversampling the minority class by creating synthetic samples.

A third, hybrid way is to apply both methods: over- and under-sampling.

Why do these strategies work? With undersampling, the assumption is that some records of the majority class are redundant; the limit of this methodology is that by removing data we may also remove relevant information. With oversampling, synthetic examples of the minority class are created until the distribution of the data among the classes is equal. This balances the class distribution but does not provide any additional information to the model, and the risk is overfitting the minority class.

Balancing algorithms

Random under/over-sampling: the easiest (and least effective) way to balance the data is to remove random records of the majority class and/or duplicate random examples of the minority class.
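
With the imbalanced-learn library, both variants take only a couple of lines. In this sketch, X_train and y_train are assumed to be the training split only (see the section on the train-test split below).

```python
# Sketch: random resampling with imbalanced-learn.
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

under = RandomUnderSampler(random_state=42)
X_under, y_under = under.fit_resample(X_train, y_train)   # drop majority rows

over = RandomOverSampler(random_state=42)
X_over, y_over = over.fit_resample(X_train, y_train)       # duplicate minority rows

print(Counter(y_train), Counter(y_under), Counter(y_over))
```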

SMOTE: it works by selecting examples that are close in the feature space. A random example from the minority class is chosen first, then k of its nearest neighbors are found (typically k=5). A synthetic example is created at a randomly selected point on the line between the chosen example and each of the k neighbors. The approach is effective because the new synthetic minority examples are created relatively close in feature space to existing minority examples.

SMOTE
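
A minimal SMOTE sketch with imbalanced-learn looks like this (k_neighbors=5 is the default, matching the typical value mentioned above; X_train and y_train are again assumed to be the training split):

```python
# Sketch: oversample the minority class with SMOTE.
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
```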

ADASYN: it is very similar to SMOTE, but it decides adaptively how many synthetic samples to generate for each minority example, creating more of them around the examples that are harder to learn (those surrounded by majority-class neighbors). As a result, the synthetic points are not all spread along the lines between minority samples in the same way, and the generated data has a little more variance.

SMOTE TOMEK: SMOTE is applied to oversample the minority class. To reduce outliers and obtain better-defined class regions, Tomek link removal is then applied to both the minority and the majority class. A Tomek link is a pair of records from different classes that are each other’s nearest neighbors; removing such pairs is a data-cleaning and outlier-removal step that sharpens the boundary between the classes.

(a) original data set; (b) oversampled data set; (c) Tomek links identification; (d) borderline and noise examples removal.

SMOTE ENN: like SMOTE TOMEK, but the cleaning and outlier-removal step is more aggressive. For each example, ENN (Edited Nearest Neighbours) uses a k-nearest-neighbor check to decide whether the record should be removed. If the cleaning process is too aggressive, we can lose important information.
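
For completeness, here is a sketch showing that ADASYN, SMOTE TOMEK, and SMOTE ENN all expose the same fit_resample interface in imbalanced-learn:

```python
# Sketch: the remaining resamplers discussed above, all from imbalanced-learn.
from imblearn.over_sampling import ADASYN
from imblearn.combine import SMOTETomek, SMOTEENN

X_ada, y_ada = ADASYN(random_state=42).fit_resample(X_train, y_train)
X_st,  y_st  = SMOTETomek(random_state=42).fit_resample(X_train, y_train)
X_se,  y_se  = SMOTEENN(random_state=42).fit_resample(X_train, y_train)
```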

Evaluation metrics

In machine learning there are different ways to evaluate a classification model. The most commonly used metric is accuracy: it tells us how many instances are correctly classified out of the total. In scenarios with skewed data distributions and high imbalance between the classes, metrics such as Precision, Recall, F-score, and AUC are used instead.

Why is accuracy not the right metric? If we assume that our imbalanced dataset contains 95 normal transactions and only 5 frauds, a dummy model that always predicts Non-Fraud will obtain 95% accuracy without discovering a single fraud. This is why accuracy is not used for model evaluation when working with an imbalanced dataset.
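
The trap is easy to reproduce with a toy example and scikit-learn’s DummyClassifier:

```python
# Toy illustration of the accuracy trap: always predicting "Non-Fraud" on a
# 95/5 split scores 95% accuracy while catching zero frauds.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.array([0] * 95 + [1] * 5)   # 95 legitimate transactions, 5 frauds
X = np.zeros((100, 1))             # features are irrelevant for this example

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

print(accuracy_score(y, y_pred))   # 0.95
print(recall_score(y, y_pred))     # 0.0 -> not a single fraud detected
```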

Precision and Recall are both important and tell us different things about the predictions. Here they are computed with respect to the Fraud class, because it is the one we are most interested in.

Precision. When a transaction is classified as Fraud, precision tells us how confident we can be that it really is a fraud: it is the fraction of predicted frauds that are real frauds. Here is the precision formula:

Precision = TP / (TP + FP)

Recall. Recall tells us how good the system is at catching all the frauds: it is the fraction of real frauds that the system manages to detect. Here is the recall formula:

Recall = TP / (TP + FN)
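
In practice, both metrics (and the F-score) can be computed with scikit-learn, treating Fraud (label 1) as the positive class. In this sketch, y_test and y_pred are assumed to be the true labels and the model’s predictions:

```python
# Sketch: precision, recall and F1 with Fraud (label 1) as the positive class.
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred, pos_label=1)  # TP / (TP + FP)
recall    = recall_score(y_test, y_pred, pos_label=1)     # TP / (TP + FN)
f1        = f1_score(y_test, y_pred, pos_label=1)
print(precision, recall, f1)
```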

The implications of the Precision-Recall trade-off in fraud detection

The perfect prediction model would detect all the frauds in the dataset and be sure that each predicted fraud is a real fraud. Translated into our metrics, this means having high Recall (detecting all the frauds) and high Precision (as few errors as possible on the transactions predicted as fraud). Precision and recall are a trade-off: while you try to optimize precision, recall will decrease, and vice versa.

What does it mean to optimize Precision? We want to be very confident that a transaction predicted as fraud is a real fraud, so we tune the parameters of our classifier so that it predicts as fraud only the transactions with a high probability of being real frauds. The drawback is that some fraudulent transactions, perhaps the ones most similar to non-frauds, won’t be detected. The system will miss some fraudulent transactions, classifying them as normal, and this can be a problem.

What does it mean to optimize Recall? We want to be very confident that we are detecting all the frauds, so we tune the parameters of our classifier so that it detects as many frauds as possible. The drawback is that, in trying to detect all the frauds, normal transactions that look similar to frauds will also be classified as fraud.

Conclusion about the trade-off

Precision and recall are a trade-off: preferring precision optimization, we detect fewer frauds and the recall decreases; preferring recall optimization, we detect more frauds, but some of the flagged transactions will be misclassified and the precision decreases. You can prefer one metric over the other, but it is important to always keep both in mind. For example, a model that reaches 99% precision and 15% recall is not a good predictor: even with a very high precision rate, it misses 85% of the frauds!

Neither metric is more important than the other in absolute terms; it depends on the specific use case our model is built for. In some cases it is more important to detect all the frauds, and false alarms don’t matter (recall optimization); in other cases we prefer fewer alarms but with high precision.
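
One practical way to act on this trade-off is to move the decision threshold instead of retraining the model. The sketch below assumes a fitted classifier called model that exposes predict_proba, and that at least one threshold reaches the target recall:

```python
# Sketch: explore the precision-recall trade-off by scanning the decision
# threshold on the predicted fraud probabilities.
import numpy as np
from sklearn.metrics import precision_recall_curve

proba = model.predict_proba(X_test)[:, 1]   # predicted probability of Fraud
precisions, recalls, thresholds = precision_recall_curve(y_test, proba)

# Example policy: among thresholds that keep recall at 90% or more,
# pick the one with the best precision.
candidates = np.where(recalls[:-1] >= 0.90)[0]
best = candidates[np.argmax(precisions[candidates])]
print("threshold:", thresholds[best],
      "precision:", precisions[best],
      "recall:", recalls[best])
```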

Training and test split with an imbalanced dataset

Using balancing algorithms (e.g. SMOTE), we modify the original dataset: we create new examples and remove others. If this manipulation is not done properly, it can lead to data leakage, and we want to avoid that scenario. The train-test split must therefore be done before any data manipulation (like balancing). The test set will then remain highly imbalanced, and this is fine, because we want to test our model as if we were in a real scenario. We balance the data only for the training process.
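
In code, the leakage-free setup looks like this sketch: split first with stratification, then balance only the training portion. X and y are assumed to be the full feature matrix and labels.

```python
# Sketch: split before balancing, so the test set keeps its natural skew.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Resampling is fit on the training data only; X_test / y_test stay untouched.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```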

Results

Before showing the results, we have to understand what type of information we are looking for. Answering a few simple questions can help us understand in which direction the analysis should continue.

  • Is the balancing method effective? Always, or only in some situations?
  • Which is the best classification method, and how does it perform?

Balancing effect

The models have been trained with different machine learning algorithms: DECISION TREE, RANDOM FOREST, XGBOOST, SVM, LOGISTIC REGRESSION, K-NN.

The balancing process has been done with all the methodologies previously shown: RANDOM, SMOTE, ADASYN, SMOTE ENN, SMOTE TOMEK, plus the original IMBALANCED data as a baseline.

We are going to show all the possible combinations. If you are not interested in going deep into the analysis of each trial, you can jump to the discussion of the results, where the most interesting findings and conclusions are presented.
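
For reference, the whole experiment grid can be sketched as a double loop over samplers and classifiers. The hyperparameters below are illustrative defaults, not necessarily the exact configuration behind the tables that follow:

```python
# Sketch of the experiment grid: every classifier is trained on every
# resampled version of the training data (plus the original imbalanced one)
# and scored on the untouched, imbalanced test set.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.combine import SMOTETomek, SMOTEENN
from sklearn.metrics import precision_score, recall_score

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "k-NN": KNeighborsClassifier(),
}
samplers = {
    "IMBALANCED": None,                              # baseline: no resampling
    "RANDOM": RandomOverSampler(random_state=42),    # one possible "RANDOM" choice
    "SMOTE": SMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
    "SMOTE TOMEK": SMOTETomek(random_state=42),
    "SMOTE ENN": SMOTEENN(random_state=42),
}

for s_name, sampler in samplers.items():
    if sampler is None:
        X_tr, y_tr = X_train, y_train
    else:
        X_tr, y_tr = sampler.fit_resample(X_train, y_train)
    for c_name, clf in classifiers.items():
        clf.fit(X_tr, y_tr)
        y_pred = clf.predict(X_test)
        print(s_name, c_name,
              "precision:", precision_score(y_test, y_pred),
              "recall:", recall_score(y_test, y_pred))
```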

RANDOM FOREST

DECISION TREE

XGBOOST

SVM

LOGISTIC REGRESSION

K-NN

Balancing effectiveness

Some of the approaches reach good results with both imbalanced and balanced data. Decision Tree and Logistic Regression already perform well on imbalanced data. SVM benefits from the balancing process, gaining up to +54% precision.

An interesting result is the one obtained with Random Forest, XGBoost, and k-NN: the balancing process leads to an increase in recall, and this seems reasonable. During balancing, we increase the number of frauds the model sees during training; in this way it has more information about what a fraud looks like and recognizes more of them.

Best results

We are going to show the most interesting results reached with the different approaches.

The best results are obtained with Random Forest and XGBoost. k-NN also gives interesting results, but we prefer the other models because the prediction model could be used in a real-time system, and k-NN is not the best solution due to the computational cost of each classification. Logistic Regression and SVM show two different behaviors: the first performs well on precision, the second on recall. The Decision Tree is very good with imbalanced data, and its results come close to those of the best classifiers.

The best classifiers are Random Forest and XGBoost. As discussed previously, the balancing process can be an important tool if we prefer to optimize one metric instead of another: it leads to an increase in recall, an important result if we want to detect more frauds.

User Behavior Analysis

One important analysis that can be done is based on the user’s behavior and habits. I show the power of these methodologies in: Fraud Detection Modeling User Behavior; take a look if you are interested.

I hope that the article was interesting, thanks for reading, and if you liked it, leave a clap 👏
