Fraud prediction: a challenge for machine learning algorithms

Zinonas Zinonos
Published in Analytics Vidhya
Jul 23, 2020

Introduction

Fraud is a billion-dollar business that expands rapidly year by year, and thousands of people fall victim to it. Fraud always involves a false statement, misrepresentation, or deceitful conduct. Common varieties of fraud offenses include identity theft, insurance fraud, credit/debit card fraud, and mail fraud.

The PwC global economic crime survey of 2018 (PwC, 2018) found that about half of the 7,200 surveyed enterprises had already experienced fraud of some kind. This is an increase compared to the PwC survey conducted in 2016 (PwC, 2016), in which slightly more than a third of organizations surveyed had experienced economic crime. According to Infosecurity Magazine (Infosecurity, 2018), fraud cost the global economy almost 4 trillion USD in 2018. For a number of businesses, losses due to fraud reached more than 10% of their total spending. Such massive losses push companies to search for more efficient solutions to detect and prevent fraud.

Keeping fraud under control is a headache, particularly for the banking and commerce industries. The number of transactions has increased with the variety of online payment options, such as credit and debit cards, smartphone payment applications, online payment sites, and so on. At the same time, cybercriminals have become adept at finding and exploiting ambiguities and inadequacies in online payment systems. As a result, it is getting tougher for businesses to authenticate and approve transactions.

Machine learning is the most promising technological weapon in the fight against financial fraud. The adoption of machine learning (ML) has accelerated with the growing processing power available for big data and with advancements in statistical modeling. Thanks to the rapid development of machine learning science, data scientists can mitigate this problem and make prompt and accurate predictions. Automated fraud detection systems powered by machine learning can help businesses reduce fraud substantially.

In this study, machine learning models are benchmarked on a challenging large-scale dataset of online transactions. The models chosen for this study come from the SAP Predictive Analytics business intelligence software (SAP, 2020), the Scikit-Learn open-source libraries (Sklearn, 2020), and the Microsoft LightGBM framework (LGBM, 2020). A dataset of real online purchases, including fraudulent transactions, is used to train all the machine learning algorithms.

Objective

The study addresses a binary classification problem: the target variable is a binary attribute characterizing an online transaction as fraudulent or non-fraudulent. Before training and evaluating the machine learning models, the data are preprocessed and analyzed. Preprocessing is necessary to address typical problems with raw input data, such as rejecting empty data rows, fixing missing values, one-hot encoding categorical variables, detecting anomalies, and scaling and normalizing features. The clean data are then piped to the machine learning algorithms for training and evaluation. The results are interpreted through the Area Under the Receiver Operating Characteristic curve (AUC-ROC), which is constructed by plotting the true positive (TP) rate as a function of the false positive (FP) rate for varying classification thresholds. This metric is finally used to compare the performance of the algorithms involved in the study.
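As a rough illustration of this preprocessing stage, here is a minimal Scikit-Learn sketch; the column names are placeholders drawn from the public dataset description, and the exact steps used in the study may differ:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column subsets; the real data have 35 categorical
# and 165 numeric features (see the Data section below).
categorical_cols = ["ProductCD", "card4"]
numeric_cols = ["TransactionAmt", "C1"]

preprocess = ColumnTransformer([
    # Numeric branch: fill missing values, then scale the features.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # Categorical branch: fill missing values, then one-hot encode.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

df = pd.read_csv("train_transaction.csv").dropna(how="all")  # reject empty rows
X = preprocess.fit_transform(df)
y = df["isFraud"].values  # binary fraud / non-fraud label
```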

Data

The dataset is provided by Vesta Corporation (Vesta, 2020) and describes real-world, anonymized e-commerce transactions (IEEE-CIS, 2019). Each of the 590,540 data entries contains a wide range of features, from device type to product features, offering the opportunity to engineer new features and improve training results. More specifically, each transaction is characterized by 200 features carrying information about the transaction and the paying party (see Table 1). The data columns are subdivided into 35 categorical and 165 numeric. Additionally, the data provide temporal information about each transaction, expressed as a time difference from a fixed reference datetime.

Table 1 The main feature categories of the fraud transactions dataset.

The fraud rate in the dataset is measured to be 3.53%, indicating that the classification problem is highly imbalanced. Emphasis is therefore placed on avoiding a trained model that merely assumes the majority (non-fraud) class rather than detecting actual signatures of fraud.

Random Forests

Random Forests are trained for generating predictions. They are a type of supervised learning algorithm based on an ensemble learning method, in which different types of algorithms, or multiple versions of the same algorithm, are combined to form a more powerful predictive model. The Random Forest algorithm can be trained for classification or regression tasks by constructing a multitude of decision trees at training time, resulting in a forest of trees. While deep decision trees may suffer from overfitting, Random Forests mitigate it by growing each decision tree on a random subset of the data and features. At the cost of being computationally slow, Random Forests are suitable for imbalanced classification problems because they are considerably less prone to large variance effects. Random Forests are also remarkably stable: the algorithm is not affected much by new data introduced into the dataset, and it copes well with missing or outlying values. Moreover, they work well with unscaled and unnormalized data. The algorithm configuration is summarized in Table 2, which shows the main training hyperparameters of the Random Forests algorithm in the Scikit-Learn suite.

Table 2 Tuning parameters used for training the Random Forests algorithm
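Since Table 2 is not reproduced in text form here, the following sketch instantiates the classifier with purely illustrative values rather than the study's actual settings:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative hyperparameters only; the study's tuned values are
# listed in Table 2 and may differ.
rf = RandomForestClassifier(
    n_estimators=500,         # number of trees in the forest
    max_depth=12,             # limit tree depth to curb overfitting
    class_weight="balanced",  # compensate for the 3.53% fraud rate
    n_jobs=-1,                # use all available CPU cores
    random_state=42,
)
```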

Before training the Random Forests, a meta-transformer algorithm is executed to select features based on importance weights. This tree-specific feature importance measure computes the average reduction in impurity across all trees in the forest due to each feature. Features with an importance score of less than 0.002, i.e. those that tend to split nodes far from the tree root, are considered unimportant and are removed from the training, since they may hinder the process of obtaining an efficient and accurate predictive model.
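A minimal sketch of this selection step with Scikit-Learn's SelectFromModel meta-transformer, using the 0.002 threshold quoted above; rf and the arrays X, y refer to the earlier sketches:

```python
from sklearn.feature_selection import SelectFromModel

# Keep only features whose mean impurity-based importance across
# the forest reaches at least 0.002; the rest are dropped.
selector = SelectFromModel(rf, threshold=0.002)
X_reduced = selector.fit_transform(X, y)
print(f"kept {X_reduced.shape[1]} of {X.shape[1]} features")
```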

Figure 1 Stratified cross-validation method using k folds for the data partitioning. In each iteration, k-1 folds are used to train the model and the remaining "holdout" fold to validate it. The folds are built so as to preserve the percentage of samples of each class. For every performance quantity Q, for example accuracy or AUC-ROC, the average is calculated from the values obtained during the k-fold iterative procedure.

The stratified k-fold cross-validator of Scikit-Learn is used to split the labeled data into train and test sets. The dataset is split into k=5 consecutive stratified folds, each preserving the class proportions of the full dataset. Each fold is then used once as a validation set, while the k-1 remaining folds form the training set. After training, the ROC and AUC-ROC are computed. The larger the area under the curve, the better the model, since the ideal classifier maximizes the TP rate while minimizing the FP rate. Figure 2 presents the ROC for the 5-fold training session of the Random Forests. The mean AUC-ROC and accuracy were found to be 0.886 and 0.974, respectively. A careful comparison of the training and validation performance showed no sign of model overtraining.

Figure 2 Receiver Operating Characteristic curves for the 5-fold training of the Random Forests algorithm. The ROC of each training session is drawn in a different color, as indicated in the legend along with the corresponding area under the curve (AUC-ROC). The red dashed diagonal shows the ROC curve of a random, unskilled predictor with an AUC-ROC of 0.5.
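A minimal sketch of the 5-fold stratified evaluation, assuming X and y are NumPy arrays holding the preprocessed features and labels:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
aucs = []
for train_idx, val_idx in skf.split(X, y):
    rf.fit(X[train_idx], y[train_idx])
    # Score with the predicted fraud probability, not the hard label.
    proba = rf.predict_proba(X[val_idx])[:, 1]
    aucs.append(roc_auc_score(y[val_idx], proba))
print(f"mean AUC-ROC: {np.mean(aucs):.3f}")
```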

Precision-recall is an important measure of prediction success when the classes of a binary problem are highly imbalanced. Precision is a measure of result relevancy, whereas recall is a measure of how many truly relevant results are returned. Figure 3 presents the precision-recall curve for the current fraud classification problem, showing the trade-off between the two quantities for different classification thresholds. High precision relates to a low FP rate, and high recall indicates a low false negative (FN) rate. High scores for both measures show that the classifier returns accurate results (high precision) as well as most of the truly positive results (high recall). Another useful quantity is the F1 score, defined as the harmonic mean of precision and recall. The F1 score is represented by the contour lines in Figure 3, and it reaches its best value at 1 and its worst at 0.

Figure 3 Precision-Recall curve for the 5-fold trained Random Forests predictor for the fraud binary classification. The contour lines represent the F1-score.
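For completeness, a short sketch of how the precision-recall trade-off and the F1 score can be computed with Scikit-Learn; proba and val_idx are the validation-set outputs from the previous cross-validation sketch:

```python
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y[val_idx], proba)
# F1 is the harmonic mean of precision and recall at each threshold;
# the last precision/recall entries have no threshold, so slice them off.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = f1.argmax()
print(f"best F1 = {f1[best]:.3f} at threshold = {thresholds[best]:.3f}")
```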

The most impactful feature in the training is found to be C1, which records one of the characteristics of the transaction process initiated by the online customer. Figure 4 shows the distribution of the different values that C1 can acquire. Fraudulent transactions tend to be associated with values located in the rightmost bins of this transactional feature.

Figure 4 Example of a C-feature distribution. The C1 variable captures one of the characteristics of the transaction process and ranks first during the training session of the Random Forests classifier. The left-hand axis counts the actual online transactions, while the continuous line, associated with the right-hand axis, shows the percentage of fraudulent transactions in each bin. Transactions with high C1 values are more susceptible to online deception.

SAP Predictive Analytics

SAP Predictive Analytics is a business intelligence software suite from SAP designed to enable enterprises to analyze large datasets and predict future outcomes and behaviors. It allows users without a deep background in machine learning and data processing to analyze data via fully managed data-mining programs. Users can perform complex data analysis and visualize their models to graphically represent features of the data. The suite can additionally automate data preparation, predictive modeling, and scoring, helping business users analyze data without manual intervention. SAP Predictive Analytics can automatically apply inference to unseen data and deploy the trained model for operation in a real-time business setup.

Once data are fed into the Data Manager of SAP Predictive Analytics, the Modeler iteratively trains a classification or regression algorithm based on two main metrics:

KI (Predictive Power), defined as the ratio of the area between the trained model curve (built from the validation or estimation samples) and that of a random model, to the area between a perfect model curve and that of a random model.

KR (Prediction Confidence), defined as the ratio of the area between the estimation curve and the validation curve, to the area between a perfect model curve and that of a random model.
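Writing A(·,·) for the area enclosed between two lift curves (a notation introduced here for compactness, not SAP's own), the two definitions above read:

```latex
\mathrm{KI} = \frac{A(\text{trained model},\ \text{random model})}
                   {A(\text{perfect model},\ \text{random model})},
\qquad
\mathrm{KR} = \frac{A(\text{estimation curve},\ \text{validation curve})}
                   {A(\text{perfect model},\ \text{random model})}
```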

These quantities are diagrammatically explained in Figure 5.

Figure 5 Definition of the Predictive Power (KI) and Prediction Confidence (KR) in a cross-validation graph during model training performed by the Modeler service of SAP Predictive Analytics.

Through an iterative procedure, the Modeler of SAP Predictive Analytics considers the trained model optimal when the sum of KI and KR is maximized. For the task of detecting fraudulent transactions, the Modeler performed five iterations to obtain the best-trained model, as illustrated in Figure 6.

Figure 6 Development of the trained model through an iterative procedure based on the KI and KR quantities. The Modeler of SAP Predictive Analytics selects the trained model for which the sum KI+KR is maximal.

Recursive feature elimination is applied automatically to select features by considering smaller and smaller sets of features. In each iteration, the Modeler removes the features that rank last in terms of importance, ending up with 51 features overall. After training the model, the AUC-ROC, shown in Figure 7, and the accuracy were found to be 0.875 and 0.961, respectively.

Figure 7 ROC curves for the training (orange) and validation (blue) sets used in the training with SAP Predictive Analytics. The prediction of an unskilled classifier is represented by the red diagonal line, while the green shaded area represents the ROC of a perfect ("wizard") model.
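SAP performs this elimination internally, but the same idea can be sketched with Scikit-Learn's RFE class; this is an open-source analogue under assumed settings, not SAP's actual algorithm:

```python
from sklearn.feature_selection import RFE

# Recursively drop the lowest-ranked features until 51 remain,
# mirroring the 51 features the SAP Modeler ended up with.
rfe = RFE(estimator=rf, n_features_to_select=51, step=5)
X_selected = rfe.fit_transform(X, y)
```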

The most important feature in the training was found to be V54, which records one of the characteristics of the transaction process initiated by the online customer.

LightGBM

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and computationally efficient while providing accurate predictions. It handles large-scale data with low memory usage and supports CPU- and GPU-based acceleration for model learning. The hyperparameters, along with the best values used to train the boosted decision tree model, are summarized in Table 3.

Table 3 Most important hyperparameters that optimize the LightGBM classification algorithm.
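As with Table 2, the tuned values are not reproduced in the text, so the following sketch uses illustrative hyperparameters only:

```python
import lightgbm as lgb

# Illustrative hyperparameters only; the study's tuned values are
# listed in Table 3 and may differ.
clf = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=255,          # main capacity knob for leaf-wise trees
    min_child_samples=100,   # guard against overfitting rare patterns
    scale_pos_weight=27.0,   # roughly (1 - 0.0353) / 0.0353 for the imbalance
    n_jobs=4,                # 4 CPU cores, as noted in Table 5
)
```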

Standardizing the input training dataset by subtracting the mean and scaling to unit variance is a common requirement for many machine learning estimators. If a feature has a variance considerably higher than the others, it might dominate the objective function of the learning algorithm and prevent the estimator from learning from the other features as expected. Therefore, for the boosted decision trees, the individual features are passed through the standardization process. The learning algorithm can also converge faster when features are close to being normally distributed.
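A minimal sketch of this step with Scikit-Learn's StandardScaler, assuming X_train and X_test hold the numeric features:

```python
from sklearn.preprocessing import StandardScaler

# Subtract the per-feature mean and divide by the per-feature standard
# deviation, so every feature has zero mean and unit variance.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
# The test set must be transformed with the training-set statistics.
X_test_std = scaler.transform(X_test)
```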

As in the case of the Random Forests, a 5-fold cross-validation resampling procedure is used to evaluate the skill of LightGBM on unseen data. Feature selection is also performed, in a similar way to the Random Forest training: features with an importance score of less than 10 are considered unimportant and are removed from the training. Cross-validation is also a powerful preventative measure against overfitting. After partitioning the data into k=5 folds, the AUC-ROC was found to be similar in the k-1 training folds and the holdout fold. The mean AUC-ROC and accuracy after training the LightGBM trees were found to be 0.979 and 0.985, respectively.

The most important feature in the training was found to be Card1, which holds one characteristic of the credit card used by the customer during the online purchase order. Figure 8 shows the distribution of the different values that this numeric feature can record. Fraudulent transactions tend to be associated with more central values of this credit card feature.

Figure 8 Distributions of the Card1 feature for fraud and non-fraud. Card1 is found to be the most impactful feature in the LightGBM algorithm training. Distributions are normalized to unit area.

Variable Ranking

Variable ranking is the process of ordering the features by the value of a scoring function, which usually measures feature relevance and is specific to each algorithm. Table 4 compares the top ten variables ranked by importance during the training of the Random Forests with Scikit-Learn (left), SAP Predictive Analytics (center), and LightGBM (right). The C-, D-, and V-features describe the online purchase order. The ID categorical features contain information about the identity of the customer, shipped to the online purchasing system during the transaction; for example, they can record the operating system, browser, screen resolution, login connection, etc. The Card-features hold information about the credit card used for the online purchase, Amt represents the amount spent, and "Addr1" and "Pemail" contain information about the geolocation and the e-mail domain of the customer, respectively.

Table 4 The top 10 variables ranked by importance during the training of the Random Forest, SAP Predictive Analytics, and LightGBM. The feature importance calculation is specific to the algorithm. The feature importance for Random Forests is multiplied by a factor of one hundred.
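As a rough sketch of how such algorithm-specific importance scores can be extracted from the fitted models of the earlier sketches; feature_names is an assumed list of column names in training order, and both models are assumed to be fitted on the same feature matrix:

```python
import pandas as pd

ranking = pd.DataFrame({
    "feature": feature_names,
    "rf_importance": rf.feature_importances_,     # impurity-based, sums to 1
    "lgbm_importance": clf.feature_importances_,  # split counts by default
}).sort_values("lgbm_importance", ascending=False)
print(ranking.head(10))
```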

Performance Comparison

Table 5 summarizes the main results obtained with the three predictive programs, in ascending order of performance as defined by the AUC-ROC and accuracy. Random Forests achieve shorter training sessions and higher accuracy and AUC-ROC than SAP Predictive Analytics. At the cost of high computational time, LightGBM outperforms both SAP Predictive Analytics and Random Forests, providing significantly better results with higher accuracy and AUC-ROC. Another critical measure in binary classification problems is the FP rate: a false positive occurs when a test checking a single condition wrongly returns an affirmative (positive) decision. Transactions flagged as FP can cause a cardholder's transaction to be denied or an account to be locked down, creating additional inconvenience for clients and organizations. For all the algorithms involved in this study, the FP rate was measured to lie in the range of 2-3.5%.

Table 5 Performance summary of the three predictive software suites used for the prediction of fraudulent online transactions. LightGBM also allows for CPU- and GPU-based acceleration for scalable model training (4 CPU cores were specified here).
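A short sketch of how the FP rate can be measured from a confusion matrix, reusing the validation predictions from the cross-validation sketch; the 0.5 decision threshold is an assumption:

```python
from sklearn.metrics import confusion_matrix

# Hard predictions at an assumed 0.5 threshold; proba and val_idx
# come from the cross-validation sketch above.
y_pred = (proba >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y[val_idx], y_pred).ravel()
fp_rate = fp / (fp + tn)  # fraction of legitimate transactions flagged
print(f"FP rate: {fp_rate:.3%}")
```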

Conclusions

Transaction fraud is one of the most common types of online financial deception and typically occurs when a stolen payment card or stolen data is used to place an unauthorized purchase order. Retailers, merchants, banks, and other businesses have growing concerns about predicting and preventing fraudulent transactions. Effective protection relies on accurately distinguishing between legitimate customers and fraudsters in real time using machine learning techniques. Accuracy and promptness of prediction are therefore key to efficiently preventing fraudulent transactions. Predicting fraud events is a challenging task for machine learning algorithms, as they must solve a highly imbalanced classification problem.

Figure 9 Performance summary of the machine learning algorithms benchmarked on the fraud dataset challenge.

In this study, different algorithms were trained and tested on a benchmark dataset describing online fraud transactions. As this is a supervised machine learning task, the data were labeled, making it possible to measure the performance of each algorithm. Among the algorithms tested, LightGBM showed the best performance in predicting fraudulent transactions in terms of AUC-ROC and accuracy.

Credits

Credits to Christos Konstantinidis for the technical support on this work.

References

IEEE-CIS. (2019). IEEE-CIS Fraud Detection. Retrieved from https://www.kaggle.com/c/ieee-fraud-detection

Infosecurity. (2018). Infosecurity Magazine. Retrieved from https://www.infosecurity-magazine.com/news/global-fraud-hits-32-trillion/

LGBM. (2020). Microsoft LightGBM. Retrieved from https://www.microsoft.com/en-us/research/project/lightgbm/

PwC. (2016). Global Economic Crime Survey 2016. Retrieved from https://www.pwc.com.au/publications/cyber-global-economic-crime-survey-2016.html

PwC. (2018). Pulling fraud out of the shadows: Global Economic Crime and Fraud Survey 2018. Retrieved from https://www.pwc.com/gx/en/forensics/global-economic-crime-and-fraud-survey-2018.pdf

SAP. (2020). SAP Predictive Analytics. Retrieved from https://www.sap.com/products/predictive-analytics.html

Sklearn. (2020). Scikit-learn: Machine Learning in Python. Retrieved from https://scikit-learn.org/

Vesta. (2020). Vesta Corporation. Retrieved from https://trustvesta.com/
