
Accelerating Credit Card Fraud Detection

Improving Machine Learning Performance with Intel-Optimized Software

Benjamin Consolvo
4 min read · Dec 5, 2022


Has your credit card company ever sent you a warning about a potentially fraudulent transaction? Credit card fraud cost consumers $5.8 billion in 2021, up 70% from 2020 (source: CNBC). Better systems to detect fraudulent transactions are sorely needed.

Credit card fraud … poses a significant threat... In terms of substantial financial losses, trust and credibility, this is a concerning issue to both banks and customers alike. Due to the steep increase in banking frauds, it is the need of the hour to detect these fraudulent transactions... Machine learning (ML) can play a vital role in detecting fraudulent transactions and considering the scale at which these transactions happen, an ML approach is a commonly implemented solution. The automation pipeline needs to be accurate, offer fast inference times and have low memory usage (cf. GitHub repo).

This article explores credit card fraud detection using clustering and gradient boosting modeling.

Finding the Needle in the Haystack

Fraud detection is a needle-in-a-haystack problem. There are only a few instances of fraud among many legitimate purchases. The credit card dataset that I’m using has 284,807 samples (i.e., credit card transactions) with 30 features (Time, V1-V28, Amount, Class) per sample (Table 1). Class-0 represents a legitimate credit card transaction. Class-1 represents a fraudulent transaction.

Table 1. A small subset of credit card fraud data. The full dataset can be found here on Kaggle.

The pandas DataFrame method df.Class.value_counts() shows that there are only 492 fraudulent transactions, so only 0.17% of the samples can teach the ML model what a bad transaction looks like. I could use anomaly detection techniques for this problem, but I want to see whether clustering can produce a subsample with a higher concentration of fraudulent transactions that can then be used to train an ordinary binary classifier. In this case, I used a more balanced subsample of about 1,000 rows from the original dataset to train my model. The training data contains >30% fraudulent transactions.
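As a rough sketch of that class-balance check, assuming the Kaggle CSV has been downloaded locally as creditcard.csv:

import pandas as pd

# Load the Kaggle credit card fraud dataset (file path is an assumption)
df = pd.read_csv("creditcard.csv")

# Count legitimate (Class 0) vs. fraudulent (Class 1) transactions
print(df.Class.value_counts())

# Fraud as a fraction of all transactions (roughly 0.17%)
print(df.Class.value_counts(normalize=True))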

I used the DBSCAN function in scikit-learn to do density-based spatial clustering for downsampling. Unlike k-means clustering, which assumes convex clusters, DBSCAN clusters can be any shape. Before clustering, I applied a 70/30 train/test split to the data: 199,364 samples for training and 85,443 for testing. I used the stratify argument of train_test_split to ensure a reasonable balance of legitimate to fraudulent samples.
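A minimal sketch of that stratified split, assuming the DataFrame df from above:

from sklearn.model_selection import train_test_split

# Features and labels; Class is the fraud indicator
X = df.drop(columns=["Class"])
y = df["Class"]

# 70/30 split; stratify=y keeps the fraud ratio roughly equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)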

Two important features (V14 and V16) were selected for DBSCAN clustering (Figure 1). The particular reasons why V14 and V16 were selected are not important here; each dataset is unique and has some features that are more relevant to the target variable than others. In some cases, running principal component analysis is helpful in determining which features have a higher impact on the target variable.
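A sketch of the clustering step on those two features; the eps and min_samples values below are placeholders, not the settings used in the reference kit:

import pandas as pd
from sklearn.cluster import DBSCAN

# Cluster the training data on the two selected features
features = X_train[["V14", "V16"]].to_numpy()
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(features)

# Label -1 marks noise; the other labels are cluster IDs that can be
# inspected to pick a subsample with a higher share of fraud
print(pd.Series(labels).value_counts())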

Figure 1. DBSCAN cluster plot of V14 and V16 features

The Intel-optimized version of DBSCAN in the Intel Extension for Scikit-learn did the clustering about 3x faster than the stock version. The stock version took 23 seconds for the 200K training samples. The Intel-optimized version took seven seconds. Both were running on a 3.4 GHz Intel Xeon Gold 6128 processor with 12 physical cores.
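Switching to the optimized implementation is typically a one-line patch applied before the estimator is imported; a minimal sketch:

from sklearnex import patch_sklearn
patch_sklearn()  # swap in Intel-optimized implementations where available

# Import DBSCAN after patching so the accelerated version is picked up
from sklearn.cluster import DBSCAN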

Training and Inference

I trained a LightGBM (LGBM) model on the clustered credit card data. LightGBM is a “gradient boosting framework that uses tree-based learning algorithms.” The daal4py package, powered by the Intel oneAPI Data Analytics Library (oneDAL), was used to improve inference performance. Replacing the original LGBM prediction code
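(a sketch along these lines, assuming a LightGBM classifier trained on the balanced subsample; X_train_bal and y_train_bal are assumed names for that subsample)

import lightgbm as lgb

# Train LightGBM on the smaller, more balanced subsample
clf = lgb.LGBMClassifier()
clf.fit(X_train_bal, y_train_bal)

# Stock LightGBM inference on the test features
y_pred = clf.predict(X_test)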

with a daal4py model
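(again a sketch; daal4py converts the trained booster and runs prediction with the oneDAL kernel)

import daal4py as d4p

# Convert the trained LightGBM booster to a daal4py gradient boosting model
daal_model = d4p.get_gbt_model_from_lightgbm(clf.booster_)

# Run binary-classification inference with oneDAL
daal_pred = d4p.gbt_classification_prediction(nClasses=2) \
    .compute(X_test.to_numpy(), daal_model) \
    .prediction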

gives an inference speedup of around 4x. The stock LGBM model took around 1.3 seconds and the daal4py model took around 0.3 seconds on 854K input samples.

The confusion matrix for the model naively trained using the complete dataset is shown in Figure 2. The precision and recall are:

Precision = TP / (TP + FP) = 1050 / (1050 + 160) = 0.87
Recall = TP / (TP + FN) = 1050 / (1050 + 430) = 0.71.
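These numbers can be cross-checked directly from the predictions; a small sketch using scikit-learn's metrics, assuming the true labels y_test and predictions y_pred from the steps above:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Confusion matrix plus the same precision and recall computed above
print(confusion_matrix(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))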

Figure 2. Confusion matrix for inference on 854K test samples for the model trained on the full dataset

Figure 3 shows a confusion matrix for our model trained on the smaller but more balanced clustered dataset. It has better precision and recall even though the training set is much smaller:

Precision = 1080 / (1080 + 80) = 0.93
Recall = 1080 / (1080 + 400) = 0.73

Figure 3. Confusion matrix for the model trained on the clustered dataset, running inference on 854K test samples

Conclusions

ML can help identify patterns in transaction data that indicate whether a credit card transaction is legitimate or fraudulent. The faster and more accurately we can do this, the more money and aggravation we can save consumers and credit card companies. The Intel-optimized version of DBSCAN significantly improves clustering performance, which is necessary to intelligently balance the training data. A smaller, more balanced training set decreases the time to train a model while also improving prediction accuracy. Even small improvements in precision and recall can greatly increase the number of transactions that are correctly classified.

I encourage you to check out Intel’s other AI Tools and Framework optimizations and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio. The code for this article is available at the Intel AI Reference Kit GitHub repository here.

