Credit card fraud detection

I recently found a publicly available dataset of credit card transactions on Kaggle, so I thought it might be interesting to play with it a bit and see how good a classification result I can get. In this article I’d like to share how to overcome imbalance in target classes, how to choose the right metrics for your model, as well as the results I came up with.

Looking at the data

As the first step we’ll load our dataset into a Pandas data frame and print out some basic statistics about the individual columns. This tells us whether we’ll need to normalize the values before feeding them into our model. I’ve also plotted a correlation matrix to see if there is any correlation between the features, as well as between the feature columns and the target column.
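If you want to follow along, a minimal sketch of this step could look like the following (I’m assuming the Kaggle CSV was saved as creditcard.csv and that the target column is called Class, as in the original dataset):

import pandas as pd
import matplotlib.pyplot as plt

# Basic per-column statistics (mean, std, min/max, quartiles).
df = pd.read_csv("creditcard.csv")
print(df.describe())

# Correlation matrix over all columns, including the Class target.
plt.matshow(df.corr())
plt.colorbar()
plt.show()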

Target counts & correlation matrix

We can see that our target classes are hugely imbalanced. In fact, there are 284,315 legitimate transactions and only 492 (0.17%) fraudulent transactions in the dataset. The correlation matrix tells us that there is little to no correlation between the features, and if we look at the correlation between the individual features and the target column we can see that we can get rid of the Time column, as it doesn’t provide any information.
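Counting the classes and dropping the Time column is a one-liner each (again assuming the column names from the Kaggle dataset):

print(df["Class"].value_counts())   # 284315 legitimate vs 492 fraudulent
df = df.drop(columns=["Time"])      # Time carries no useful signal here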

Dimensionality reduction

The next thing I wanted to check is whether there is a separation between fraudulent and legitimate transactions. Given that our feature vector has 29 dimensions, it would be hard to visualize directly. We can use an algorithm called Principal Component Analysis (PCA) to reduce the dimensionality of our dataset down to 2 dimensions so we can plot it.
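A rough sketch of that projection (reusing the df loaded above):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the 29 feature columns down to 2 components for plotting.
features = df.drop(columns=["Class"])
components = PCA(n_components=2).fit_transform(features)

# Colour the points by class to see whether fraud separates visually.
plt.scatter(components[:, 0], components[:, 1], c=df["Class"], s=2)
plt.show()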

First two principal components obtained through PCA

The class imbalance issue

Let’s now have a quick look at why a huge imbalance between target classes is an issue. We have 284,807 samples in our dataset and 492 of them are fraudulent. If we simply predicted “not fraud” for every sample we’d get a 99.83% classification accuracy, but our model would be useless. To get around that we’ll use confusion matrices, receiver operating characteristic (ROC) curves and precision-recall curves as our performance metrics. Our objective will be to maximize recall and trade away a bit of precision, as it’s better to flag some legitimate transactions as “fraud” than to miss a lot of the fraudulent ones.
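For reference, all three metrics are available in scikit-learn; the sketch below assumes a fitted model and the test_X/test_y split introduced in the next section:

from sklearn.metrics import (confusion_matrix, precision_recall_curve,
                             recall_score, roc_curve)

# Hard predictions drive the confusion matrix and the recall score...
predictions = model.predict(test_X)
print(confusion_matrix(test_y, predictions))
print("recall:", recall_score(test_y, predictions))

# ...while the ROC and precision-recall curves need predicted probabilities.
scores = model.predict_proba(test_X)[:, 1]
fpr, tpr, _ = roc_curve(test_y, scores)
precision, recall, _ = precision_recall_curve(test_y, scores)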

Splitting the data

Before we start training the model we need to split our dataset into a training and a test portion. We’ll use the training portion to train the model and then evaluate it on the test portion to see how it performs on samples it hasn’t seen before. It’s also important to use stratified sampling, which means that the probability of seeing a fraudulent transaction is approximately the same in both the training data and the test data. Stratified sampling also ensures that our model metrics are as close as possible to what we’d see on the whole population.
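With scikit-learn this boils down to a single call; the 30% test size and the fixed random_state below are my choices, not necessarily the ones used in the notebook:

from sklearn.model_selection import train_test_split

target = "Class"
feature_columns = [c for c in df.columns if c != target]

# stratify keeps the fraud ratio roughly identical in both portions.
train_X, test_X, train_y, test_y = train_test_split(
    df[feature_columns], df[target],
    test_size=0.3, stratify=df[target], random_state=42)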

Training the model

Logistic regression

Let’s first train a simple LogisticRegression model using the default parameters so we have a baseline score against which we can compare the following models.

from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(n_jobs=4)
lr_model.fit(train_X, train_y)
Baseline logistic regression — recall: 0.67

The imbalance in our dataset causes the model to generalize very well on legitimate transactions, but it doesn’t learn very much about the fraudulent ones. One way to address this is to use a weighted loss function, which takes the frequency of each class in the training data and adjusts the gradient updates inversely proportionally to those frequencies, i.e. it down-weights updates coming from the larger class and boosts updates coming from the smaller class.

Weighted loss function

scikit-learn provides a class_weight parameter which we can set to balanced to make the model use this weighted loss function. The following figure depicts the model metrics I got after using the weighted loss function. As we can see, the recall score is now much better, while we lost a bit of precision.

Logistic regression with balanced class weights — recall: 0.95
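A minimal sketch of the balanced variant (the variable name is mine):

from sklearn.linear_model import LogisticRegression

# "balanced" weights each class inversely proportionally to its
# frequency in the training data.
weighted_lr_model = LogisticRegression(class_weight="balanced", n_jobs=4)
weighted_lr_model.fit(train_X, train_y)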

Undersampling

Another technique we can use to overcome class imbalance is to undersample the larger class or oversample the smaller class. Undersampling essentially means that we remove most of the legitimate transactions from our data so that we end up with approximately the same number of fraudulent and legitimate transactions in the train/test split.

import pandas as pd
from sklearn.model_selection import train_test_split

def undersample(data, n=1):
    # Shuffle rows (rather than permuting every column independently,
    # which would scramble features across rows) and keep n legitimate
    # transactions for every fraudulent one.
    positive_samples = data[data[target] == 1].sample(frac=1)
    negative_samples = data[data[target] == 0].sample(frac=1).head(positive_samples.shape[0] * n)
    undersampled_data = pd.concat([positive_samples, negative_samples])
    return train_test_split(undersampled_data[feature_columns], undersampled_data[target], test_size=0.3)
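Called with n=1 it produces the 50/50 split used below; the variable names here are mine:

from sklearn.linear_model import LogisticRegression

# 1:1 ratio of fraudulent to legitimate transactions.
train_X_u, test_X_u, train_y_u, test_y_u = undersample(df, n=1)
undersampled_lr_model = LogisticRegression(n_jobs=4)
undersampled_lr_model.fit(train_X_u, train_y_u)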

The following two figures show the model metrics after training on undersampled data with 50% fraudulent and 50% legitimate transactions in the training data.

Logistic regression trained on undersampled data and evaluated on undersampled data — recall: 0.99

The first one shows the model evaluated on the undersampled test data, with a recall score of 0.99. This is great, but there is an issue: the data no longer represents what we’d see in the real world. The next figure shows the model evaluated on our original test data, and we can clearly see that while the recall score drops to 0.95 (which is what we got with the weighted loss function), the precision score also goes down. This is nicely captured by the F1 score, which is 0.98 for the model with the weighted loss function and 0.86 for the model trained on undersampled data.

Logistic regression trained on undersampled data and evaluated on original test data — recall: 0.95

Support vector machine

I’ve also tried to train a support vector machine model, but only on the undersampled data, as SVMs take a long time to train on large datasets. The results I got are slightly worse than those obtained from the logistic regression trained on undersampled data.

SVM trained on undersampled data and evaluated on original test data — recall: 0.93
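A sketch of that experiment; the article doesn’t list the SVM hyperparameters, so scikit-learn’s defaults are assumed here:

from sklearn.svm import SVC

# Trained on the undersampled split from the previous section.
svm_model = SVC()
svm_model.fit(train_X_u, train_y_u)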

Random forest

The last model I’ve tried is a random forest with a PCA step that reduces the dimensionality down to 16 dimensions. The PCA part is not necessary, but it gives the model fewer, more representative features and therefore reduces the chance of overfitting (in practice it improved recall by 0.03). Even though the recall was only 0.89, this model has the fewest false positives and the best F1 score of all the models mentioned in this article.

PCA + random forest confusion matrix — recall 0.89
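One way to wire the two steps together is a scikit-learn Pipeline; apart from the 16 PCA components, the settings below are assumptions rather than the exact configuration from the notebook:

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# PCA down to 16 components feeding a random forest.
rf_model = Pipeline([
    ("pca", PCA(n_components=16)),
    ("forest", RandomForestClassifier(n_jobs=4)),
])
rf_model.fit(train_X, train_y)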

Conclusion

Many real world datasets have imbalanced target classes, similar to the one I presented in this article. Being aware of this fact and choosing the right metrics when evaluating the model on test data helps us better tune our models, as well as be more confident that they do what we think they do. I reached the best F1 score using the random forest model with PCA, which can be further optimized for better recall by decreasing the decision threshold. The best “out-of-the-box” model in terms of recall seems to be logistic regression with a weighted loss function, which reached a recall of 0.95 and an overall F1 score of 0.92. I should also point out that our dataset’s features are the first 28 principal components obtained through PCA of the original dataset, so the model we trained is kind of useless for real world data (unless you know what the original features were :) ).
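For completeness, lowering the decision threshold could be sketched like this (the 0.3 cut-off is purely illustrative, not a tuned value):

# Predict "fraud" whenever the estimated probability exceeds the threshold.
probabilities = rf_model.predict_proba(test_X)[:, 1]
predictions = (probabilities > 0.3).astype(int)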


The full code can be found in this IPython notebook. If you run it on your own you might get different scores, as they depend on how the data gets randomized and split, but they shouldn’t deviate by more than 3%. I hope you find this article useful, and feel free to leave a comment below.