Credit Card Fraud Detection: A Case Study for Handling Class Imbalance

Joyce Annie George · Published in Analytics Vidhya · 7 min read · Jul 10, 2020

We live in an era where daily transactions rely more on credit cards than on cash. As the volume of transactions grows, so does the number of fraudulent activities. In this article, we are using this Kaggle dataset for credit card fraud detection. Let us start the data analysis. According to the data description, there are 284807 rows and 31 columns in the data. Because the data is sensitive, 28 of the features are numerical values produced by a PCA transformation. ‘Time’ and ‘Amount’ are the remaining two features. ‘Time’ records, for each transaction, the seconds elapsed between that transaction and the first transaction in our data. ‘Class’ is the response variable, which has two possible values: 1 for fraudulent transactions and 0 for genuine ones. We don’t have any null values in the data. Let us dig further into our data by plotting the time and amount for genuine and fraudulent transactions.
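The loading-and-inspection steps can be sketched as follows. This is a minimal sketch: the tiny hand-made frame below stands in for the Kaggle CSV so the snippet is self-contained (in practice you would load the real file with `pd.read_csv`).

```python
import pandas as pd

# Tiny stand-in with the same structure as the Kaggle data
# (in practice: df = pd.read_csv("creditcard.csv")).
df = pd.DataFrame({
    "Time":   [0.0, 1.0, 2.0, 5.0],
    "V1":     [-1.36, 1.19, -1.36, -0.97],
    "Amount": [149.62, 2.69, 378.66, 123.50],
    "Class":  [0, 0, 1, 0],
})

print(df.shape)                    # (rows, columns)
print(df.isnull().sum().sum())     # total count of missing values
print(df["Class"].value_counts())  # genuine (0) vs fraudulent (1)
```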

We have information for only two days, so we can’t draw any conclusions about the times at which fraudulent activity is more frequent. Now, let us analyse our response variable.

plotting amounts of genuine and fraudulent transactions

The fraudulent transactions involve very small amounts compared to the genuine ones, and the amounts are heavily skewed. Let us normalize the amount.
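A minimal sketch of the normalization step, using scikit-learn's `StandardScaler` on a few stand-in amounts in place of the real ‘Amount’ column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in values for the 'Amount' column.
amounts = np.array([[149.62], [2.69], [378.66], [123.50], [69.99]])

scaler = StandardScaler()
normalized = scaler.fit_transform(amounts)

print(normalized.mean())  # ~0 after standardization
print(normalized.std())   # ~1 after standardization
```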

It is noticed that the fraudulent cases are too few as compared to genuine ones. Let us visualize the counts of genuine and fraudulent cases.

count plot of class
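The count plot boils down to counting the two values of ‘Class’. A sketch with stand-in labels (the real data has 284315 genuine and 492 fraudulent transactions):

```python
import pandas as pd

# Stand-in labels: 0 = genuine, 1 = fraudulent.
labels = pd.Series([0] * 8 + [1] * 2, name="Class")

counts = labels.value_counts()
print(counts)
# A bar chart of these counts is what seaborn's count plot draws,
# e.g. sns.countplot(x="Class", data=df)
```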

It is clear from the plot that our data is highly imbalanced: there are very few fraudulent transactions. This is the class imbalance problem, a condition frequently found in disciplines like fraud prediction, intrusion detection and spam detection. Let us take a deeper look at this problem. We have 284315 genuine transactions and 492 fraudulent transactions. If we apply a classification algorithm, it learns the genuine cases well because there are many data points, but because the fraudulent cases are so few, it doesn’t learn much about them. The final model can therefore classify genuine cases almost perfectly, yet it will predict a large number of fraudulent cases as genuine. For a financial institution, it is very important to detect all the fraudulent cases. Let us take a look at various options for dealing with imbalanced data.

  • Data Collection

We know that the underlying cause of the problem is that we don’t have sufficient data for fraudulent cases. A good option is to improve the data collection methods so that we can obtain better quality data. Unfortunately, data collection won’t help us in fraud detection because, in real life, fraudulent transactions will always be just a small fraction of the genuine ones.

  • Performance Metrics

The performance of the model is usually evaluated using accuracy.

Now, let us perform classification and analyze the performance of the model.

We classified our data and analyzed the performance using accuracy as well as a confusion matrix. Our model has a high accuracy of 99.8%. But it is clear from the confusion matrix that the high accuracy comes from correctly predicting a large number of genuine transactions. Only 27 fraudulent transactions were correctly classified, while 69 fraudulent cases were incorrectly classified as genuine. This would result in a huge loss to the credit card company. So we can’t rely on accuracy when dealing with imbalanced data. This situation is called the accuracy paradox.

“It is the case where your accuracy measures tell the story that you have excellent accuracy (such as 90%), but the accuracy is only reflecting the underlying class distribution.” — Machine Learning Mastery

Let us take a look at some popular performance evaluation metrics. A confusion matrix is a table layout that allows visualization of the performance of an algorithm. It gives us four important counts: true positives, true negatives, false positives and false negatives. In our case, the false negatives are a good indication of the performance of the model: they are the fraudulent transactions which are incorrectly predicted as genuine. If we have a large number of false negatives, we should either optimize or change the model. We can compute several other metrics from the confusion matrix. Let us take a look at precision, recall and F1 score. Precision is a measure of the exactness of our model: it is the number of true positives divided by the sum of true positives and false positives. In other words, it is the number of correct positive predictions divided by the total number of positive predictions. A low precision indicates a large number of false positives.

Recall measures the completeness of our model: it is the number of true positives divided by the sum of true positives and false negatives. In other words, it is the number of correct positive predictions divided by the number of positive class values in the test data. A low recall indicates many false negatives.

F1 score is the harmonic mean of precision and recall. When we have perfect values for precision and recall, we get perfect F1 score value of 1.
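The three metrics can be computed directly from the confusion-matrix counts. Plain Python, using the article's pre-SMOTE fraud counts (27 true positives, 69 false negatives) and a made-up false-positive count of 5 for illustration:

```python
# Counts from a confusion matrix (fp is hypothetical).
tp, fp, fn = 27, 5, 69

precision = tp / (tp + fp)                                 # exactness
recall    = tp / (tp + fn)                                 # completeness
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 3))  # 0.844
print(round(recall, 3))     # 0.281
print(round(f1, 3))         # 0.422
```

Note how the many false negatives drag recall, and with it the F1 score, far below the precision.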

It is very important for a credit card company to detect all the fraudulent cases. Even if the algorithm predicts some genuine cases as fraudulent, the company can use additional security measures to confirm that a transaction is genuine. But a large number of false positives will lead to a decline in customer satisfaction. So we have to make sure that our model has high recall and a reasonably good precision. A high F1 score ensures that we have a balance between precision and recall.

  • Resampling Techniques

Sampling is a good approach to deal with imbalanced data. We change the dataset in such a way that the model gets balanced data as input. There are two types of sampling — undersampling and oversampling.

The idea of undersampling is to delete some instances of the majority class to obtain class equivalence. If you have a very large dataset, undersampling is a good approach to try. We have to make sure that we don’t lose any relevant information by undersampling.
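Random undersampling can be sketched in a few lines of NumPy (stand-in labels; real pipelines would subsample the feature rows the same way):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in labels: 90 majority (0) and 10 minority (1) samples.
y = np.array([0] * 90 + [1] * 10)
majority_idx = np.where(y == 0)[0]
minority_idx = np.where(y == 1)[0]

# Randomly keep only as many majority samples as there are minority ones.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = np.concatenate([kept_majority, minority_idx])

print(len(balanced_idx))             # 20
print((y[balanced_idx] == 1).sum())  # 10
```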

Oversampling uses copies of the under-represented class to obtain balanced data. In our case, we add copies of the fraudulent cases so that we get equal numbers of genuine and fraudulent cases, and this new dataset is given as input to the classification algorithm. If you are working with a small dataset, oversampling is a good step to apply.
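Random oversampling is the mirror image: sample minority indices with replacement until the classes are equal. Again a NumPy sketch on stand-in labels:

```python
import numpy as np

rng = np.random.default_rng(0)

y = np.array([0] * 90 + [1] * 10)  # stand-in: 90 genuine, 10 fraudulent
minority_idx = np.where(y == 1)[0]

# Duplicate minority samples (with replacement) until classes match.
extra = rng.choice(minority_idx, size=90 - len(minority_idx), replace=True)
balanced_y = np.concatenate([y, y[extra]])

print((balanced_y == 0).sum())  # 90
print((balanced_y == 1).sum())  # 90
```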

Synthetic Minority Oversampling Technique (SMOTE) is a very popular oversampling technique. SMOTE generates new synthetic minority class data which is used to train the model. It selects minority class data points that are close in the feature space, draws a line between the selected points, and generates a new sample at a point along this line. This procedure is repeated to create the required number of synthetic minority samples. The approach is effective because the new synthetic samples are relatively close in feature space to existing minority class data. A general downside is that the synthetic samples are created without considering the majority class, possibly resulting in ambiguous samples where the classes overlap strongly. Make sure that you split the data before sampling. Now, let us apply the SMOTE technique and analyze the performance.
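The interpolation step described above can be sketched in NumPy. This shows only the core idea on a handful of made-up 2-D minority points; in practice you would use `SMOTE` from the imbalanced-learn library, fitted on the training split only.

```python
import numpy as np

rng = np.random.default_rng(0)

# A few 2-D minority-class points (stand-ins).
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.2]])

def smote_sample(points, rng):
    """Generate one synthetic point on the line between a random
    minority point and its nearest minority neighbour."""
    i = rng.integers(len(points))
    p = points[i]
    dists = np.linalg.norm(points - p, axis=1)
    dists[i] = np.inf                 # exclude the point itself
    neighbour = points[np.argmin(dists)]
    t = rng.random()                  # random position along the line
    return p + t * (neighbour - p)

synthetic = np.array([smote_sample(minority, rng) for _ in range(3)])
print(synthetic.shape)  # (3, 2)
```

Each synthetic point lies between two existing minority points, which is why the new samples stay close to the minority region of the feature space.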

It is clear that the LinearSVC performs better with the SMOTE technique. The false negatives have dropped to 27, which increases the recall of the model.

  • Tree based Algorithms

We can try different algorithms and compare their performance. Tree-based classifiers generally work well with imbalanced data. As tree-based models work by learning a hierarchy of if/else questions, they learn from all the classes. Random forest is an ensemble learning method which uses several decision trees. Each decision tree in the random forest performs the classification, and the final output is determined by voting. The individual trees are largely uncorrelated, and the errors of one tree are corrected by the others through voting. Now, let us apply a random forest classifier and evaluate the performance.
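A sketch of this step with scikit-learn's `RandomForestClassifier`, again on a synthetic imbalanced dataset standing in for the credit card data (the article's exact hyperparameters are not shown):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Synthetic imbalanced stand-in for the credit card data.
X, y = make_classification(n_samples=20000, n_features=10,
                           weights=[0.995], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

print(confusion_matrix(y_test, forest.predict(X_test)))
```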

It is noticed from the confusion matrix that the model works well with this imbalanced data. We have obtained pretty good precision, recall and F1 scores.

Conclusion

In this article, we discussed different approaches to handling an imbalanced dataset. Data collection methods help to balance the data in only some domains. We can use several other metrics, like the ROC curve and the precision-recall curve, for performance evaluation. SMOTE and tree-based algorithms are very useful in dealing with imbalanced data. We can also try sampling ratios other than 1:1. There are variants of SMOTE which may give better performance, and we can also try ADASYN for sampling the data.
