Photo by Freepik

Detecting credit card fraud with machine learning

Classifying and evaluating credit card transactions with logistic regression and decision tree

Rafael Bastos
7 min read · Jul 2, 2020


This article will address the problem of credit card fraud, a major concern for banks and customers, and the process of detecting fraudulent operations through machine learning techniques.

The scam usually occurs when someone obtains your credit or debit card numbers from unsecured websites or via an identity theft scheme and uses them to fraudulently obtain money or property. Due to its recurrence and the harm it may cause to both individuals and financial institutions, it is crucial to take preventive measures as well as to identify when a transaction is fraudulent.

It is quite likely that on some occasions you’ve had your credit card blocked when trying to make a simple purchase, causing stress and embarrassment. When this happens, your bank or credit card issuer may have detected suspicious activity, which sometimes is just a false positive.

Because of the massive volume of data available for each customer and each financial activity, artificial intelligence can be utilized to effectively identify suspicious patterns in transactions. To increase the accuracy of the analyses, many institutions are investing in the improvement of AI algorithms.

Photo by fullvector on Freepik

About the Data

The dataset utilized in this project was provided by Kaggle and contains transactions made by credit cards in September 2013 by European cardholders. The transactions reported occurred within two days.

The dataset is highly unbalanced since it has 492 (0.17%) frauds out of 284,807 transactions.

All the features in the dataset are numerical. Due to client confidentiality, the original features went through a PCA transformation and the resulting columns were named V1, V2, …, V28. PCA projects the data onto its principal components; dropping the components with the smallest variance yields a lower-dimensional representation that preserves as much of the data's variance as possible. The only two exceptions are the features Time and Amount, which contain the seconds elapsed between each transaction and the first transaction in the dataset, and the transaction amount, respectively.
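To make the idea of a PCA projection concrete, here is a minimal sketch using scikit-learn's PCA. The raw data below is synthetic and purely hypothetical (the original features behind V1 to V28 are not public), so this only illustrates the technique, not how the dataset was actually produced.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for raw features: 5 correlated columns
# generated from only 2 underlying factors plus a little noise.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X_raw = latent @ mixing + 0.05 * rng.normal(size=(500, 5))

# Keep the 2 components with the largest variance and drop the rest.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_raw)

print(X_pca.shape)                          # (500, 2)
print(pca.explained_variance_ratio_.sum())  # share of variance preserved
```

The projected columns are uncorrelated with each other by construction, which is worth keeping in mind when reading a correlation matrix of PCA features.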

The feature Class is the dependent variable and takes the values:

  • 0 for regular transactions
  • 1 for fraudulent activity

Exploratory Analysis

We are using the following libraries in this analysis.

Let’s start by examining the distribution of regular and fraudulent transactions over time.

Graph 1

Although the number of frauds is significantly smaller than the number of regular transactions, we can see a distinct behavior, especially around the 100,000 Time mark.

Notice that the number of regular transactions drops sharply around the 90,000th-second mark, to surge again around the 110,000th-second mark. It wouldn’t be absurd to assume that this period is during the night when individuals naturally perform fewer purchases and transactions than during the daytime.

On the other hand, a great number of fraudulent transactions occurred around the 100,000 mark, which would be consistent with the previous assumption: criminals may prefer to commit fraud late at night, when there is likely less surveillance and victims are less likely to notice the scam soon enough. Of course, this is just a hypothesis. Let’s wait and see how our machine learning model will interpret these numbers.

Doing the same analysis with the Amount of each transaction, we got the following result.

Graph 2

As for the transactions’ amount, apparently there is no significant insight we can gather from them. Most transactions, both regular and fraudulent, were of “small” values.

However, we can perform a deeper inspection with a boxplot graph.

Graph 3

Once again, there is no large disparity between the two distributions. Now let’s plot a correlation matrix to determine the correlation between the variables.

Graph 4

We did not observe any strong correlation between the variables. Remember that our data is highly unbalanced, so the few fraudulent rows barely influence these coefficients. After balancing the classes, we should see a more informative correlation matrix.

Now we can start handling our machine learning models.

Machine Learning Models

Before setting up the machine learning model, we need to follow three steps:

  • Preprocess the features Time and Amount with StandardScaler
  • Split the dataset into train and test data
  • Deal with the unbalanced dataset

Preprocessing

StandardScaler transforms each feature so that it has a mean of 0 and a standard deviation of 1. This step matters because Time and Amount are on very different scales from the PCA features, and many algorithms, logistic regression among them, behave better when the features share a comparable scale.
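A minimal sketch of this scaling step is shown below. The four rows are made-up values standing in for the Time and Amount columns; they are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: column 0 plays the role of Time, column 1 of Amount.
X = np.array([
    [0.0, 149.62],
    [1.0, 2.69],
    [2.0, 378.66],
    [7.0, 123.50],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # subtract column mean, divide by column std

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```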

Splitting into train and test data

After transforming the Amount and Time features, let's split our dataset into train and test data. The test set holds 25% of the rows (test_size = 0.25, the default value).
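As a sketch of the split, the snippet below calls scikit-learn's train_test_split on a small synthetic array. The data and the random_state value are assumptions made for illustration; leaving test_size unset gives the 0.25 default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, 10 of them labelled as fraud.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# test_size is not passed, so 25% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(len(X_train), len(X_test))  # 75 25
```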

Balancing the dataset

As we mentioned before, the dataset is highly unbalanced. With such a severe skew in the class distribution (284,315 entries in Class = 0 and 492 in Class = 1), a model trained on the raw data could become biased toward the majority class and produce unsatisfactory results, for instance by largely ignoring the class with fewer entries.

To address the problem, we will randomly balance the data with the aid of RandomUnderSampler.
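The article relies on RandomUnderSampler from imbalanced-learn, which exposes a fit_resample method. To show what random undersampling does without that dependency, here is a plain NumPy sketch of the same idea. The helper random_undersample is hypothetical and written only for illustration.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Hypothetical unbalanced data: 95 regular rows and 5 fraudulent rows.
y = np.array([0] * 95 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)

X_bal, y_bal = random_undersample(X, y)
print(np.bincount(y_bal))  # [5 5]
```

All minority rows are kept and the majority class is cut down to match, which is why the balanced training set is so much smaller than the original one.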

Let’s see how our distribution looks after balancing the dataset.

Graph 5

Notice that with the dataset balanced, we have 351 entries for each class. Let’s check the correlation matrix once again to see if we can determine some correlation.

Graph 6

Notice that with balanced data our correlation matrix shows meaningful correlations between some features, unlike the correlation matrix displayed earlier.

After completing these three essential steps, we can set up our machine learning models.

Classification Models

We are going to utilize two of the main machine learning classification methods:

  • Logistic Regression
  • Decision Tree

Logistic Regression

Logistic regression models estimate the probability that an observation belongs to a given class. In our case, the model will estimate the probability of a transaction belonging to Class 0 or 1, that is, regular or fraudulent.
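The sketch below fits scikit-learn's LogisticRegression on synthetic one-feature data and reads the class probabilities back. The data is hypothetical and the default hyperparameters are an assumption; the article's own settings are not shown in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: class 1 is centered at +2, class 0 at -2.
X = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

model = LogisticRegression()
model.fit(X, y)

x_new = np.array([[-3.0], [3.0]])
proba = model.predict_proba(x_new)  # columns: P(class 0), P(class 1)

# The class-1 probability is the sigmoid of the linear score w*x + b.
manual = 1.0 / (1.0 + np.exp(-model.decision_function(x_new)))

print(proba)
print(manual)
```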

Measuring the quality of predictions

Graph 7

Notice that the model has an accuracy of 97% and a ROC AUC score of 95%, meaning that our logistic regression performed really well.

From the confusion matrix, we can conclude that 97% of the regular transactions were correctly classified as regular (True Negatives) and 93% of the fraudulent transactions were accurately classified as frauds (True Positives).
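For reference, these metrics can be computed with scikit-learn as sketched below. The labels are a tiny made-up example, not the article's predictions. The sketch also shows that when ROC AUC is computed from hard class predictions rather than probabilities, it equals the average of the true negative and true positive rates. The reported 95% matches the average of 97% and 93%, which suggests it may have been computed this way, though that is an inference and not something the article states.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical labels: 8 regular transactions and 2 frauds.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0])

# normalize="true" divides each row by the number of true samples in that class.
cm = confusion_matrix(y_true, y_pred, normalize="true")
tnr, tpr = cm[0, 0], cm[1, 1]

acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_pred)  # hard predictions, not probabilities

print(acc)       # 0.8
print(tnr, tpr)  # 0.875 0.5
print(auc)       # 0.6875, the mean of tnr and tpr
```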

Decision Tree

We already saw that the logistic regression model performed well. Now let’s see how the decision tree performs.

Measuring the quality of predictions

Graph 8

The decision tree also performs quite well, with an accuracy of 97% and a ROC AUC score of 93%. The true negatives and true positives were well predicted, with 97% and 90%, respectively.

Let’s plot the decision tree to see how the decisions are made through its branches.

Graph 9

Notice how the decisions are made by dividing the inputs into smaller decision nodes and leaves.
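A text rendering makes the same structure easy to inspect without a plot. The sketch below trains a very small tree on toy data and prints it with scikit-learn's export_text. The data, the feature name "amount", and the max_depth value are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: the label is 1 only when the feature exceeds 5.
X = np.arange(10).reshape(-1, 1)
y = (X.ravel() > 5).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Each indented line is a decision node; lines ending in "class: ..." are leaves.
rules = export_text(tree, feature_names=["amount"])
print(rules)
```

On this toy data a single split is enough to separate the classes, so the printed tree has one decision node and two leaves.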

Conclusion

Both models, Logistic Regression and Decision Tree, performed extremely well in classifying credit card activities into the classes Regular Transaction and Fraudulent Transaction, with accuracy and ROC AUC scores above 90% and true positive rates of at least 90%. Although they produced similar outcomes, the Logistic Regression showed slightly better results, with a higher ROC AUC score, which measures how well the model is able to distinguish between classes.

It is important to point out how crucial it is to preprocess and balance the data. Remember how much more informative the correlation matrix became after the classes were balanced.

The machine learning algorithms for detecting credit card fraud are highly efficient, but there are still gaps to close. One of the biggest problems is the occurrence of False Positives, that is, cases where the algorithm flags a legitimate transaction as fraud. Thus, we are always searching for ways to shrink that 3% False Positive mark even further.

For the full code, please refer to the notebook.
