Predicting Fraudulent Online Transactions: IEEE-CIS Fraud Detection

Frank Jin
Jan 31, 2020


Written by Xingyuan Gu and Frank Jin

ThinkStock Photos

Has your credit or debit card ever been charged without authorization? Have you had the embarrassing experience of a cashier announcing that your card has been declined, right in front of you and the bags of groceries you are trying to buy? In this article, we train machine learning models aimed at improving existing fraud prevention systems, using a challenging large-scale dataset provided by Vesta Corporation on Kaggle.

Before we get started, let’s learn more about the data. Because some variables in the dataset are identity information (IP address, etc.) and rich features engineered by Vesta, the provider masked most of the field names. After merging the train descriptive dataset with the train transaction data, we could guess the meaning of some columns, but many others remain opaque, which made the data much harder for Xingyuan and me to understand.
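As a rough illustration of the merge step, here is a minimal pandas sketch. The join key `TransactionID`, the other column names, and the choice of a left join are assumptions made for the example, and the two small frames stand in for the real tables.

```python
import pandas as pd

# Tiny stand-ins for the two training tables. In practice they would be
# loaded with pd.read_csv; every column name except isFraud is assumed.
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "isFraud": [0, 0, 1, 0],
    "TransactionAmt": [50.0, 120.5, 75.0, 20.0],
})
identity = pd.DataFrame({
    "TransactionID": [2, 3],
    "DeviceType": ["mobile", "desktop"],
})

# A left join keeps every transaction, including those with no identity row.
train = transactions.merge(identity, on="TransactionID", how="left")

# Columns that come only from the identity table are NaN where no match exists.
missing_share = train.isna().mean()
print(train)
print(missing_share)
```

With a left join, the missing-value share per column becomes a useful first look at how sparse the merged table is.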

Nonetheless, we continued with Exploratory Data Analysis (EDA) on the training dataset. Because we noticed many 0s in the isFraud column at first glance, we first compared the total number of fraud transactions with the total number of non-fraud transactions to see whether the classes are balanced.

In the graph, “1” marks fraud transactions and “0” marks non-fraud transactions. Not surprisingly, the number of fraud transactions is far lower than that of non-fraud transactions. This result is reasonable given that the prevention systems at different banks function effectively. It also implies that, when we validate our models, we should use the Receiver Operating Characteristic (ROC) curve, which is much less sensitive to class imbalance than accuracy; otherwise, a model could simply guess NotFraud on all transactions and still receive a high accuracy rate.
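To make the accuracy pitfall concrete, here is a small sketch on synthetic labels. The 3% fraud rate is made up for illustration and is not taken from the dataset. A "model" that always answers NotFraud gets high accuracy, yet its ROC AUC is no better than chance.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels with roughly 3% positives (an arbitrary, assumed rate).
y_true = (rng.random(10_000) < 0.03).astype(int)
fraud_rate = y_true.mean()

# A "model" that predicts NotFraud for every transaction.
always_zero_pred = np.zeros_like(y_true)
always_zero_score = np.zeros(len(y_true), dtype=float)

# Accuracy equals the share of non-fraud rows, so it looks impressive.
accuracy = accuracy_score(y_true, always_zero_pred)

# A constant score cannot rank fraud above non-fraud, so the AUC is 0.5.
auc_value = roc_auc_score(y_true, always_zero_score)

print(f"fraud rate={fraud_rate:.3f} accuracy={accuracy:.3f} auc={auc_value:.3f}")
```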

Secondly, we explored the distribution of transaction date-time for Fraud and NotFraud transactions. The colors follow the graph above: orange for Fraud transactions and blue for NotFraud transactions. The graph shows a very interesting phenomenon: fraud transactions are much more likely to occur in a specific time period, from 0.00 to 0.25. This finding could be very helpful, so we decided to include this variable in our model.

The graph above shows the top 10 most used email domains for purchasers and recipients in the dataset. The variable P_emaildomain describes the email domains used by purchasers, while R_emaildomain describes those used by recipients. Most purchasers use Yahoo and Hotmail addresses. For recipients, however, Yahoo and Hotmail are less popular; the most frequently used email domains are twc.com, cableone.net, ptd.net, suddenlink.net and q.com, all five of which appear with very similar frequency.
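A ranking like this can be computed with `value_counts`. The sketch below uses a toy frame with placeholder domains; only the column names P_emaildomain and R_emaildomain come from the dataset, and the helper `top_domains` is a hypothetical name introduced for the example.

```python
import pandas as pd

# Toy data with placeholder domains; None marks a missing email domain.
train = pd.DataFrame({
    "P_emaildomain": ["a.example", "b.example", "a.example", None, "c.example", "a.example"],
    "R_emaildomain": [None, "b.example", "b.example", None, "c.example", None],
})

def top_domains(frame, column, n=10):
    """Return the n most frequent non-missing values of `column`."""
    # value_counts drops missing values and sorts by descending frequency.
    return frame[column].value_counts().head(n)

top_purchaser = top_domains(train, "P_emaildomain")
top_recipient = top_domains(train, "R_emaildomain")
print(top_purchaser)
print(top_recipient)
```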

Furthermore, when we explored the density of the transaction amount, we found that most transaction amounts are smaller than 200. Additionally, the density pattern of fraud transactions (shown in orange) is very similar to that of non-fraud transactions (shown in blue), so the transaction amount variable may not be very effective in our later analysis.
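One way to back up a visual comparison like this with numbers is to compare the share of amounts under 200 and the quartiles per class. The sketch below uses made-up amounts, and the column name `TransactionAmt` is an assumption.

```python
import pandas as pd

# Made-up amounts for illustration; TransactionAmt is an assumed column name.
train = pd.DataFrame({
    "isFraud":        [0, 0, 0, 0, 1, 1, 0, 1],
    "TransactionAmt": [25.0, 60.0, 150.0, 480.0, 30.0, 110.0, 75.0, 900.0],
})

# Share of transactions under 200, overall and per class.
under_200 = train["TransactionAmt"] < 200
share_overall = under_200.mean()
share_by_class = under_200.groupby(train["isFraud"]).mean()

# Per-class quartiles give a rough numeric view of how similar
# the two amount distributions are.
quartiles = train.groupby("isFraud")["TransactionAmt"].quantile([0.25, 0.5, 0.75])

print(share_overall)
print(share_by_class)
print(quartiles)
```

If the per-class shares and quartiles are close, the amount on its own carries little signal, which matches the reading of the density plot.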

Training the Model

To begin with, we cleaned our data by removing columns with more than 80% missing values. We also transformed categorical data into numerical data. After splitting the data into a training set and a validation set, we trained our model with the XGBoost classifier (XGBClassifier).

Evaluating the Model

Finally, we want to see how well our model performs. As mentioned in our EDA section, we should use the ROC curve to evaluate our model, since it is much less affected by the imbalanced dataset than accuracy is. Based on the graph below, our result is quite good: the area under the curve (AUC) is 0.94, well above the 0.5 that random guessing would score.
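For readers who want to reproduce this kind of evaluation, here is a minimal scikit-learn sketch. The labels and scores are synthetic, so the resulting AUC has nothing to do with the 0.94 reported above; in practice `y_valid` and `scores` would be the validation labels and the model's predicted fraud probabilities.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

rng = np.random.default_rng(1)

# Synthetic validation labels and scores; positives tend to score higher.
y_valid = (rng.random(5_000) < 0.05).astype(int)
scores = rng.normal(loc=y_valid * 1.5, scale=1.0)

# Points of the ROC curve: false positive rate vs. true positive rate,
# one point per decision threshold.
fpr, tpr, thresholds = roc_curve(y_valid, scores)

# The AUC can be computed from the curve or directly from the scores.
area_from_curve = auc(fpr, tpr)
area_direct = roc_auc_score(y_valid, scores)

print(f"AUC from curve: {area_from_curve:.3f}, direct: {area_direct:.3f}")
```

Plotting `fpr` against `tpr` (for example with matplotlib) gives the ROC curve itself; the diagonal from (0, 0) to (1, 1) corresponds to random guessing.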

Frank Jin

M.Sc. Quantitative Management (MQM): Business Analytics, Duke University; Master of Accounting, Ohio State University