A realistic approach to Kaggle’s IEEE-CIS Fraud Detection Challenge

Priyank Mishra
18 min read · Jul 24, 2021

Table of Contents -

  1. Introduction
  2. Business Problem
  3. Machine Learning Problem
  4. Data Preprocessing and Exploratory Data Analysis (EDA)
  5. Choosing the Baseline Model
  6. Choosing the Cross Validation Strategy
  7. Initial Hyperparameter Tuning
  8. Tuned Predictions
  9. Data Cleaning
  10. Feature Engineering
  11. Adversarial Validation
  12. Final Hyperparameter Tuning
  13. Final Predictions
  14. Summary
  15. Deployment
  16. Improvements and Future Work
  17. Conclusion
  18. References

1. Introduction -

Since the advent of Online Payment Systems, people with bad intentions have kept finding new ways to deceive the common masses and lure them into fraud traps. The most common type of fraud is Card Fraud, since the majority of people worldwide hold one type of card or another, which they regularly use to withdraw cash or make online transactions.

In the early days, when people became victims of Card Fraud they simply stopped using cards, which became an issue for the banks. To avoid losing customers, banks lured people into keeping their cards by offering Card Insurance and Protection facilities. This kept accounts active for much longer, but it shifted the burden of the losses due to Card Fraud from the shoulders of the Card Holder onto the Bank.

To deal with Card Fraud, banks decided to develop methodologies that could stop fraud before it happened, and brilliant minds from various domains came up with clever ideas that worked well for a while. With time, however, fraudsters adapted to the changing environment and invented new ways to defraud people, and the methods implemented by the banks became stale.

Then came the era of Machine Learning, which made it possible to build systems and methods that adjust themselves to the environment. The only requirement of these systems is clean, good data that represents the current environment as closely as possible. With the growing availability of data, more and more state-of-the-art fraud detection systems have been built, and are still being built, since the tech world today is changing very fast and so are the methods of fraud.

Continuing this trend, IEEE-CIS, an organization that works across a variety of AI and machine learning areas, including deep neural networks, fuzzy systems, evolutionary computation, and swarm intelligence, partnered with the world’s leading payment service company, Vesta Corporation, seeking the best solutions for the fraud prevention industry. They held a competition named “IEEE-CIS Fraud Detection”, in which competitors were asked to benchmark machine learning models on a challenging large-scale dataset. The data comes from Vesta’s real-world e-commerce transactions and contains a wide range of features, from device type to product features.

2. Business Problem -

2.1 Business Objective

The traditional approach to detecting fraudulent transactions involves manual monitoring by humans, which also involves interaction with the card holders. This approach is not very efficient: it consumes a lot of time and is quite expensive in terms of the resources needed to check a single transaction. Moreover, in the real world the majority of manually monitored transactions turn out to be legitimate and only a few turn out to be fraudulent, which wastes a lot of time and energy. Hence, in this competition we need to come up with an automated screening system that requires minimal human intervention to detect the legitimacy of transactions using Machine Learning.

We will predict whether a transaction is fraudulent or not, and to do so we will build an ML model using a real-world e-commerce dataset.

2.2 Constraints

  • Predicting a legitimate transaction as fraudulent leads to a bad customer experience, while predicting a fraudulent transaction as legitimate leads to heavy financial losses. Hence, the predictions must be as accurate as possible.
  • If a fraudulent transaction occurs and the customer learns about it only hours or days later, the prediction is of no use. Therefore, we need to build a model that predicts the state of a transaction almost instantly.
  • Interpretability is also partially important, especially when a transaction has been declared fraudulent, since one must be able to explain why it was flagged.

3. Machine Learning Problem -

3.1 Data

The datasets can be downloaded from here.

The datasets provided by the Competition Host are as follows,

  • train_transaction.csv : The transaction dataset comprising the transaction details to be used for training the model.
  • train_identity.csv : The identity dataset comprising additional details about the identity of the payer and the merchant for the transactions present in train_transaction.csv.
  • test_transaction.csv : The transaction dataset comprising the transaction information to test the performance of the trained model.
  • test_identity.csv : The identity dataset comprising the additional identity information about the transactions present in the test_transaction.

Description of Transaction Dataset

  • TransactionID — Id of the transaction and is the foreign key in the Identity Dataset.
  • isFraud — 0 or 1 signifying whether a transaction is fraudulent or not.
  • TransactionDT — timedelta from a given reference datetime (not an actual timestamp)
  • TransactionAMT — Transaction Payment Amount in USD.
  • ProductCD — Product Code.
  • card1 — card6 — Payment Card information, such as card type, card category, issue bank, country, etc.
  • addr — Address
  • dist — Distance
  • P_emaildomain — Purchaser Email Domain.
  • R_emaildomain — Receiver Email Domain.
  • C1-C14 — counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
  • D1-D15 — timedelta, such as days between previous transactions, etc.
  • M1-M9 — match, such as names on card and address, etc.
  • V1-V339— Vesta engineered rich features, including ranking, counting, and other entity relations.

Following Features are Categorical in the Transaction Dataset,

  • ProductCD
  • card1-card6
  • addr1, addr2
  • P_emaildomain
  • R_emaildomain
  • M1-M9

Description of the Identity Dataset

Following Features are present in the Identity Dataset,

  • TransactionID — Foreign key to the Transaction Dataset.
  • id_01-id_38 — Masked features corresponding to the identity of the card holders.
  • DeviceType — Type of Device used to make the Transaction.
  • DeviceInfo — Information regarding the characteristics of the Device.

Following Features are Categorical in the Identity Dataset,

  • DeviceType
  • DeviceInfo
  • id_12 — id_38

Variables in this table are identity information — network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions. They’re collected by Vesta’s fraud protection system and digital security partners.

(The field names are masked and a pairwise dictionary is not provided for privacy protection and contract agreement).

3.2 Mapping the real world problem to an ML Problem

Problem Type

Since we need to classify a transaction as fraudulent or non-fraudulent, this is a Binary Classification Problem.

Performance Metric

  • The organizers decided to evaluate the submissions on the area under the ROC curve between the predicted probability and the observed target. Hence, ROC-AUC will be our Key Performance Indicator (KPI).
  • We will be additionally using the Confusion Matrix to add more interpretability to the models.

To know why ROC-AUC is preferred over Accuracy in tasks like Fraud Detection, check this excellent answer.

4. Data Preprocessing and Exploratory Data Analysis (EDA) -

Complete Code for this section can be found here.

4.1 Initial Processing

  • The train_transaction dataset had a total of 590540 rows/data points and 394 columns/features, while train_identity had 144233 rows and 41 columns. The difference in the number of rows between train_transaction and train_identity made it clear that identity information is present for only a fraction of the transactions, and the majority of transactions have no identity information.
  • test_transaction had a total of 506691 rows and 393 columns, while test_identity had 141907 rows and 41 columns.
  • I started by merging the train_transaction and train_identity datasets, and the test_transaction and test_identity datasets, using the TransactionID column, which was a foreign key in the identity table referencing the transaction table. After merging I had two datasets, namely train_dataset (the merged train_transaction and train_identity) and test_dataset (the merged test_transaction and test_identity).
  • test_dataset and train_dataset had a mismatch in the name of id features. In the train_dataset id features were present with the name id_x where x was a value between 01 and 38 whereas in the test_dataset id features were of the form id-x. So, I changed the format of id features in the test_dataset from id-x to id_x.
  • Checked for duplicate rows or columns in the train_dataset and removed if any.
  • Though the train_dataset and test_dataset were already sorted on the TransactionDT column, I explicitly sorted both datasets on TransactionDT just to be doubly sure and saved both datasets locally.
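For reference, here is a minimal pandas sketch of these initial steps; the file paths and variable names are illustrative and may differ from the actual notebook.

```python
import pandas as pd

# Load the four CSVs provided by the competition (paths are illustrative)
train_transaction = pd.read_csv("train_transaction.csv")
train_identity = pd.read_csv("train_identity.csv")
test_transaction = pd.read_csv("test_transaction.csv")
test_identity = pd.read_csv("test_identity.csv")

# Left-merge on TransactionID so transactions without identity info are kept
train_dataset = train_transaction.merge(train_identity, on="TransactionID", how="left")
test_dataset = test_transaction.merge(test_identity, on="TransactionID", how="left")

# Align the id feature names: the test set uses "id-01"..."id-38", train uses "id_01"..."id_38"
test_dataset.columns = [c.replace("id-", "id_") for c in test_dataset.columns]

# Drop exact duplicates (if any) and sort both datasets by TransactionDT
train_dataset = train_dataset.drop_duplicates().sort_values("TransactionDT").reset_index(drop=True)
test_dataset = test_dataset.sort_values("TransactionDT").reset_index(drop=True)

train_dataset.to_pickle("train_dataset.pkl")
test_dataset.to_pickle("test_dataset.pkl")
```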

4.2 EDA

I won’t be explaining each and every finding of the EDA; instead I will mention the key findings. To explore the in-depth EDA, visit the notebook linked at the beginning of this section.

Some of the Key findings by EDA were as follows,

  • The majority of the features had missing values, and a very nice pattern of missing values was found in the V_features, where subgroups of V_features share the same number of missing values.
  • The dataset was extremely imbalanced.
  • The majority of transactions had the ProductCD value “W”, and the second most common value was “C”. Although the difference between the number of “W” and “C” transactions was very large, the numbers of fraudulent transactions with values “W” and “C” were almost comparable, which helped us conclude that transactions with ProductCD “C” had the highest chance of being fraudulent compared to the other ProductCD categories.
  • card4 and card6 features corresponded to the card company and the card type respectively.
  • Almost 99% of the transactions had the same value for the addr2 feature, which helped in concluding that this feature corresponds to the Country Code and that most of the transactions belong to the same country.
  • The majority of transactions did not have an R_emaildomain value, the reason being that not every transaction involves a recipient and hence no information about the receiver was present.
  • M1-M9 features were of the form True and False signifying whether a certain characteristic (like name on card match or not) was satisfied or not.
  • The TransactionDT feature had a minimum value of 86400, which on further analysis helped in concluding that this feature is actually the number of seconds elapsed, since 86400 = 24*60*60. This feature was very helpful, and some very important features were created from it, which are discussed later in the Feature Engineering section.
  • Train and Test Split was done based on time and there was some time gap between the train and test set.
  • C_features were highly intercorrelated.
  • The amount spent was highest for ProductCD “R” and was lowest for ProductCD “S” and “C”. For ProductCD “W”, “H” and “R” if the TransactionAmt was high then that transaction was generally fraudulent.

5. Choosing the Baseline Model -

Complete Code for this section can be found here.

I trained some of the popular and widely used tree and non-tree based classifiers on the given dataset, without any data preprocessing or hyperparameter tuning, to see which one worked best for the dataset; the one that performed best at this initial stage was chosen for use in the later stages.

The classifiers that I chose consisted of both the tree based (Decision Tree, Random Forest, Adaptive Boosting and Gradient Boosted Decision Tree) and non-tree based (Naive Bayes and Logistic Regression) classifiers. Hence, in order to use the dataset with these classifiers I needed to encode it as per the type of Classifier.

5.1 Data Preparation

  • Stored the “isFraud” column of train_dataset as y_train.
  • Dropped “TransactionID” column from both the datasets since it was only a unique identifier for each of the transactions and was of no use in deciding the Transaction Status.
  • Dropped the “isFraud” column from the train_dataset.
  • Dropped the “TransactionID” column of test_dataset.
  • Stored the modified train_dataset and test_datasets as X_train and X_test respectively.
  • Imputed all the non-numeric categorical missing values of the X_train and X_test with “missing”.
  • Label Encoded the Non-Numeric Features of X_train.
  • Label Encoded the Non-Numeric Features of X_test using the encoders fitted on X_train. The main thing I kept in mind was to avoid test data leakage: in Kaggle competitions, Kagglers often exploit leakage from the test data to get a higher score, but this technique (more of a shortcut) is limited to Kaggle competitions only. Hence, I approached the problem keeping the real-world constraints in mind and tried to avoid test data leakage to the best of my ability.

Now, using X_train and X_test, I created three different sets (X_train1, X_test1), (X_train2, X_test2) and (X_train3, X_test3) each having the following characteristics,

  • The missing values of Set1 were imputed using -999 and the dataset was column normalized so as to be used with non-tree based models.
  • The missing values of Set2 were imputed using -999 but the dataset was not normalized so as to be used with Decision Tree, Random Forest and Adaptive Boosting.
  • Set3 was left untouched since XGBoost was used for applying GBDT and it can handle the missing values implicitly.
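A rough sketch of this preparation is shown below, assuming the merged datasets from earlier; the handling of test categories unseen during training (mapping them to -1) is my own illustration and may differ from the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

y_train = train_dataset["isFraud"].values
X_train = train_dataset.drop(columns=["isFraud", "TransactionID"])
X_test = test_dataset.drop(columns=["TransactionID"])

# Impute non-numeric categoricals with "missing" and label encode using train only
for col in X_train.select_dtypes(include="object").columns:
    X_train[col] = X_train[col].fillna("missing")
    X_test[col] = X_test[col].fillna("missing")
    le = LabelEncoder().fit(X_train[col])
    mapping = {cls: i for i, cls in enumerate(le.classes_)}
    X_train[col] = X_train[col].map(mapping)
    X_test[col] = X_test[col].map(mapping).fillna(-1).astype(int)  # unseen categories -> -1

# Set 1: impute with -999 and column-normalize, for the non-tree models
X_train1, X_test1 = X_train.fillna(-999), X_test.fillna(-999)
scaler = StandardScaler().fit(X_train1)
X_train1, X_test1 = scaler.transform(X_train1), scaler.transform(X_test1)

# Set 2: impute with -999, no normalization, for Decision Tree / Random Forest / AdaBoost
X_train2, X_test2 = X_train.fillna(-999), X_test.fillna(-999)

# Set 3: untouched, since XGBoost handles missing values natively
X_train3, X_test3 = X_train, X_test
```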

5.2 Results

Each of the Classifiers was trained and evaluated using its respective datasets and the results that I obtained were as follows,

From the results, it was clear that Tree based Ensembles were working best for our problem and further amongst the Tree based Ensembles, GBDT was giving the highest score. Hence, I decided to use GBDT as the Baseline model.

6. Choosing the Cross Validation Strategy

In the real world we do not have any explicit test data (or cannot know or use the performance of our model on test data beforehand), and hence, in order to know the general performance, we need to use a part of our train dataset only (often referred to as the Cross Validation data). Now the question arises: how do we split the train data to get the Cross Validation data?

There is no single answer to this question, since the right type of split depends completely on the type of dataset, and obtaining a Cross Validation dataset that correlates well with the test dataset is a challenging and very important task. If we do not have a good cross validation strategy, we can never be sure about the performance of the model, even if the model performs very well on our chosen cross validation data. Hence, it is very important to choose the right Cross Validation Strategy before moving on to final model building.

You can refer to this excellent blog to know more about Cross Validation Strategies.

The local Cross Validation strategy that I chose was to split the dataset based on time, since the dataset had a temporal nature; for the final predictions I modified this strategy a little, as discussed later on.

7. Initial Hyperparameter Tuning

Complete Code for this section can be found here.

Now, once I had selected the baseline model, I performed hyperparameter tuning for it so as to measure the improvement in the score, and also because this same model would be used for any model-based data preprocessing or cleaning, which is better done with a tuned model to get more accurate results.

I performed a simple Grid Search with time-based 3-fold CV, but instead of using Scikit-Learn’s Grid Search implementation I used XGBoost’s cross validation method, which allowed me to drop the number of trees from the hyperparameter grid by setting a large value for the number of boosting rounds and using early stopping.
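A minimal sketch of this tuning loop is shown below; the grid values are purely illustrative, and the time-ordered folds use sklearn’s TimeSeriesSplit as a stand-in for the exact splits used in the notebook.

```python
import itertools
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit

dtrain = xgb.DMatrix(X_train3, label=y_train)
time_folds = list(TimeSeriesSplit(n_splits=3).split(X_train3))  # time-based 3-fold CV

grid = {"max_depth": [6, 9, 12], "learning_rate": [0.05, 0.1]}  # illustrative values
best_auc, best_params = 0.0, None

for max_depth, lr in itertools.product(grid["max_depth"], grid["learning_rate"]):
    params = {"objective": "binary:logistic", "eval_metric": "auc",
              "max_depth": max_depth, "learning_rate": lr,
              "subsample": 0.8, "colsample_bytree": 0.8}
    # A large num_boost_round plus early stopping removes the tree count from the grid
    cv_results = xgb.cv(params, dtrain, num_boost_round=5000, folds=time_folds,
                        early_stopping_rounds=100, verbose_eval=False)
    auc = cv_results["test-auc-mean"].iloc[-1]
    if auc > best_auc:
        best_auc = auc
        best_params = {**params, "num_boost_round": len(cv_results)}

print(best_auc, best_params)
```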

Following are the hyperparameters with their respective possible values that I chose to tune,

The results after performing Hyperparameter tuning were as follows,

8. Tuned Predictions

Complete Code for this section can be found here.

After tuning the baseline model and choosing the local CV strategy, I obtained the CV dataset from the train dataset by performing an 80:20 time-based split, where the first 80% of the data was taken as the train dataset and the last 20% as the cv_dataset.

After obtaining the CV dataset I trained the model using the train dataset and computed the scores for the Cross Validation Dataset and the test dataset.
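As a sketch (the hyperparameter values below are illustrative placeholders, not the tuned ones), the split and evaluation look roughly like this:

```python
import xgboost as xgb
from sklearn.metrics import roc_auc_score, confusion_matrix

# The data is already sorted by TransactionDT, so a positional split is a time-based split
split = int(0.8 * len(X_train3))
X_tr, X_cv = X_train3.iloc[:split], X_train3.iloc[split:]
y_tr, y_cv = y_train[:split], y_train[split:]

model = xgb.XGBClassifier(n_estimators=1000, max_depth=9, learning_rate=0.05,
                          subsample=0.8, colsample_bytree=0.8, n_jobs=-1)
model.fit(X_tr, y_tr)

cv_probs = model.predict_proba(X_cv)[:, 1]
print("CV ROC-AUC:", roc_auc_score(y_cv, cv_probs))
print(confusion_matrix(y_cv, (cv_probs > 0.5).astype(int)))
```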

The results for the Cross Validation Data were as follows,

The feature importance was as follows,

The score obtained for the test dataset was as follows,

9. Data Cleaning

Complete Code for this section can be found here.

From this section onward, we will look into the steps that I followed to improve the score.

As the first step, I began with data cleaning where I removed some of the features as follows,

9.1 Removing Redundant Features

The following types of features were removed from the dataset,

  • Which had more than 90% missing values.
  • Which had a single value for complete column.
  • Which had more than 90% values the same.

Normally, we would remove features having at least 60–70% missing values, but in this challenge our objective was to detect anomalies, and columns with missing values are often a great source of information. Also, the dataset given to us had very few anomalous data points, so removing more features would have further reduced the amount of information that could be gained from the dataset. Thus, we chose a very high threshold.
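A minimal sketch of this filter is shown below; the 90% thresholds mirror the description above, while the exact handling of NaNs in the “same value” check is an assumption.

```python
def redundant_columns(df, threshold=0.90):
    """Columns with >90% missing values, a single value, or one value covering >90% of rows."""
    to_drop = []
    for col in df.columns:
        missing_frac = df[col].isna().mean()
        top_frac = df[col].value_counts(dropna=False, normalize=True).iloc[0]
        if missing_frac > threshold or df[col].nunique(dropna=False) <= 1 or top_frac > threshold:
            to_drop.append(col)
    return to_drop

drop_cols = redundant_columns(train_dataset.drop(columns=["isFraud"]))
train_dataset = train_dataset.drop(columns=drop_cols)
test_dataset = test_dataset.drop(columns=[c for c in drop_cols if c in test_dataset.columns])
```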

9.2 Removing Collinear Features

Collinearity was checked among features belonging to the same category, such as the C_features, D_features and V_features. Highly collinear features were not removed from the dataset based on a high Variance Inflation Factor (a measure of collinearity) alone; such a feature was removed only if dropping it did not reduce the Cross Validation AUC.
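For illustration, here is one way to compute VIF for a group of related columns with statsmodels; the -999 imputation is only there to make the computation run, and features flagged this way were dropped only when the CV AUC did not suffer.

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df, cols):
    """Variance Inflation Factor for each column in a related group (e.g. the C features)."""
    sub = df[cols].fillna(-999).astype(float).values
    return {col: variance_inflation_factor(sub, i) for i, col in enumerate(cols)}

c_cols = [c for c in train_dataset.columns if c.startswith("C") and c[1:].isdigit()]
print(vif_table(train_dataset, c_cols))
```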

9.3 Reducing the number of V_features

The V_features form the largest group of features in the dataset, and while performing EDA we found a very interesting pattern in them: subgroups of V_features sharing the same number of missing values.

To reduce the number of V_features, only a single feature from each subgroup was kept (the one with the most missing values), and the chosen feature acted as a representative of all the features in that group.
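A hedged sketch of this reduction: group the V columns by their missing-value count and keep a single representative per group. The rule used below to pick the representative inside a group is illustrative and may differ from what the notebook actually does.

```python
v_cols = [c for c in train_dataset.columns if c.startswith("V") and c[1:].isdigit()]
nan_counts = train_dataset[v_cols].isna().sum()

keep = []
for _, group in nan_counts.groupby(nan_counts):
    group_cols = list(group.index)
    # Illustrative choice of representative: the column with the most unique values
    keep.append(max(group_cols, key=lambda c: train_dataset[c].nunique()))

drop_v = [c for c in v_cols if c not in keep]
train_dataset = train_dataset.drop(columns=drop_v)
test_dataset = test_dataset.drop(columns=[c for c in drop_v if c in test_dataset.columns])
```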

9.4 Removing Time Inconsistent Features

Features that change over time were removed from the dataset. To find such features, a simple XGBoost model was trained on the first 20% of the data using a single feature with “isFraud” as the target label; the same model was then used to predict the target labels of the last 20% of the data using that single feature. If the model gave an AUC < 0.5 on this forward-in-time data, the feature was dropped. The same process was repeated for each feature.
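A rough sketch of this check is shown below; the small XGBoost settings are illustrative, and categorical columns are assumed to have already been numerically encoded (as in Section 5).

```python
import xgboost as xgb
from sklearn.metrics import roc_auc_score

y = train_dataset["isFraud"].values
features = [c for c in train_dataset.columns if c not in ("isFraud", "TransactionID")]

def is_time_inconsistent(feature, frac=0.2):
    """Train on the earliest 20% of the data with one feature, score the latest 20%."""
    n = len(train_dataset)
    head, tail = slice(0, int(frac * n)), slice(int((1 - frac) * n), n)
    clf = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
    clf.fit(train_dataset.iloc[head][[feature]], y[head])
    preds = clf.predict_proba(train_dataset.iloc[tail][[feature]])[:, 1]
    return roc_auc_score(y[tail], preds) < 0.5

drop_time = [f for f in features if is_time_inconsistent(f)]
train_dataset = train_dataset.drop(columns=drop_time)
test_dataset = test_dataset.drop(columns=[c for c in drop_time if c in test_dataset.columns])
```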

After performing Data Cleaning the number of features reduced from 433 to 159.

10. Feature Engineering

10.1 Basic Feature Engineering

Complete Code for this section can be found here.

The following new features were created as part of the Basic Feature Engineering,

  • Transaction Minute
  • Transaction Hour — This was the most interesting feature, and a very important finding was that the relationship between an hour’s total number of transactions and its fraud rate is inverse: the larger the number of transactions in an hour, the lower that hour’s fraud percentage, and the smaller the number of transactions, the higher the fraud percentage.
  • Cyclic Nature of Transaction Hour and Transaction Minute — The cyclic behavior of the hour and minute features was also incorporated (a hedged sketch using sine/cosine transforms appears at the end of this subsection).
  • Transaction Day — TransactionDay gave the day on which a transaction happened, relative to the reference time from which TransactionDT is measured. Since TransactionDT is the number of seconds elapsed since that reference, converting the seconds into days and treating the first day as day 0 gave, for every transaction, its day relative to the first (0th) day.
  • Transaction Weekday
  • Dollars and Cents
  • Natural Logarithm of Transaction Amount
  • card1 divided by 1000
  • card2 divided by 10
  • Parent Domain of P_emaildomain
  • Domain Name of P_emaildomain
  • Top Level Domain of P_emaildomain
  • Parent Domain of R_emaildomain
  • Domain Name of R_emaildomain
  • Top Level Domain of R_emaildomain
  • Device Parent Company
  • Device Version
  • Operating System Name
  • Operating System Version
  • Screen Height
  • Screen Width
  • Interaction Features

Following Interaction Features were created,

  • card_intr1 : Interaction between card1_div_1000, card2_div_10, card3, card5 and card6.
  • card_intr2 : Interaction between card1, card2, card3, card5 and card6
  • card1_addr1
  • card1_addr2
  • card2_addr1
  • card2_addr2
  • card3_addr1
  • card3_addr2
  • card5_addr1
  • card5_addr2
  • card6_addr1
  • card6_addr2
  • ProductCD_addr1
  • ProductCD_addr2
  • card1_ProductCD
  • card2_ProductCD
  • card5_ProductCD
  • card6_ProductCD
  • addr1_P_emaildomain
  • card1_P_emaildomain
  • card1_addr1_P_emaildomain
  • Normalized D_features
  • Features to identify the Card Holders

Based on the information given by the competition host, once a credit card is marked as fraudulent, its status is not changed in the future. Hence, our actual objective was not to predict fraudulent transactions; our real task was to predict fraudulent cards.

Now, in order to do so I needed to somehow identify the Card based on the information given. This was not an easy task since the actual meaning of most of the columns was not revealed. Hence, I tried various ways to identify the card. Now, based on the analysis I found the following possibilities to identify a card,

  • card1_card2_card3_card5_card6_addr1_P_emaildomain — since the card1-card6 features correspond to card information, addr1 corresponds to the billing region and P_emaildomain is the purchaser’s email domain.
  • card1_addr1_P_emaildomain — since card1 had no missing values and mostly unique values, this feature combined with addr1 (the region code) and P_emaildomain should suffice.

There could have been other possibilities too but I decided to go with these only.
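As referenced earlier in this subsection, here is a minimal sketch of the time-derived features from TransactionDT and the cyclic hour/minute encoding; the sine/cosine approach and the derived column names are illustrative.

```python
import numpy as np

# TransactionDT is seconds elapsed from a reference point (its minimum is 86400 = 1 day)
seconds = train_dataset["TransactionDT"]
train_dataset["Transaction_day"] = (seconds // (24 * 60 * 60)).astype(int)   # relative day index
train_dataset["Transaction_hour"] = ((seconds // 3600) % 24).astype(int)
train_dataset["Transaction_minute"] = ((seconds // 60) % 60).astype(int)
train_dataset["Transaction_weekday"] = (train_dataset["Transaction_day"] % 7).astype(int)

# Encode the cyclic nature of hour/minute so that hour 23 sits next to hour 0
train_dataset["hour_sin"] = np.sin(2 * np.pi * train_dataset["Transaction_hour"] / 24)
train_dataset["hour_cos"] = np.cos(2 * np.pi * train_dataset["Transaction_hour"] / 24)
train_dataset["minute_sin"] = np.sin(2 * np.pi * train_dataset["Transaction_minute"] / 60)
train_dataset["minute_cos"] = np.cos(2 * np.pi * train_dataset["Transaction_minute"] / 60)
```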

10.2 Advanced Feature Engineering

Complete Code for this section can be found here.

The Feature Engineering in this section was largely inspired by this competition-winning kernel and consisted mainly of Feature Aggregations and Frequency Encodings. The Transaction Month feature was also added to the dataset, to be used when making the final predictions.

A total of 153 new features were added as part of the Feature Engineering resulting in a total of 312 features in the dataset.
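A hedged sketch of the kind of frequency encodings and aggregations used is shown below. The specific columns, the candidate card identifier from Section 10.1, and the choice to compute the statistics on the train set only (consistent with the leakage-avoidance approach above) are illustrative, not an exact reproduction of the notebook.

```python
# Candidate card/client identifier from Section 10.1 (an assumption, not a confirmed mapping)
for df in (train_dataset, test_dataset):
    df["uid"] = (df["card1"].astype(str) + "_"
                 + df["addr1"].astype(str) + "_"
                 + df["P_emaildomain"].astype(str))

# Frequency encoding: replace each category by how often it occurs in the train set
for col in ["card1", "addr1", "P_emaildomain", "uid"]:
    freq = train_dataset[col].value_counts()
    train_dataset[col + "_freq"] = train_dataset[col].map(freq)
    test_dataset[col + "_freq"] = test_dataset[col].map(freq)

# Aggregations: statistics of TransactionAmt within each group, computed on the train set
for col in ["card1", "uid"]:
    grp = train_dataset.groupby(col)["TransactionAmt"]
    for stat_name, stat in (("mean", grp.mean()), ("std", grp.std())):
        train_dataset[f"TransactionAmt_{col}_{stat_name}"] = train_dataset[col].map(stat)
        test_dataset[f"TransactionAmt_{col}_{stat_name}"] = test_dataset[col].map(stat)
```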

11. Adversarial Validation

Complete Code for this section can be found here.

However good a model we build, if the test data has a different distribution from the train data, the model will not perform well on the test dataset, since it has been trained using the train dataset only. This phenomenon, where the distribution changes from the train to the test dataset, is often termed Covariate Shift. In general we always want the test data to have the same distribution as the train data so that we can build a model that generalizes well.

Adversarial Validation is a strategy to know whether the distribution of features has changed from the train to the test dataset. In this strategy we combine both the train and test datasets and add a new column “isTest” to this concatenated dataset, which indicates whether a given data point belongs to the test dataset or not.

Now, we train a model on this concatenated dataset using a single feature to predict the “isTest” column and compute the model performance based on the metric chosen by us. If the model performs well then it means the model was able to differentiate between train data points and test data points signifying that the feature chosen has a change in distribution from train to the test dataset. Hence, it is better to remove such a feature. The threshold at which the feature should be dropped is completely the user preference.

I performed Adversarial Validation for each of the features and saved the list of the features which resulted in an AUC > 0.7 locally so as to not use them during final training and prediction.
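A minimal per-feature sketch of this procedure is shown below; the small XGBoost model and the 70/30 split inside it are illustrative, and the feature values are assumed to be numerically encoded.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def adversarial_auc(feature):
    """AUC of a model that tries to tell train rows from test rows using one feature."""
    combined = pd.concat([X_train[[feature]], X_test[[feature]]], ignore_index=True)
    is_test = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    X_tr, X_val, y_tr, y_val = train_test_split(combined, is_test, test_size=0.3,
                                                stratify=is_test, random_state=42)
    clf = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
    clf.fit(X_tr, y_tr)
    return roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])

shifted = [f for f in X_train.columns if f in X_test.columns and adversarial_auc(f) > 0.7]
```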

12. Final Hyperparameter Tuning

Complete Code for this section can be found here.

Before moving on to the final predictions, and after adding a lot of features to the dataset and removing many redundant and useless ones, I performed hyperparameter tuning of the model once again, but this time using the feature-engineered dataset.

The optimal values of Hyperparameters were as follows,

13. Final Predictions

Complete Code for this section can be found here.

To compute the local CV predictions I simply split the train dataset 80:20 based on time, used the first 80% as the train dataset and the last 20% as the cross validation dataset, and then trained the XGBoost model with the final hyperparameters. The results were as follows,

The feature importance was as follows,

To make the test predictions I modified my Cross Validation Strategy and used GroupKFold with months as groups. The training data had months 12, 13, 14, 15, 16, 17. Fold one in GroupKFold trained on months 13 to 17 and predicted month 12. Note that the only purpose of month 12 was to tell XGB when to early_stop; we did not actually care about the backwards-in-time predictions. The model trained on months 13 to 17 then predicted test.csv, which was forward in time.
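A simplified sketch of this final scheme, holding out the earliest month purely to drive early stopping and then predicting the test set, is shown below; the hyperparameter values and the Transaction_month column name are illustrative.

```python
import xgboost as xgb

months = train_dataset["Transaction_month"].values   # engineered month feature (12..17)
val_mask = months == months.min()                     # earliest month, used only for early stopping

dtrain = xgb.DMatrix(X_train[~val_mask], label=y_train[~val_mask])
dvalid = xgb.DMatrix(X_train[val_mask], label=y_train[val_mask])
dtest = xgb.DMatrix(X_test)

params = {"objective": "binary:logistic", "eval_metric": "auc",
          "max_depth": 12, "learning_rate": 0.02,
          "subsample": 0.8, "colsample_bytree": 0.4}   # illustrative values

booster = xgb.train(params, dtrain, num_boost_round=5000,
                    evals=[(dvalid, "valid")], early_stopping_rounds=200,
                    verbose_eval=False)

# The model trained on the later months predicts test.csv, which is forward in time
# (on recent XGBoost versions one may also restrict prediction to booster.best_iteration)
test_preds = booster.predict(dtest)
```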

The final results were as follows,

14. Summary

Following is the summary of the test scores obtained from beginning to the end,


15. Deployment

Complete Code for this section can be found here.

Just out of love for seeing the Machine Learning models in action, I deployed this model on my local machine. Check out this small demo.

16. Improvements and Future Work

Model building in the real world is an iterative process, where we finalize and deploy a model initially based on whatever we can do as part of Data Cleaning and Feature Engineering. Once the model is deployed, we start analyzing it again from the beginning in order to find the things we missed in the first go, incorporate these findings into the model, and also try to refresh the dataset with more current data. This process is repeated as and when needed, since the environment in which the model is deployed keeps changing and hence the model and the dataset have to be updated too.

The final score that I got was decent and this score can further be improved by performing the process discussed in the above paragraph. You can take this as a task and try to improve the score keeping my work as the baseline.

17. Conclusion

In this blog, I discussed a realistic approach to IEEE-CIS Fraud Detection. I termed my approach realistic due to the fact that I have tried my best to follow practices that abide by real-world constraints, one of the most important of which is avoiding test data leakage.
