Machine Learning for Fraud Detection Using XGBoost Classifier
Introduction
Imagine standing at the checkout counter of a supermarket with a long line behind you when the cashier not-so-quietly announces that your card has been declined. At this moment, you probably aren’t thinking about the data science that determined your fate.
Although you are certain that you have enough funds to cover everything, the card still won’t go through. You step aside so the cashier can serve the next customer, and then you receive a message from your bank: “Press 1 if you really tried to spend $500 on cheddar cheese.”
This moment is embarrassing for anyone who goes through it, but a good fraud prevention system actually saves consumers millions of dollars per year. Researchers from the IEEE Computational Intelligence Society (IEEE-CIS) want to improve this figure while also improving the customer experience. The better the system performs, the more fraud losses it can prevent.
IEEE-CIS works across a variety of AI and machine learning areas, including deep neural networks, fuzzy systems, evolutionary computation, and swarm intelligence. Today they’re partnering with the world’s leading payment service company, Vesta Corporation, to seek the best solutions for the fraud prevention industry, and you are invited to join the challenge.
Please refer to this link for the Competition website: IEEE-CIS Fraud Detection
The main objective of this article is to provide a baseline model and methodology for fraud detection using the dataset provided by the competition. I hope it helps anyone who is struggling to get started in a machine learning competition or who wants to understand how machine learning can be applied to a real-world project. We entered this competition as a team of 3, and we want to share the methodology that helped us finish in the top 3% out of 6381 teams.
Without further ado, let’s get started!
The Aim Of This Competition
In this competition, we build a machine learning model on a large-scale dataset that comes from Vesta’s real-world e-commerce transactions and contains a wide range of features, from device type to product features. We also have the opportunity to create new features from the data to improve our results.
If we come up with a really good method, we will improve the effectiveness of fraudulent transaction alerts for millions of people around the world, helping thousands of businesses reduce their fraud losses and increase their profits. Furthermore, we will spare other people the checkout embarrassment described above and put a smile on their faces.
Environment Setup
Language: Python 3.5.5
Main Libraries:
- Numpy
- Pandas
- Scikit-Learn
- Seaborn
- Matplotlib
- XGBoost
- CatBoost
Data Exploration
The dataset can be downloaded from the Kaggle competition website.
The data is split into two files, identity and transaction, which are joined by TransactionID. Note that not all transactions have corresponding identity data.
Categorical Features — Transaction
ProductCD
card1 - card6
addr1, addr2
P_emaildomain
R_emaildomain
M1 - M9
Categorical Features — Identity
DeviceType
DeviceInfo
id_12 - id_38
The TransactionDT feature is a timedelta from a given reference datetime (not an actual timestamp).
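Because TransactionDT is an offset rather than a timestamp, calendar-style features have to be derived from it. The sketch below shows one way this could be done. It rests on two assumptions that the data description does not confirm: that the offset is measured in seconds, and that the reference datetime is 2017-12-01 (a purely hypothetical value chosen for illustration). The helper names are also hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: the real reference datetime is not published; this one is hypothetical.
ASSUMED_REFERENCE = datetime(2017, 12, 1)


def transaction_dt_to_datetime(transaction_dt, reference=ASSUMED_REFERENCE):
    """Treat TransactionDT as seconds elapsed since the assumed reference."""
    return reference + timedelta(seconds=transaction_dt)


def hour_of_day(transaction_dt):
    """Hour within the day (0-23), independent of the reference date."""
    return (transaction_dt // 3600) % 24


def day_index(transaction_dt):
    """Number of whole days elapsed since the reference."""
    return transaction_dt // 86400


example_dt = 90000  # hypothetical value: one day and one hour after the reference
print(transaction_dt_to_datetime(example_dt))
print(hour_of_day(example_dt), day_index(example_dt))
```

Note that hour_of_day and day_index only depend on the unit assumption, so they remain usable even if the guessed reference date is wrong.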
Column Description
Transaction Table
- TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
- TransactionAMT: transaction payment amount in USD
- ProductCD: product code, the product for each transaction
- card1 — card6: payment card information, such as card type, card category, issue bank, country, etc.
- addr: address
- dist: distance
- P_emaildomain and R_emaildomain: purchaser and recipient email domain
- C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
- D1-D15: timedelta, such as days between previous transaction, etc.
- M1-M9: match, such as names on card and address, etc.
- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.
Categorical Features
ProductCD
card1 - card6
addr1, addr2
P_emaildomain, R_emaildomain
M1 - M9
Identity Table
Variables in this table are identity information — network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions.
They’re collected by Vesta’s fraud protection system and digital security partners.
(The field names are masked and pairwise dictionary will not be provided for privacy protection and contract agreement)
Categorical Features:
DeviceType
DeviceInfo
id_12 - id_38
Acknowledgement: All the information was provided by the Kaggle Competition Website.
Methodology
- First, we merge the training data from the transaction file and the identity file on their shared TransactionID.
- Once we have the training sample, we split it into 5 roughly equal chunks (folds) with Stratified K-Fold. Stratified K-Fold differs from a plain K-Fold split in that it uses the class labels: each fold keeps roughly the same proportion of each class (0 and 1 in our case) as the full training set. This matters here because fraudulent transactions are rare.
- Once the data has been prepared, we train a model/classifier on each split so that it generalises from the data and can make predictions.
- Finally, we ensemble the predictions by averaging them to produce the final prediction.
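The splitting and averaging steps above can be illustrated with a small standard-library sketch. It is a simplified stand-in for the idea behind stratified splitting, not scikit-learn's implementation, and the function names and toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of this class round-robin across the folds
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


def average_predictions(per_fold_predictions):
    """Average the test predictions produced by the model of each fold."""
    n_folds = len(per_fold_predictions)
    return [sum(values) / n_folds for values in zip(*per_fold_predictions)]


# Hypothetical toy labels: 35 positives among 1000 samples
labels = [1] * 35 + [0] * 965
folds = stratified_folds(labels, n_splits=5)
for fold_number, fold in enumerate(folds):
    positives = sum(labels[i] for i in fold)
    print(fold_number, len(fold), positives)

# Two hypothetical fold models scoring the same two test rows
print(average_predictions([[0.0, 1.0], [0.5, 0.5]]))
```

With these toy labels every fold ends up with the same share of positives, which a plain random split would not guarantee for such a rare class.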
XGBoost Parameters
clf = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=9,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    missing=-999,
    random_state=2019,
    tree_method='auto',
    n_jobs=-1,
)
- Stratified K-Fold: 5 folds
CODE
Import Library
import os
import numpy as np
import pandas as pd
from sklearn import preprocessing
import xgboost as xgb
from catboost import CatBoostClassifier
Loading Data
%%time
train_transaction = pd.read_csv('train_transaction.csv', index_col='TransactionID')
test_transaction = pd.read_csv('test_transaction.csv', index_col='TransactionID')

train_identity = pd.read_csv('train_identity.csv', index_col='TransactionID')
test_identity = pd.read_csv('test_identity.csv', index_col='TransactionID')

sample_submission = pd.read_csv('sample_submission.csv', index_col='TransactionID')

# Left join on TransactionID: keep every transaction, even those without identity data
train = train_transaction.merge(train_identity, how='left', left_index=True, right_index=True)
test = test_transaction.merge(test_identity, how='left', left_index=True, right_index=True)

print(train.shape)
print(test.shape)

y_train = train['isFraud'].copy()
del train_transaction, train_identity, test_transaction, test_identity

# Separate the features from the target (NaNs are handled in the cleaning step further down)
X_train = train.drop('isFraud', axis=1)
X_test = test.copy()

del train, test

# Label Encoding
for f in X_train.columns:
    if X_train[f].dtype == 'object' or X_test[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        # Fit on train and test together so values seen only in test still get a code
        lbl.fit(list(X_train[f].values) + list(X_test[f].values))
        X_train[f] = lbl.transform(list(X_train[f].values))
        X_test[f] = lbl.transform(list(X_test[f].values))
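The loop above fits each encoder on the train and test values combined. A minimal standard-library sketch of why that matters is shown below. It only mimics the general idea of label encoding (sorted distinct values mapped to integers) and is not scikit-learn's code. The function name and the sample values are hypothetical.

```python
def fit_label_codes(*columns):
    """Map every distinct value across the given columns to an integer code."""
    distinct = sorted({str(value) for column in columns for value in column})
    return {value: code for code, value in enumerate(distinct)}


train_col = ['visa', 'mastercard', 'visa']  # hypothetical training values
test_col = ['visa', 'discover']             # 'discover' only appears in test

# Fitting on both columns gives every value a code
codes = fit_label_codes(train_col, test_col)
print([codes[v] for v in train_col])
print([codes[v] for v in test_col])

# Fitting on train alone leaves the unseen test value without a code
train_only_codes = fit_label_codes(train_col)
print('discover' in train_only_codes)
```

The trade-off is that the encoding then depends on the test set, which is acceptable in a competition with a fixed test file but would need a different strategy for unseen categories in production.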
Reducing Memory Usage
%%time
# From kernel https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
# WARNING! THIS CAN DAMAGE THE DATA
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the smallest integer type whose range contains the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Floats: the range check ignores precision, so values may be rounded
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

X_train = reduce_mem_usage(X_train)
X_test = reduce_mem_usage(X_test)
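The warning in the code above refers to precision loss: the function only checks that a float column's minimum and maximum fit into float16, not that its values survive the cast. The sketch below reproduces that rounding with Python's struct module, which supports the same IEEE 754 half-precision format. The helper name and the sample values are hypothetical.

```python
import struct


def to_float16(value):
    """Round-trip a Python float through IEEE 754 half precision (float16)."""
    return struct.unpack('<e', struct.pack('<e', value))[0]


amount = 1234.56  # hypothetical transaction amount
stored = to_float16(amount)
print(amount, '->', stored)  # the cents are rounded away

large_value = 2049.0  # above 2048, float16 can only store even integers
print(large_value, '->', to_float16(large_value))
```

If exact amounts or large counts matter for a feature, one option is to cap the downcast at float32 for those columns.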
Cleaning Null Values in the Data
# Data cleaning
def clean_inf_nan(df):
    # Replace inf and -inf with NaN so all missing values can be filled the same way
    return df.replace([np.inf, -np.inf], np.nan)

# Cleaning infinite values to NaN
X_train = clean_inf_nan(X_train)
X_test = clean_inf_nan(X_test)

# Fill with the median because the mean may be affected by outliers
for i in X_train.columns:
    X_train[i] = X_train[i].fillna(X_train[i].median())

# X_train.isna().sum().sum()  # optional check of the remaining NaN count

for i in X_test.columns:
    X_test[i] = X_test[i].fillna(X_test[i].median())
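The choice of the median over the mean can be seen on a tiny example. This is a standard-library sketch with hypothetical numbers and a hypothetical helper name, separate from the pandas code above.

```python
import math
from statistics import mean, median


def fill_missing_with_median(values):
    """Replace NaN entries with the median of the observed values."""
    observed = [v for v in values if not math.isnan(v)]
    fill_value = median(observed)
    return [fill_value if math.isnan(v) else v for v in values]


# Hypothetical amounts with one missing value and one extreme outlier
amounts = [20.0, 25.0, float('nan'), 30.0, 5000.0]
observed = [v for v in amounts if not math.isnan(v)]

print(mean(observed))    # pulled far up by the single outlier
print(median(observed))  # stays close to the typical values
print(fill_missing_with_median(amounts))
```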
Main model
#%%time
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

EPOCHS = 5  # number of folds (not training epochs)

kf = StratifiedKFold(n_splits=EPOCHS, shuffle=True)

y_preds = np.zeros(sample_submission.shape[0])  # averaged predictions for the test set
y_oof = np.zeros(X_train.shape[0])              # out-of-fold predictions for the training set

for tr_idx, val_idx in kf.split(X_train, y_train):
    clf = xgb.XGBClassifier(
        n_estimators=500,
        max_depth=9,
        learning_rate=0.05,
        subsample=0.9,
        colsample_bytree=0.9,
        missing=-999,
        random_state=2019,
        tree_method='auto',
        n_jobs=-1,
    )

    X_tr, X_vl = X_train.iloc[tr_idx, :], X_train.iloc[val_idx, :]
    y_tr, y_vl = y_train.iloc[tr_idx], y_train.iloc[val_idx]
    clf.fit(X_tr, y_tr)

    # Predicted fraud probability for the held-out (validation) fold
    y_pred_train = clf.predict_proba(X_vl)[:, 1]
    y_oof[val_idx] = y_pred_train
    print('ROC AUC {}'.format(roc_auc_score(y_vl, y_pred_train)))

    # Each fold contributes 1/EPOCHS of the final test prediction
    y_preds += clf.predict_proba(X_test)[:, 1] / EPOCHS
Evaluation Metric
To judge how good our model is, we need a suitable evaluation metric. Thankfully, Scikit-Learn provides many useful metrics. We will use the same metric as the Kaggle competition: the ROC AUC score.
ROC AUC score: the area under the Receiver Operating Characteristic curve (ROC AUC), computed from prediction scores. The higher, the better.
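One way to build intuition for this metric: ROC AUC equals the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one. The sketch below computes it that way in plain Python. It is for illustration only (the function name is hypothetical and the pairwise loop is far too slow for the full dataset). The article itself uses roc_auc_score from Scikit-Learn.

```python
def pairwise_roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError('ROC AUC needs at least one sample of each class')
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_roc_auc(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

Because only the ordering of the scores matters, the metric is unaffected by how rare the positive class is, which makes it a reasonable fit for fraud data.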
Validation Result for Each Fold/Chunk
ROC AUC: 0.9645106433450377
ROC AUC: 0.9623382759183995
ROC AUC: 0.9628282124302243
ROC AUC: 0.9620205684446603
ROC AUC: 0.9622885595697329
Private LeaderBoard Score and Rank
ROC AUC: 0.929200
RANK: 153 out of 6381 teams
Achievement: Silver Medal
Total Team Members: 3
Conclusion
In conclusion, machine learning and deep learning have shown promising results in solving real-world problems, and a fraud detection system is a good example of where machine learning can be applied. Although it can’t solve everything that we face right now, I believe that each invention is a stepping stone for the advancement of technology for the next generation. Hopefully, in the near future, we will be able to implement more powerful systems that prevent bad things from happening.
Cheers!!!
Please follow me on Medium if you want to keep up to date with my articles and projects :D