Machine Learning for Fraud Detection Using XGBoost Classifier

Published in

Analytics Vidhya

7 min read · Dec 30, 2019

Introduction

Imagine standing at the checkout counter of a supermarket with a long line behind you when the cashier not-so-quietly announces that your card has been declined. At this moment, you probably aren’t thinking about the data science that determined your fate.

Although you are certain that you have enough funds to cover everything, the card still won’t go through. You step aside to let the cashier serve the next customer, and then you receive a message from your bank: “Press 1 if you really tried to spend 500 on cheddar cheese.”

This moment is embarrassing for everyone who meets this fate. It would be great if we could provide a good fraud prevention system that actually saves consumers millions of dollars per year. Researchers from the IEEE Computational Intelligence Society (IEEE-CIS) want to improve fraud detection while also improving the customer experience. The better the system performs, the more money it can save.

IEEE-CIS works across a variety of AI and machine learning areas, including deep neural networks, fuzzy systems, evolutionary computation, and swarm intelligence. Today they’re partnering with the world’s leading payment service company, Vesta Corporation, to seek the best solutions for the fraud prevention industry, and you are invited to join the challenge.

Please refer to the competition website on Kaggle (www.kaggle.com): IEEE-CIS Fraud Detection, "Can you detect fraud from customer transactions?"

The main objective of this article is to provide a baseline model and methodology for fraud detection using the dataset provided by the competition. I hope it will help anyone who struggles to get started in a machine learning competition or who wants to understand how AI and machine learning can be applied to a real-world project. We entered this competition as a team of 3 members, and we want to share the methodology that helped us finish in the top 3% out of 6381 teams.

Without further ado, let’s get started!

The Aim Of This Competition

In this competition, we build a machine learning model on a large-scale dataset that originates from Vesta’s real-world e-commerce transactions and contains a broad range of features, from device type to product features. We also have the opportunity to create new features from the data to improve our results.

If we find a really good method, we will improve the effectiveness of fraudulent transaction alerts for millions of people around the world, helping thousands of businesses reduce their fraud losses and increase their profits. Furthermore, we will spare other people the same fate and put a smile on their faces.

Environment Setup

Language: Python 3.5.5

Main Libraries:

Numpy
Pandas
Scikit-Learn
Seaborn
Matplotlib
XGBoost
CatBoost

Data Exploration

The dataset can be downloaded from the Kaggle competition website.

The data is divided into two files, identity and transaction, which are joined by TransactionID. Note that not all transactions have corresponding identity data.
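As a rough illustration of what the join on TransactionID looks like, here is a minimal sketch on two toy tables. The values and the tiny frames are made up for the example; only the column names TransactionID, TransactionAmt and DeviceType come from the competition data.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity files (values are made up).
transaction = pd.DataFrame(
    {"TransactionAmt": [50.0, 120.5, 9.99]},
    index=pd.Index([1001, 1002, 1003], name="TransactionID"),
)
identity = pd.DataFrame(
    {"DeviceType": ["mobile", "desktop"]},
    index=pd.Index([1001, 1003], name="TransactionID"),
)

# A left join keeps every transaction; rows without identity data get NaN.
merged = transaction.merge(identity, how="left", left_index=True, right_index=True)
missing_identity = int(merged["DeviceType"].isna().sum())

print(merged)
print("transactions without identity data:", missing_identity)
```

Transaction 1002 has no identity row, so its identity columns come back as NaN rather than the row being dropped.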

Categorical Features — Transaction

ProductCD
card1 - card6
addr1, addr2
P_emaildomain
R_emaildomain
M1 - M9

Categorical Features — Identity

DeviceType
DeviceInfo
id_12 - id_38

The TransactionDT feature is a timedelta from a given reference DateTime (not an actual timestamp).
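Because TransactionDT is only an offset, one common trick is to add it to an arbitrary reference date to derive calendar-like features. The sketch below assumes the offset is measured in seconds, which the competition description does not state, and uses a hypothetical REFERENCE_DATE chosen purely for illustration.

```python
import pandas as pd

# Assumption: TransactionDT counts seconds from an unknown reference point.
# REFERENCE_DATE is hypothetical; the real reference datetime is not published.
REFERENCE_DATE = pd.Timestamp("2017-12-01")

df = pd.DataFrame({"TransactionDT": [86400, 90000, 172800]})

# Pseudo timestamp: reference date plus the offset.
df["pseudo_datetime"] = REFERENCE_DATE + pd.to_timedelta(df["TransactionDT"], unit="s")

# Features that depend only on the offset (given the seconds assumption).
df["day_index"] = df["TransactionDT"] // (24 * 60 * 60)
df["hour_of_day"] = (df["TransactionDT"] // 3600) % 24

print(df)
```

The hour_of_day column is only meaningful if the true reference point falls on a midnight boundary, which is another assumption worth checking against the data.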

Column Description

Transaction Table

TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
TransactionAmt: transaction payment amount in USD
ProductCD: product code, the product for each transaction
card1 — card6: payment card information, such as card type, card category, issue bank, country, etc.
addr: address
dist: distance
P_emaildomain and R_emaildomain: purchaser and recipient email domain
C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
D1-D15: timedelta, such as days between previous transaction, etc.
M1-M9: match, such as names on card and address, etc.
Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.

Categorical Features
ProductCD
card1 — card6
addr1, addr2
P_emaildomain, R_emaildomain
M1 — M9

Identity Table

Variables in this table are identity information — network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions.
They’re collected by Vesta’s fraud protection system and digital security partners.
(The field names are masked and pairwise dictionary will not be provided for privacy protection and contract agreement)

Categorical Features:
DeviceType
DeviceInfo
id_12 - id_38

Acknowledgement: All the information was provided by the Kaggle Competition Website.

Methodology

Machine Learning Framework for Fraud Detection

First, we merge the training data from the transaction file and the identity file on their unique ID (TransactionID).
Once we have the training sample, we split it into 5 roughly equal chunks (folds) using Stratified K-Fold. Stratified K-Fold differs from a normal K-Fold split in that it uses the class labels when splitting: each fold keeps roughly the same proportion of each class (here 0 and 1) as the full dataset. This matters for imbalanced problems such as fraud detection, and it works for multi-class targets as well as binary ones.
After the data has been prepared, we train a model/classifier for each split: it learns from four folds and predicts on the held-out fold and on the test set.
Finally, we ensemble the test predictions by averaging them across the folds to produce the final prediction.
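To see what stratification does in practice, here is a minimal sketch on synthetic labels. The 5% positive rate and the sample count are made-up numbers, not the competition's actual fraud rate.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced labels: 1000 samples, 5% positives (made-up numbers).
y = np.array([1] * 50 + [0] * 950)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_rates = []
for train_idx, val_idx in skf.split(X, y):
    # Share of positives in the held-out fold.
    fold_rates.append(y[val_idx].mean())
    print("fold size:", len(val_idx), "positives:", int(y[val_idx].sum()))

print("positive rate per fold:", fold_rates)
```

Every held-out fold ends up with the same share of positives as the full label vector, which is what keeps the per-fold validation scores comparable on imbalanced data.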

XGBoost Parameter

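clf = xgb.XGBClassifier(
    n_estimators=500,        # number of boosted trees
    max_depth=9,             # maximum depth of each tree
    learning_rate=0.05,
    subsample=0.9,           # fraction of rows sampled for each tree
    colsample_bytree=0.9,    # fraction of columns sampled for each tree
    missing=-999,            # value that XGBoost treats as missing
    random_state=2019,
    tree_method='auto',
    n_jobs=-1,               # use all available CPU cores
)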

Stratified K-Fold = 5 chunks of data

CODE

Import Library

import os
import numpy as np
import pandas as pd
from sklearn import preprocessing
import xgboost as xgb
from catboost import CatBoostClassifier

Loading Data

%%time
train_transaction = pd.read_csv('train_transaction.csv', index_col='TransactionID')
test_transaction = pd.read_csv('test_transaction.csv', index_col='TransactionID')

train_identity = pd.read_csv('train_identity.csv', index_col='TransactionID')
test_identity = pd.read_csv('test_identity.csv', index_col='TransactionID')

sample_submission = pd.read_csv('sample_submission.csv', index_col='TransactionID')

# Left join: keep every transaction, even those without identity data
train = train_transaction.merge(train_identity, how='left', left_index=True, right_index=True)
test = test_transaction.merge(test_identity, how='left', left_index=True, right_index=True)

print(train.shape)
print(test.shape)

y_train = train['isFraud'].copy()
del train_transaction, train_identity, test_transaction, test_identity

# Drop target (NaNs are filled later, in the cleaning step)
X_train = train.drop('isFraud', axis=1)
X_test = test.copy()

del train, test

# Label Encoding: fit on train and test together so both share the same codes
for f in X_train.columns:
    if X_train[f].dtype=='object' or X_test[f].dtype=='object': 
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(X_train[f].values) + list(X_test[f].values))
        X_train[f] = lbl.transform(list(X_train[f].values))
        X_test[f] = lbl.transform(list(X_test[f].values))
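The reason for fitting the encoder on the train and test values together can be shown on a toy column. This is only a sketch: the card brand strings and the two small series are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column; 'discover' appears only in the test split.
train_col = pd.Series(["visa", "mastercard", "visa"])
test_col = pd.Series(["discover", "visa"])

# Fit on the union of both splits so every category gets a code.
encoder = LabelEncoder()
encoder.fit(pd.concat([train_col, test_col]).astype(str))
train_codes = encoder.transform(train_col.astype(str))
test_codes = encoder.transform(test_col.astype(str))
print(list(encoder.classes_), list(train_codes), list(test_codes))

# Fitting on train only fails as soon as the test split has an unseen label.
try:
    LabelEncoder().fit(train_col).transform(test_col)
    unseen_ok = True
except ValueError:
    unseen_ok = False
print("train-only encoder handled test labels:", unseen_ok)
```

Fitting on the combined values is convenient in a competition where the test file is available up front; in a production system the unseen-category case would need its own handling.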

Reducing Memory Usage

%%time
# From kernel https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
# WARNING! THIS CAN DAMAGE THE DATA 
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
X_train = reduce_mem_usage(X_train)
X_test = reduce_mem_usage(X_test)
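The warning in the function header is worth taking seriously: the range check only guards against overflow, not against loss of precision. A small sketch with made-up values shows what a float16 downcast can do to a column:

```python
import numpy as np

# Made-up values that all fit inside the float16 range (max is 65504).
values = np.array([0.1, 1234.56, 2049.0], dtype=np.float64)

# Downcast the way reduce_mem_usage would, then convert back to compare.
as_half = values.astype(np.float16)
round_trip = as_half.astype(np.float64)
max_abs_error = np.abs(values - round_trip).max()

print("original  :", values)
print("as float16:", round_trip)
print("largest absolute error:", max_abs_error)
```

float16 keeps only about three significant decimal digits, so amounts in the thousands are rounded to whole numbers or worse. If exact values matter for a feature, float32 is the safer floor.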

Clean any null values present in the data

# Data cleaning
def clean_inf_nan(df):
    # Replace inf and -inf with NaN so every gap can be filled the same way
    return df.replace([np.inf, -np.inf], np.nan)

X_train = clean_inf_nan(X_train)
X_test = clean_inf_nan(X_test)

# Fill NaNs with the column median, because the mean may be affected by outliers
for i in X_train.columns:
    X_train[i] = X_train[i].fillna(X_train[i].median())
# X_train.isna().sum().sum()  # sanity check: should be 0
for i in X_test.columns:
    X_test[i] = X_test[i].fillna(X_test[i].median())
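The cleaning step can be traced on a toy column. Note that this sketch deliberately uses a variation: it fills the test set with the training medians instead of the test set's own medians as the code above does, so that both sets are imputed with the same value. The numbers are made up.

```python
import numpy as np
import pandas as pd

# Toy data with NaN, infinities and one large outlier (made-up values).
train = pd.DataFrame({"amount": [10.0, np.inf, 30.0, np.nan, 1000.0]})
test = pd.DataFrame({"amount": [np.nan, 20.0, -np.inf]})

# Step 1: turn infinities into NaN.
train = train.replace([np.inf, -np.inf], np.nan)
test = test.replace([np.inf, -np.inf], np.nan)

# Step 2: compute medians on the training data only and reuse them for test.
train_medians = train.median()
train_filled = train.fillna(train_medians)
test_filled = test.fillna(train_medians)

print("median:", train_medians["amount"], "mean:", train["amount"].mean())
print(train_filled["amount"].tolist())
print(test_filled["amount"].tolist())
```

The median of the valid training values stays at a typical amount while the mean is pulled far up by the single outlier, which is the reason given above for preferring the median.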

Main model

#%%time
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

EPOCHS = 5  # number of folds
kf = StratifiedKFold(n_splits = EPOCHS, shuffle = True)

y_preds = np.zeros(sample_submission.shape[0])  # averaged test predictions
y_oof = np.zeros(X_train.shape[0])              # out-of-fold predictions

for tr_idx, val_idx in kf.split(X_train, y_train):
    
    clf = xgb.XGBClassifier(
        n_estimators=500,
        max_depth=9,
        learning_rate=0.05,
        subsample=0.9,
        colsample_bytree=0.9,
        missing=-999,
        random_state=2019,
        tree_method='auto',
        n_jobs = -1,
    )
    
    X_tr, X_vl = X_train.iloc[tr_idx, :], X_train.iloc[val_idx, :]
    y_tr, y_vl = y_train.iloc[tr_idx], y_train.iloc[val_idx]
    
    clf.fit(X_tr, y_tr)
    
    # Predict on the held-out fold and score it
    y_pred_train = clf.predict_proba(X_vl)[:,1]
    y_oof[val_idx] = y_pred_train
    
    print('ROC AUC {}'.format(roc_auc_score(y_vl, y_pred_train)))
    
    # Each fold contributes an equal share to the final test prediction
    y_preds+= clf.predict_proba(X_test)[:,1] / EPOCHS
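The loop above stores out-of-fold predictions in y_oof but never scores them as a whole. The sketch below shows the same loop structure end to end on synthetic data, with scikit-learn's LogisticRegression swapped in for XGBoost so that it runs in a few seconds. The dataset, sizes and stand-in model are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data standing in for the competition files.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=0,
)
X_train, y_train = X[:500], y[:500]
X_test = X[500:]

n_folds = 5
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
y_oof = np.zeros(len(X_train))   # one held-out prediction per training row
y_preds = np.zeros(len(X_test))  # test predictions averaged over folds

for tr_idx, val_idx in skf.split(X_train, y_train):
    model = LogisticRegression(max_iter=1000)  # stand-in for XGBClassifier
    model.fit(X_train[tr_idx], y_train[tr_idx])
    y_oof[val_idx] = model.predict_proba(X_train[val_idx])[:, 1]
    y_preds += model.predict_proba(X_test)[:, 1] / n_folds

# A single score over all out-of-fold predictions.
oof_auc = roc_auc_score(y_train, y_oof)
print("overall out-of-fold ROC AUC:", oof_auc)
```

Since every training row is predicted exactly once by a model that never saw it, the overall out-of-fold score is a useful single-number companion to the per-fold scores.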

Evaluation Metric

To judge how good our model is, we need a suitable evaluation metric. Scikit-Learn provides many useful metrics, but we will use the same metric as the Kaggle competition: the ROC AUC score.

ROC AUC score: the area under the Receiver Operating Characteristic curve (ROC AUC), computed from prediction scores. The higher, the better.
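A tiny worked example may help build intuition for the metric. The four labels and scores below are made up; roc_auc_score is the same Scikit-Learn function used in the training loop.

```python
from sklearn.metrics import roc_auc_score

# Four transactions: two legitimate (0) and two fraudulent (1), made-up scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# The metric depends only on the ranking, so rescaling the scores changes nothing.
auc_rescaled = roc_auc_score(y_true, [s / 2 for s in y_score])

print(auc, auc_rescaled)
```

Of the four possible (fraud, legitimate) pairs, three are ranked correctly and one is not (0.35 against 0.4), so the score is 3/4 = 0.75. A perfect ranking gives 1.0 and a random one about 0.5.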

Result for Each Validation Fold/Chunk

ROC AUC: 0.9645106433450377
ROC AUC: 0.9623382759183995
ROC AUC: 0.9628282124302243
ROC AUC: 0.9620205684446603
ROC AUC: 0.9622885595697329

Private LeaderBoard Score and Rank

ROC AUC: 0.929200

RANK: 153 out of 6381 teams

Achievement: Silver Medal

Total Team Members: 3

Conclusion

In conclusion, machine learning and deep learning have shown promising results in solving real-world problems, and a fraud detection system is a good example of where machine learning can be applied. Although it can’t solve everything we face right now, I believe that each invention is a stepping stone for the technology of the next generation. Hopefully, in the near future, we will be able to implement a more powerful system that stops more fraud before it happens.

Cheers!!!

Please follow me on Medium if you want to keep up to date with my articles and projects :D