Tutorial: Xente Fraud Predictor

Byte Brilliance
9 min read · Jan 4, 2024


Note: This article is part of a series designed to guide aspiring Data Scientists. The series is structured so that Part 1 is for complete beginners, with the complexity increasing in later parts. Feel free to explore different parts of the series according to your experience level.

Some keywords:
1. The number of rows in your dataset is the number of samples you have.
2. The number of columns in your dataset is the number of features you have.
3. The target is what you are going to predict.
4. Any data points that are missing are called nulls.
5. Hyperparameter optimisation is the systematic process of finding the best settings for your machine learning model, allowing it to perform at its peak.

Introduction

Welcome to the third part of the series. In this article, we will explore an excellent project for beginners who are ready to start coding. If you still need an introduction to Data Science or help setting up your workspace, please read Parts 1 and 2 of the series before continuing with this tutorial.

Zindi stands out as Africa’s pioneering platform for data science competitions. Serving as a comprehensive ecosystem, Zindi brings together a diverse community of scientists, engineers, academics, companies, NGOs, governments, and institutions united in their dedication to addressing the most critical challenges facing the continent through data-driven solutions.

One such competition is the Xente Fraud Prediction challenge. The objective of this competition is to create a machine learning model to detect fraudulent transactions.

Fraud detection is an important application of machine learning in the financial services sector. This solution will help Xente provide improved and safer service to its customers.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) serves as a crucial initial step in the data analysis process, offering valuable insights into the characteristics, patterns, and potential outliers within a dataset. The primary purpose of EDA is to understand the structure and nature of the data before diving into more complex analyses or modeling:

  1. Data Understanding: EDA helps in gaining a deeper understanding of the dataset by providing a snapshot of its features, distributions, and relationships. This understanding lays the foundation for informed decision-making throughout the analysis process.
  2. Pattern Recognition: EDA allows data analysts to identify patterns, trends, and anomalies within the data. This insight is instrumental in formulating hypotheses and guiding further investigations.
  3. Data Cleaning and Preprocessing: Through EDA, analysts can uncover missing values, outliers, and inconsistencies in the data. Addressing these issues during the EDA phase ensures a cleaner dataset, reducing the risk of biased or inaccurate results in subsequent analyses.
  4. Feature Selection: EDA aids in identifying relevant features or variables for analysis. By examining the correlation between different variables, analysts can make informed decisions about which features to include or exclude in their models.
  5. Visual Representation: EDA often involves creating visualizations such as histograms, scatter plots, and box plots to represent the data visually. These visualizations make complex patterns more accessible and enhance the communication of findings to a wider audience.
  6. Hypothesis Generation: EDA helps in generating hypotheses or research questions by revealing interesting patterns or relationships within the data. These hypotheses can then be tested more rigorously in subsequent analyses.
  7. Decision Support: For businesses and organizations, EDA provides a data-driven foundation for decision-making. Understanding the inherent characteristics of the data helps in making more informed choices regarding strategies, policies, or interventions.

In summary, Exploratory Data Analysis is a critical phase in the data analysis workflow that goes beyond mere data exploration. It is a systematic approach to understanding the underlying structure of the data, extracting meaningful insights, and ensuring the data is appropriately prepared for further analysis or modeling.
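To produce the outputs shown below, we can load the training data with pandas and call describe() and info(). Here is a minimal sketch (the file name training.csv is an assumption; use whatever name your Zindi download has):

import pandas as pd

# Load the Xente training data (file name assumed)
train = pd.read_csv('training.csv')

# Summary statistics for the numerical columns
print("Summary statistics:")
print(train.describe())

print("=" * 50)

# Column names, non-null counts and data types
train.info()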

  1. Summary Statistics

Summary statistics:
       CountryCode    Amount     Value  PricingStrategy  FraudResult
count        95662     95662     95662            95662        95662
mean           256      6718      9901                2            0
std              0    123307    123122                1            0
min            256  -1000000         2                0            0
25%            256       -50       275                2            0
50%            256      1000      1000                2            0
75%            256      2800      5000                2            0
max            256   9880000   9880000                4            1
==================================================

The above statistics give us an idea of the spread of the data, which will help us when designing the Machine Learning model later on.
2. Dataset Info

 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   TransactionId         95662 non-null  object
 1   BatchId               95662 non-null  object
 2   AccountId             95662 non-null  object
 3   SubscriptionId        95662 non-null  object
 4   CustomerId            95662 non-null  object
 5   CurrencyCode          95662 non-null  object
 6   CountryCode           95662 non-null  int64
 7   ProviderId            95662 non-null  object
 8   ProductId             95662 non-null  object
 9   ProductCategory       95662 non-null  object
 10  ChannelId             95662 non-null  object
 11  Amount                95662 non-null  float64
 12  Value                 95662 non-null  int64
 13  TransactionStartTime  95662 non-null  object
 14  PricingStrategy       95662 non-null  int64
 15  FraudResult           95662 non-null  int64

The above info shows us the data types of the columns (features) in our dataset. It also shows us that there are no missing (null) data points. Because Machine Learning models expect numerical input, we will have to transform all the object (text) columns into numerical ones.

Feature Engineering

As discovered in our EDA, some variables are text rather than numerical. To address this, we will use scikit-learn's OrdinalEncoder to convert these columns to numerical values, fitting a separate encoder per column so that the same mappings can later be applied to the test set:

from sklearn.preprocessing import OrdinalEncoder

# Fit one ordinal encoder per text column and keep each encoder
# so the same mapping can be re-used on the test set later
categorical_columns = ['CustomerId', 'ProviderId', 'ProductId',
                       'ChannelId', 'ProductCategory']
encoders = {}

for col in categorical_columns:
    encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
    values = transformed_data.loc[:, col].values.reshape(-1, 1)
    transformed_data.loc[:, col] = encoder.fit_transform(values).ravel()
    encoders[col] = encoder

Furthermore, the TransactionStartTime column stores dates as text in a datetime format, and we also want to convert this to numerical features. To achieve this, we will extract the year, month, day, hour, minute and second components:

import pandas as pd

# Strip the ISO 8601 'T' and 'Z' markers, then parse the column to datetime
transformed_data['TransactionStartTime'] = transformed_data['TransactionStartTime'].str.replace('T', ' ')
transformed_data['TransactionStartTime'] = transformed_data['TransactionStartTime'].str.replace('Z', '')
transformed_data['TransactionStartTime'] = pd.to_datetime(transformed_data['TransactionStartTime'])

# Extract the year, month, day, hour, minute and second components
transformed_data['Year'] = transformed_data['TransactionStartTime'].dt.year
transformed_data['Month'] = transformed_data['TransactionStartTime'].dt.month
transformed_data['Day'] = transformed_data['TransactionStartTime'].dt.day
transformed_data['Hour'] = transformed_data['TransactionStartTime'].dt.hour
transformed_data['Minute'] = transformed_data['TransactionStartTime'].dt.minute
transformed_data['Seconds'] = transformed_data['TransactionStartTime'].dt.second

Metrics

  1. Accuracy: Measures how many predictions you got right (both positive and negative) out of all your predictions (i.e. tells you the percentage of the time your model makes correct predictions).
  2. Precision: Measures how many of the things you predicted as positive are actually true positives (i.e. if you say something will happen, precision tells you how often you’re right).
  3. Recall: Measures how many of the true positives you actually caught (i.e. if something can happen, recall tells you how often you’re able to spot it).
  4. F1 Score: A balance between precision and recall. When you want both high accuracy in predicting positive cases and a good ability to catch all the positive cases.
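All four metrics are available in scikit-learn. A tiny, self-contained example (the labels below are made up, with 1 = fraud and 0 = non-fraud):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels purely for illustration: 1 = fraud, 0 = non-fraud
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")   # 6 of 8 predictions correct
print(f"Precision: {precision_score(y_true, y_pred):.4f}")  # 2 of 3 predicted frauds are real
print(f"Recall   : {recall_score(y_true, y_pred):.4f}")     # 2 of 3 real frauds were caught
print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")         # harmonic mean of precision and recall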

Base Model

A “base model” in the Machine Learning world simply means a model that is fit to the data before any ‘advanced’ data pre-processing techniques are applied. Furthermore, while technically any model could serve as a base model, it is commonplace to use simple models such as Logistic Regression, Decision Trees and K-Nearest Neighbours (KNN).

In this instance, I have decided to use Logistic Regression as the base model. Logistic Regression estimates the probability of an event occurring, such as a transaction being fraudulent or not, based on a given dataset of independent variables. It passes a weighted sum of the input features through the sigmoid function to produce a probability between 0 and 1, and the transaction is classified as fraud when that probability exceeds a chosen threshold. So, go ahead and let Logistic Regression be your guide in the enchanted realm of binary predictions!
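A minimal sketch of fitting and evaluating the base model, using the X_train/X_test split created in the SMOTE section further down (max_iter=1000 is an assumption to help the solver converge, not necessarily the exact setting behind the reported numbers):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Fit the base model on the raw (un-resampled) training split
base_model = LogisticRegression(max_iter=1000, random_state=42)
base_model.fit(X_train, y_train)

# Evaluate on the held-out, real-world test split
y_pred = base_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=4))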

The results obtained from the base model:

  • Accuracy : 99.79%
  • Precision : 0.4651
  • Recall : 0.3125
  • F1 Score : 0.3738
  • Zindi test score : 52.83%
Base Model Test Predictions Confusion Matrix

Although the accuracy is 99.79%, we can see from the other metrics and the confusion matrix that our base model has room for improvement. This highlights the importance of using more than one metric to track the performance of a machine learning model, especially when working with imbalanced data.

Synthetic Minority Over-sampling Technique (SMOTE)

In most fraud prediction tasks, there are more cases of non-fraud than fraud (i.e. non-fraud is more prevalent than fraud). In order to ensure that a Machine Learning model learns the appropriate characteristics that separate fraud from non-fraud, SMOTE looks at the existing fraud data points and creates artificial data points with similar characteristics. In this way, the model becomes more fair and accurate in recognising both fraud and non-fraud.

While SMOTE is a valuable tool for addressing class imbalances, we must be sure to avoid data leakage. In simple terms, we may train the model on the resampled (real plus artificial) data points, but when testing the model we must only use the real-world data. While the synthetic data points share similar characteristics with the real-world data, they do not provide an accurate representation of the data our model will encounter in the real world.

To avoid data leakage, we can split our dataset into training and testing sub-sets, and then apply SMOTE to the training set only:

from sklearn.model_selection import train_test_split 
from imblearn.over_sampling import SMOTE

# Split the dataset into training and testing sub-sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

# Apply SMOTE to the training sub-set only
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

Advanced Model

After applying SMOTE and some other data pre-processing techniques to our dataset, we are now ready to train a more advanced Machine Learning model to try to improve upon the results obtained by the base model.

I have chosen to use a LightGBM model with a GridSearch for hyperparameter optimisation. I used an Nvidia GeForce RTX 3060 with 12 GB of memory to speed up training, but if you don’t have a GPU you can use a CPU by changing this line of code from

lgb.LGBMClassifier(device='GPU',gpu_platform_id=1, gpu_device_id=0, verbose=-1, random_state=42)

to

lgb.LGBMClassifier(device='CPU', verbose=-1, random_state=42)
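For the hyperparameter optimisation itself, the classifier can be wrapped in scikit-learn's GridSearchCV and fit on the SMOTE-resampled training data. A minimal sketch, using an illustrative parameter grid rather than the exact grid behind the reported results:

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

# Illustrative grid (not the exact values used for the reported scores)
param_grid = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [31, 63, 127],
}

clf = lgb.LGBMClassifier(device='CPU', verbose=-1, random_state=42)

# Optimise for F1 because the classes are heavily imbalanced
grid_search = GridSearchCV(clf, param_grid, scoring='f1', cv=3, n_jobs=-1)
grid_search.fit(X_resampled, y_resampled)

best_model = grid_search.best_estimator_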

The results obtained from the LightGBM model compared to the Base Model:

+---------------+--------+----------+
| Metric        | Base   | Advanced |
+===============+========+==========+
| Public Score  | 52.83% | 65.45%   |
+---------------+--------+----------+
| Private Score | 57.99% | 63.16%   |
+---------------+--------+----------+
| Accuracy      | 99.79% | 99.94%   |
+---------------+--------+----------+
| Precision     | 0.4651 | 0.7895   |
+---------------+--------+----------+
| Recall        | 0.3125 | 0.9375   |
+---------------+--------+----------+
| F1 Score      | 0.3738 | 0.8571   |
+---------------+--------+----------+

As you can see, by using SMOTE and a LightGBM model we have significantly improved on our metrics. Let’s look at the Confusion Matrix for the advanced model too:
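The matrix can be plotted straight from the fitted model with scikit-learn's ConfusionMatrixDisplay (best_model here is the tuned classifier from the grid search sketch above):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix for the tuned model on the held-out test split
ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test)
plt.title('Advanced Model Test Predictions Confusion Matrix')
plt.show()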

Much better!

Conclusion

In this tutorial we looked at the Zindi Xente Fraud Prediction challenge where the objective is to create a machine learning model to detect fraudulent transactions.

We explored some EDA techniques to understand the underlying structure of the data. We also implemented data pre-processing techniques to transform our data into a form that is ready for machine learning. Because the dataset was highly imbalanced, we used SMOTE to provide a training set with equal representation of both fraud and non-fraud samples.

Training a Logistic Regression (i.e. the base model), we found that, despite its high accuracy, the model struggled to correctly identify the fraudulent transactions. Using SMOTE and a GridSearch to tune the hyperparameters of a LightGBM model, we found a significant improvement in the performance of the model across all metrics!

Thank you for following this tutorial from Byte Brilliance. I invite you to explore further data pre-processing techniques and advanced machine learning algorithms to further improve on the metrics we obtained here!

All code is available in my GitHub Repo.

Remember to follow for more interesting Data Science related content!

