Santander Customer Transaction Prediction : A Simple Machine Learning Solution

Anirban Ghosh · Nov 23, 2019



Problem Statement:

In this competition, Santander Bank challenges Kagglers to help it identify which customers will make a transaction with the bank in the future, irrespective of the amount of money transacted previously. The dataset provided has the same structure as the real data available for this problem, although the data given to us is completely anonymized and contains only numeric values.

The data is anonymous, with no customer details revealed to the participants of the competition. Both the train and test sets contain 200,000 rows. The train set has 202 columns: 200 feature columns (var_0 to var_199), one ID_code column, and one target column, which is the outcome of the transaction. The test set has the same columns except for target.

Data Acquisition:

The dataset for the problem is available at the link below:

https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

Evaluation Metric:

Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.

Submission File Format:

We must make a prediction of the target for each ID_code in the test set. The submission file should be in the format below:
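The competition's sample_submission.csv has an ID_code column and a target column. A small illustrative excerpt (the target values below are placeholders, not real predictions) looks like this:

```
ID_code,target
test_0,0
test_1,0
test_2,1
```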

and so on for every ID_code in the test set.

Data Set Description:

In this problem we are provided with an anonymized dataset containing 200 numeric feature variables, a binary target column, and a string ID_code column for each data point in the train set. The test set contains the same 200 numeric feature variables and the ID_code column, but no target column, since that is what we have to predict using the train set.

File Description:

We are provided with 2 csv files:

1. train.csv — the training dataset

2. test.csv — the dataset on which we make our predictions. It contains some rows that are not considered for scoring in the Kaggle competition, and around 50% of the scored data is used for the public leaderboard.

Sample Train Data:

Sample Train Data

Sample Test Data:

Sample Test Data

Real Life Significance of the problem:

This project can help the company in the following ways:

1. Segmenting customers into small groups and addressing individual customers based on actual behaviors — instead of hard-coding any preconceived notions or assumptions of what makes customers similar to one another, and instead of only looking at aggregated data which hides important facts about individual customers.

2. Accurately predicting the future behavior of customers (e.g., transaction prediction) using predictive customer behavior modeling techniques — instead of just looking in the rear-view mirror of historical data.

3. Using advanced calculations to determine the customer lifetime value (LTV) of every customer and basing decisions on it — instead of looking only at the short-term revenue that a customer may bring the organization.

4. Knowing, based on objective metrics, exactly what marketing actions to do now, for each customer, in order to maximize the long-term value of every customer.

5. Using marketing machine learning technology that will reveal insights and make recommendations for improving customer marketing that human marketers are unlikely to spot on their own.

With this brief introduction to the problem, let us proceed to the approach we have taken for solving it. The first step to understanding any data is exploratory data analysis (EDA). Before that, however, we observed that the dataset is fairly large, so we reduce its memory footprint while loading it.

Load Data and Reduce Memory Usage:

Function for Loading the Data Set with reduced Memory Usage
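The exact function used in the post is in the screenshot above; a minimal sketch of such a helper, in the spirit of the reduce_mem_usage function commonly shared in Kaggle kernels, could look like this (the downcasting thresholds are assumptions, not the post's exact code):

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """Downcast numeric columns to the smallest dtype that can safely hold their values."""
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtype
        if col_type == object:
            continue
        c_min, c_max = df[col].min(), df[col].max()
        if str(col_type).startswith('int'):
            if np.iinfo(np.int8).min < c_min and c_max < np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif np.iinfo(np.int16).min < c_min and c_max < np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif np.iinfo(np.int32).min < c_min and c_max < np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)
        else:
            if np.finfo(np.float16).min < c_min and c_max < np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)
            elif np.finfo(np.float32).min < c_min and c_max < np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print(f'Memory usage reduced from {start_mem:.2f} MB to {end_mem:.2f} MB')
    return df
```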

Loading the Data using the Function:

Loading the Test and Train Data
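Assuming the CSV files sit in the working directory, loading them looks like:

```python
# Load the competition files through the memory-reduction helper
train_df = reduce_mem_usage(pd.read_csv('train.csv'))
test_df = reduce_mem_usage(pd.read_csv('test.csv'))
print(train_df.shape, test_df.shape)
```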

Exploratory Data Analysis:

Why EDA is Important:

Performing EDA is a basic and important step in any machine learning or deep learning project. It gives us an overall view of the data and helps us uncover underlying information that might be present in it. It helps us in the following ways:

1. Get an overall view of the data

2. Focus on describing our sample (the actual data we observe) as opposed to making inferences about some larger population or predictions about future data to be collected.

3. Identify unusual and extreme cases (outliers, quartile information, etc.)

4. Identify the obvious errors in our data which we might have missed otherwise

5. It also gives us a wide viewpoint of the data and shows us how to proceed with the next steps for the problem.

In our case, we follow the steps below to perform the EDA for the problem at hand:

Step 1: We start with the target distribution for our data set:

Target Distribution

We can see from the above plot that around 90% of our data has target 0 (the customer did not make a transaction) and around 10% has target 1 (the customer did make a transaction). This confirms that the problem at hand is a binary classification problem and that the data is highly imbalanced.

The exact counts of the target values are as follows:

Outcome 0: 179,902

Outcome 1: 20,098
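A short sketch of how the plot and the counts above can be produced (using seaborn's countplot; the post's exact plotting code is in the screenshot):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Bar plot of the binary target, plus the exact class counts
sns.countplot(x='target', data=train_df)
plt.title('Target distribution')
plt.show()

print(train_df['target'].value_counts())
```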

Step 2: Once we know the distribution of the target, we try to find the distribution of the individual variables.

Code:

Sample feature distributions split by target
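A sketch of how such per-feature distribution plots can be drawn, here for an illustrative subset of six variables (the screenshot above shows a sample of such plots):

```python
# Density plots of a handful of features, split by target
sample_features = ['var_0', 'var_1', 'var_2', 'var_3', 'var_4', 'var_5']
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, feat in zip(axes.ravel(), sample_features):
    sns.kdeplot(train_df.loc[train_df['target'] == 0, feat].astype('float32'), ax=ax, label='target = 0')
    sns.kdeplot(train_df.loc[train_df['target'] == 1, feat].astype('float32'), ax=ax, label='target = 1')
    ax.set_title(feat)
    ax.legend()
plt.tight_layout()
plt.show()
```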

We can see from this that the distribution is nearly Gaussian for all variables, for both outcome 0 and outcome 1.

Step 3: We also try to find out the basic statistics of the data set provided.

Train Dataset Description
Test Dataset Description
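A sketch of the describe() call, with the float64 cast explained just below:

```python
# Basic statistics of the feature columns; cast back to float64 so describe() works reliably
feature_cols = [c for c in train_df.columns if c.startswith('var_')]
print(train_df[feature_cols].astype(np.float64).describe())
print(test_df[feature_cols].astype(np.float64).describe())
```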

The reason we use astype(np.float64) is that, after the memory reduction step, the features are stored as float16. Pandas' describe() does not handle float16 well (the summary statistics can overflow), so as a small workaround we temporarily cast the variables to float64 before calling describe(). From the above we can observe that:

  • the standard deviation is relatively large for both the train and test variables;
  • the min, max, mean, and std values for the train and test data look quite close;

Step 4: Finding Missing values:

For any problem at hand it is very important to check for missing values, as they are one of the most common issues in any machine learning problem. So let us find out whether there are any missing values in our dataset:

Code for missing value
Missing values in the train set
Missing values in the Test set
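A minimal sketch of the missing-value check:

```python
# Total number of missing values in each dataset
print('Missing values in train:', train_df.isnull().sum().sum())
print('Missing values in test:', test_df.isnull().sum().sum())
```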

We can observe from the above that there are no missing values in either the train or test data.

Step 5: Duplicate values

Another frequent problem in machine learning is duplicate values. Let us check our dataset for them:

Duplicate values in the Train set
Duplicate values in the Test set
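One way to count duplicated values per feature, sketched here for the train set (the same loop can be run on the test set for the comparison below):

```python
# For each feature, count how many distinct values occur more than once
dup_counts = {}
for feat in feature_cols:
    value_counts = train_df[feat].value_counts()
    dup_counts[feat] = (value_counts > 1).sum()
dup_counts = pd.Series(dup_counts).sort_values(ascending=False)
print(dup_counts.head(10))
```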

We can see that the same columns in the train and test sets have the same, or very close, numbers of duplicates of the same, or very close, values. This is an interesting pattern that we might be able to use later.

Feature Engineering:

Let us begin with some feature engineering. As all the data is numeric, we start with simple row-wise aggregate features such as the following (a sketch of the computation is shown below the list):

1. Sum

2. Min

3. Max

4. Mean

5. Standard Deviation

6. Skewness

7. Kurtosis

8. Median

Feature engineering code
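A sketch of these row-wise aggregates, computed on float64 copies of the features (note that in the post's run, which worked on the memory-reduced float16 data, the kurtosis column came out as NaN and is dropped below):

```python
# Row-wise aggregate features over the 200 raw variables, for both train and test
for df in [train_df, test_df]:
    row_values = df[feature_cols].astype(np.float64)
    df['sum'] = row_values.sum(axis=1)
    df['min'] = row_values.min(axis=1)
    df['max'] = row_values.max(axis=1)
    df['mean'] = row_values.mean(axis=1)
    df['std'] = row_values.std(axis=1)
    df['skew'] = row_values.skew(axis=1)
    df['kurtosis'] = row_values.kurtosis(axis=1)
    df['median'] = row_values.median(axis=1)
```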

Let us verify the newly added features in our train and test data:

Train set
Test Data set

We drop the kurtosis feature from both the train and test data, as its values are NaN.

Dropping the Kurtosis feature
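The drop itself is a one-liner per dataframe (the column name 'kurtosis' is an assumption from the feature list above):

```python
# Drop the kurtosis feature from both datasets since its values are NaN
train_df.drop('kurtosis', axis=1, inplace=True)
test_df.drop('kurtosis', axis=1, inplace=True)
```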

Distribution of the newly created features in Train data:

Distribution in Test data:

We can see from the above that the dimensionality of the data is very high. We try dimensionality reduction using t-SNE to better visualize the data and understand how the target classes are spread in our dataset.

Code:

Code for TSNE
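A sketch of the t-SNE projection; since fitting t-SNE on all 200,000 rows would be far too slow, a small random subsample is used here (the sample size of 5,000 is an assumption, not the post's setting):

```python
from sklearn.manifold import TSNE

# 2-D t-SNE embedding of a random subsample, coloured by target
tsne_sample = train_df.sample(n=5000, random_state=42)
embedding = TSNE(n_components=2, random_state=42).fit_transform(
    tsne_sample[feature_cols].astype(np.float64))

plt.figure(figsize=(8, 6))
plt.scatter(embedding[:, 0], embedding[:, 1], c=tsne_sample['target'], cmap='coolwarm', s=5)
plt.title('t-SNE projection coloured by target')
plt.show()
```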

Observation: As we can see above, the data cannot be separated using t-SNE. The points overlap massively, with positive points concentrated in the middle and negative points surrounding them.

Solution with Simple Feature Engineering:

1. Light GBM Model:

What is LGBM:

LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

· Faster training speed and higher efficiency.

· Lower memory usage.

· Better accuracy.

· Support of parallel and GPU learning.

· Capable of handling large-scale data.

Difference from GBDT and Decision Trees:

Light GBM grows trees leaf-wise (vertically), while most other algorithms grow them level-wise (horizontally). It chooses the leaf with the maximum delta loss to grow. When growing the same leaf, a leaf-wise algorithm can reduce more loss than a level-wise algorithm.

Hyperparameters of our Light GBM Model:

LGBM Hyperparameters

Modelling:

Code for modelling
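A sketch of a stratified k-fold LightGBM setup in the spirit of the screenshots above. The hyperparameter values here are illustrative assumptions, not the exact settings used in the post:

```python
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

# Illustrative hyperparameters (treat as a starting point, not the post's exact values)
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting': 'gbdt',
    'learning_rate': 0.01,
    'num_leaves': 13,
    'max_depth': -1,
    'feature_fraction': 0.05,
    'bagging_fraction': 0.4,
    'bagging_freq': 5,
    'verbosity': -1,
}

X = train_df.drop(['ID_code', 'target'], axis=1).astype(np.float32)
y = train_df['target']
X_test = test_df.drop(['ID_code'], axis=1).astype(np.float32)

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(X))
test_preds = np.zeros(len(X_test))

for fold, (trn_idx, val_idx) in enumerate(folds.split(X, y)):
    trn_data = lgb.Dataset(X.iloc[trn_idx], label=y.iloc[trn_idx])
    val_data = lgb.Dataset(X.iloc[val_idx], label=y.iloc[val_idx])
    model = lgb.train(params, trn_data, num_boost_round=10000,
                      valid_sets=[val_data],
                      callbacks=[lgb.early_stopping(300)])
    oof[val_idx] = model.predict(X.iloc[val_idx], num_iteration=model.best_iteration)
    test_preds += model.predict(X_test, num_iteration=model.best_iteration) / folds.n_splits

print('Out-of-fold AUC:', roc_auc_score(y, oof))
```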

Training Result:

Training Result

Feature Importance:

Feature Importance

Prediction on Test Data:
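Building the submission file from the averaged fold predictions is then straightforward (the file name is arbitrary):

```python
# Write the averaged fold predictions into a Kaggle submission file
submission = pd.DataFrame({'ID_code': test_df['ID_code'], 'target': test_preds})
submission.to_csv('submission_lgbm.csv', index=False)
```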

With this prediction we manage to get a Kaggle AUC score of 0.89781, with a rank of 1278.

This prediction is not good enough. We also try some classical classification algorithms to check whether they give us a better prediction.

2. Logistic Regression:

What is LR?

Logistic regression is another technique borrowed by machine learning from the field of statistics. It is the go-to method for binary classification problems (problems with two class values).

The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits:

1 / (1 + e^(-value))

where e is the base of the natural logarithms (Euler's number, or the EXP() function in your spreadsheet) and value is the actual numerical value that you want to transform. Plotting the numbers between -5 and 5 transformed into the range 0 to 1 using the logistic function produces this S-shaped curve.

Representation Used by Logistic Regression:

Logistic regression uses an equation as the representation, very much like linear regression.

Input values (x) are combined linearly using weights or coefficient values (referred to as the Greek capital letter Beta) to predict an output value (y). A key difference from linear regression is that the output value being modeled is a binary value (0 or 1) rather than a numeric value.

Below is an example logistic regression equation:

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

where y is the predicted output, b0 is the bias or intercept term, and b1 is the coefficient for the single input value (x). Each column in your input data has an associated b coefficient (a constant real value) that must be learned from your training data.

Hyperparameter Tuning:

LR Hyperparameter tuning
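A sketch of a grid search over the regularization strength C; the exact grid and solver settings used in the post are in the screenshot above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Standardize the features, then tune C with AUC as the scoring metric
X_scaled = StandardScaler().fit_transform(X)
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
lr_grid = GridSearchCV(LogisticRegression(max_iter=1000),
                       param_grid, scoring='roc_auc', cv=3, n_jobs=-1)
lr_grid.fit(X_scaled, y)
print(lr_grid.best_params_, lr_grid.best_score_)
```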

Train Result:

Test Result:

Prediction on Test Data:

With this we get a Kaggle score of 0.66678 and a rank of 4588.

Let us try one more classification algorithm that works well on numeric data, Gaussian Naïve Bayes.

3. Gaussian Naïve Bayes:

What is Gaussian Naïve Bayes:

Naive Bayes can be extended to real-valued attributes, most commonly by assuming a Gaussian distribution. This extension of naive Bayes is called Gaussian Naive Bayes. Other functions can be used to estimate the distribution of the data, but the Gaussian (or Normal distribution) is the easiest to work with because you only need to estimate the mean and the standard deviation from your training data.

Representation of Gaussian Naïve Bayes:

For categorical inputs, naive Bayes calculates the probability of each input value for each class using frequencies. With real-valued inputs, we can instead calculate the mean and standard deviation of the input values (x) for each class to summarize the distribution. This means that, in addition to the probabilities for each class, we must also store the mean and standard deviation of every input variable for each class.

Learning a Gaussian Naïve Bayes Model from Data:

This is as simple as calculating the mean and standard deviation of each input variable (x) for each class value:

mean(x) = (1/n) * sum(x)

where n is the number of instances and x are the values of an input variable in your training data.

We can calculate the standard deviation using the following equation:

standard deviation(x) = sqrt((1/n) * sum((xi - mean(x))^2))

This is the square root of the average squared difference of each value of x from the mean value of x, where n is the number of instances, sqrt() is the square root function, sum() is the sum function, xi is a specific value of the x variable for the i'th instance, mean(x) is described above, and ^2 is the square.

Making Predictions from Gaussian Naïve Bayes:

Probabilities of new x values are calculated using the Gaussian Probability Density Function (PDF).

When making predictions, these parameters can be plugged into the Gaussian PDF with a new input value for the variable, and in return the Gaussian PDF provides an estimate of the probability of that new input value for that class:

pdf(x, mean, sd) = (1 / (sqrt(2 * PI) * sd)) * exp(-((x - mean)^2 / (2 * sd^2)))

where pdf(x) is the Gaussian PDF, sqrt() is the square root, mean and sd are the mean and standard deviation calculated above, PI is the numerical constant, exp() is Euler's number e raised to a power, and x is the input value for the input variable.

We can then plug in the probabilities into the equation above to make predictions with real-valued inputs.

For example, for a toy problem with two real-valued inputs, weather and car, and a class go-out:

go-out = P(pdf(weather)|class=go-out) * P(pdf(car)|class=go-out) * P(class=go-out)

Hyperparameter Tuning:
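Gaussian Naive Bayes has essentially one knob to tune, var_smoothing; a sketch with an illustrative grid (the exact settings used in the post are not reproduced here):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV

# Tune the variance smoothing term with AUC as the scoring metric
param_grid = {'var_smoothing': np.logspace(-9, -3, 7)}
gnb_grid = GridSearchCV(GaussianNB(), param_grid, scoring='roc_auc', cv=3, n_jobs=-1)
gnb_grid.fit(X.astype(np.float64), y)
print(gnb_grid.best_params_, gnb_grid.best_score_)
```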

Train Result:

Test Prediction:

With Naïve Bayes we are able to get a Kaggle AUC of 0.67869 and a rank of 4466.

Hence, we can conclude that with simple feature engineering and classical classification models we are not getting the desired result, so we move on to our second solution.

Solution 2: Fake Test Removal and Data Augmentation:

Why Fake Test Removal:

The statistics of the training set and the test set are very similar. However, one thing that caught our eye was that the distribution of the number of unique values (across features) is significantly different between the training set and the test set. It seems that the test set consists of real samples as well as synthetic samples generated by sampling the feature distributions of the real samples (these are probably the "rows which are not included in scoring"). If this is correct, then finding out which samples are synthetic and which are real should be a relatively easy task:

Given a sample, we can go over its features and check whether each feature value is unique. If at least one of the sample's features is unique, then the sample must be a real sample. It turns out that if a given sample has no unique values, then it is a synthetic sample. (It does not have to be like that, but in this dataset the probability that this is not the case is apparently very low.) This way the unusual bumps on the distribution peaks of the test set features are captured. The magic features are extracted from the combination of the training set and the real samples in the test set.

Code:
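A sketch of the uniqueness check described above, following the widely shared public-kernel approach of flagging rows with at least one unique feature value as real:

```python
# A test row is treated as real if at least one of its feature values is unique within the
# test set; rows with no unique values are treated as synthetic
test_values = test_df[feature_cols].values.astype(np.float64)
has_unique = np.zeros(test_values.shape, dtype=bool)
for col in range(test_values.shape[1]):
    _, first_index, counts = np.unique(test_values[:, col],
                                       return_index=True, return_counts=True)
    has_unique[first_index[counts == 1], col] = True

real_idx = np.argwhere(has_unique.sum(axis=1) > 0).ravel()
synthetic_idx = np.argwhere(has_unique.sum(axis=1) == 0).ravel()
print('Real samples:', len(real_idx), 'Synthetic samples:', len(synthetic_idx))
```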

So we can see from the above that there are 100,000 real samples and 100,000 synthetic samples in our test set. Let us also check how the real samples are split for scoring:

We can see that 50,000 of these samples fall into the public leaderboard set and 50,000 into the private leaderboard set.

Data Augmentation:

Data augmentation means increasing the number of data points. Since the data is imbalanced, oversampling increases the CV and leaderboard scores significantly. This oversampling technique was shared by Jiwei Liu in his kernel.
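A sketch of that augmentation idea: new rows of a class are created by shuffling each feature column independently among rows of that class, which preserves the per-feature distributions while breaking row structure. The parameter t below, controlling how many extra copies are created, is an assumption:

```python
# Class-wise oversampling by shuffling each feature column independently within a class
def augment(x, y, t=2):
    xs, xn = [], []
    for _ in range(t):
        x1 = x[y == 1].copy()
        for c in range(x1.shape[1]):
            np.random.shuffle(x1[:, c])  # in-place column shuffle keeps the marginal distribution
        xs.append(x1)
    for _ in range(t // 2):
        x1 = x[y == 0].copy()
        for c in range(x1.shape[1]):
            np.random.shuffle(x1[:, c])
        xn.append(x1)
    xs, xn = np.vstack(xs), np.vstack(xn)
    ys, yn = np.ones(xs.shape[0]), np.zeros(xn.shape[0])
    return np.vstack([x, xs, xn]), np.concatenate([y, ys, yn])
```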

Once this is done, let us implement our LGBM solution again, which previously got us a Kaggle score of 0.89781.

Implementation:

Hyperparameters:

Training:
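A sketch of how the augmentation plugs into the fold loop: only the training split of each fold is augmented, while the validation and test data are left untouched:

```python
# Re-run the LightGBM folds, augmenting only the training split of each fold
test_preds_aug = np.zeros(len(X_test))
for fold, (trn_idx, val_idx) in enumerate(folds.split(X, y)):
    X_trn, y_trn = augment(X.iloc[trn_idx].values, y.iloc[trn_idx].values)
    trn_data = lgb.Dataset(X_trn, label=y_trn)
    val_data = lgb.Dataset(X.iloc[val_idx], label=y.iloc[val_idx])
    model = lgb.train(params, trn_data, num_boost_round=10000,
                      valid_sets=[val_data],
                      callbacks=[lgb.early_stopping(300)])
    test_preds_aug += model.predict(X_test, num_iteration=model.best_iteration) / folds.n_splits
```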

Prediction on Test Data set:

Kaggle Score:

With this we are able to achieve a Kaggle score of 0.92180, with a rank of 70.

Conclusion:

We can still improve the model with a bit more hyperparameter tuning of the LGBM model, and we can also try some deep learning models to help us get a better score.

References:

1. https://medium.com/analytics-vidhya/santander-customer-transaction-prediction-an-end-to-end-machine-learning-project-2cb763172f8a

2. https://www.kaggle.com/c/santander-customer-transaction-prediction/discussion/85360

3. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course

Solution:

You can find my solution on GitHub:

https://github.com/Anirban-arch/Santander-Customer-Transaction-Prediction

Thanks everyone for taking the time out and reading my blog. Open to any suggestions for improvement.

Cheers and have a good day
