Santander Customer Transaction Prediction: An End-to-End Machine Learning Project

raghunandepu · Published in Analytics Vidhya · 7 min read · Nov 11, 2019

Can you identify who will make a transaction?

Table of Contents:

  • Problem statement
  • Data Acquisition
  • Evaluation Metric
  • Data Description
  • Exploratory data analysis
  • Feature engineering
  • Solution 1: Light GBM Model with simple features
  • Solution 2: Feature Engineering & LightGBM model with additional features
  • Conclusion
  • Procedure Followed
  • Model performance comparison

Problem Statement:

In this challenge, Santander invites Kagglers to help them identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data we have available to solve this problem.

The data is anonymized: each row contains 200 numerical values identified only by a number.

We will explore the data, prepare it for a model, train a model and predict the target value for the test set, then prepare a submission.

Data Acquisition:

This dataset comes from a Kaggle competition. Download it from the source below.

Source: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

Evaluation Metric:

Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.

Submission File

For each Id in the test set, you must make a binary prediction of the target variable. The file should contain a header and have the following format:

ID_code,target
test_0,0
test_1,1
test_2,0
etc.

Data Description

You are provided with an anonymized dataset containing numeric feature variables, the binary target column, and a string ID_code column.

The task is to predict the value of the target column in the test set.

File descriptions

  • train.csv — the training set.
  • test.csv — the test set. The test set contains some rows which are not included in scoring.

Exploratory data analysis

Let’s load the train and test files.
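
A minimal sketch of loading the files with pandas (assuming the CSVs sit in the working directory):

import pandas as pd

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# (200000, 202) and (200000, 201)
print(train_df.shape, test_df.shape)
train_df.head()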

Both train and test dataframes have 200000 entries. Train has 202 columns and test has 201 columns.

Train contains:

  • ID_code (string);
  • target;
  • 200 numerical variables, named from var_0 to var_199;

Test contains:

  • ID_code (string);
  • 200 numerical variables, named from var_0 to var_199;

Check for missing data:

Let’s check if there is any missing data.

We define a helper function missing_data and apply it to the train and test data.
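
A sketch of such a helper, assuming the dataframes loaded above (the exact implementation in the original notebook may differ slightly):

import pandas as pd

def missing_data(df):
    # Count and percentage of missing values per column, plus the dtype
    total = df.isnull().sum()
    percent = df.isnull().sum() / df.isnull().count() * 100
    table = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    table['Types'] = df.dtypes.values
    return table.transpose()

missing_data(train_df)
missing_data(test_df)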

There is no missing data in either the train or the test dataset.

Let’s check the numerical values in train and test sets.
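
pandas' describe() gives the per-column summary used for the observations below:

# count, mean, std, min, quartiles and max for every numeric column
train_df.describe()
test_df.describe()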

Observations:

  • the standard deviation is relatively large for both train and test variables
  • min, max, mean, and std values for the train and test data look quite close
  • mean values are distributed over a large range

Let’s check the distribution of the target value in the train data set.
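
A quick way to look at the class balance (seaborn and matplotlib assumed to be available):

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='target', data=train_df)
plt.title('Target distribution in the train set')
plt.show()

# Share of each class
print(train_df['target'].value_counts(normalize=True))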

Observation:

We can see that the data is imbalanced: roughly 90% of the rows have a target value of 0.

Distribution of Mean and Standard Deviation:

Distribution of skew and kurtosis:
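
One way to produce these plots is to compute each statistic per row for the two target classes and overlay the densities (a per-column version is analogous); a sketch:

import seaborn as sns
import matplotlib.pyplot as plt

var_cols = [c for c in train_df.columns if c.startswith('var_')]
t0 = train_df[train_df['target'] == 0]
t1 = train_df[train_df['target'] == 1]

for stat in ['mean', 'std', 'skew', 'kurt']:
    plt.figure(figsize=(8, 4))
    sns.kdeplot(getattr(t0[var_cols], stat)(axis=1), label='target = 0')
    sns.kdeplot(getattr(t1[var_cols], stat)(axis=1), label='target = 1')
    plt.title('Distribution of per-row ' + stat + ' values')
    plt.legend()
    plt.show()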

Correlation of Features:

The correlation between the features is very small.
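
One way to verify this claim is to list the largest absolute pairwise correlations:

# Pairwise correlations between the 200 variables on the train set
var_cols = [c for c in train_df.columns if c.startswith('var_')]
corr = train_df[var_cols].corr().abs().unstack().sort_values().reset_index()
corr = corr[corr['level_0'] != corr['level_1']]
print(corr.tail(10))  # even the largest values are close to zero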

Duplicate values:

Let us check whether there are any duplicate values in the train and test data sets.
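
A sketch of the check: for every feature, find its most frequent value and how many times it occurs:

import pandas as pd

var_cols = [c for c in train_df.columns if c.startswith('var_')]

def top_duplicates(df, n=15):
    rows = []
    for col in var_cols:
        counts = df[col].value_counts()
        rows.append((col, counts.index[0], counts.iloc[0]))
    out = pd.DataFrame(rows, columns=['Feature', 'Value', 'Count'])
    return out.sort_values('Count', ascending=False).head(n)

top_duplicates(train_df)
top_duplicates(test_df)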

We can see that there are duplicates; the 15 features with the most duplicated values in the train and test sets are shown above.

Observation: the same columns in train and test have the same or a very similar number of duplicates.

Feature engineering

Let us create simple row-wise aggregate features such as sum, min, max, mean, std, skew, kurt, and median, as below.
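
A sketch of these row-wise aggregates, computed over the 200 original variables:

var_cols = [c for c in train_df.columns if c.startswith('var_')]

for df in [train_df, test_df]:
    df['sum'] = df[var_cols].sum(axis=1)
    df['min'] = df[var_cols].min(axis=1)
    df['max'] = df[var_cols].max(axis=1)
    df['mean'] = df[var_cols].mean(axis=1)
    df['std'] = df[var_cols].std(axis=1)
    df['skew'] = df[var_cols].skew(axis=1)
    df['kurt'] = df[var_cols].kurtosis(axis=1)
    df['med'] = df[var_cols].median(axis=1)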

Let’s check the newly created features:

Let’s check the distribution of these newly engineered features.

Distribution of new features in train data set
Distribution of new features in test data set

Rounded features:

Let us create some rounded features as below.
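
One possible version, rounding each variable to one and two decimal places (the exact rounding used in the original notebook may differ):

import numpy as np

var_cols = ['var_' + str(i) for i in range(200)]

for df in [train_df, test_df]:
    for col in var_cols:
        df[col + '_r1'] = np.round(df[col], 1)
        df[col + '_r2'] = np.round(df[col], 2)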

Let us see what these features look like.

Let’s now build the machine learning model.

Solution 1: Light GBM Model with simple features:

What is Light GBM?

LightGBM is a gradient boosting framework that uses tree-based learning algorithms.

How does it differ from other tree-based algorithms?

LightGBM grows trees leaf-wise (vertically), while most other boosting algorithms grow them level-wise (horizontally). It chooses the leaf with the maximum delta loss to grow, so when growing to the same number of leaves, a leaf-wise algorithm can reduce the loss more than a level-wise one.

Hyperparameters of the LightGBM model:
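
The exact values used in the notebook are not reproduced here; a representative parameter set for this competition looks roughly like this:

param = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting': 'gbdt',
    'learning_rate': 0.01,
    'num_leaves': 13,
    'max_depth': -1,
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'feature_fraction': 0.05,
    'bagging_fraction': 0.4,
    'bagging_freq': 5,
    'boost_from_average': False,
    'verbosity': 1,
}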

Let’s train the model with the hyperparameters selected above.
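
A sketch of the cross-validated training loop, reusing param from above (the fold count and early-stopping settings are assumptions):

import lightgbm as lgb
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

features = [c for c in train_df.columns if c not in ['ID_code', 'target']]
target = train_df['target']

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(train_df))
predictions = np.zeros(len(test_df))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_df[features], target)):
    trn_data = lgb.Dataset(train_df.iloc[trn_idx][features], label=target.iloc[trn_idx])
    val_data = lgb.Dataset(train_df.iloc[val_idx][features], label=target.iloc[val_idx])
    clf = lgb.train(param, trn_data, num_boost_round=100000,
                    valid_sets=[trn_data, val_data],
                    callbacks=[lgb.early_stopping(3000), lgb.log_evaluation(1000)])
    oof[val_idx] = clf.predict(train_df.iloc[val_idx][features],
                               num_iteration=clf.best_iteration)
    predictions += clf.predict(test_df[features],
                               num_iteration=clf.best_iteration) / folds.n_splits

print('CV AUC: {:.5f}'.format(roc_auc_score(target, oof)))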

Feature Importance:

Some of the most important features observed from this model are shown in the feature importance plot below.
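
A quick way to inspect them, using the booster from the last fold (illustrative only):

import pandas as pd
import matplotlib.pyplot as plt

imp = pd.DataFrame({'feature': features,
                    'importance': clf.feature_importance()})
top = imp.sort_values('importance', ascending=False).head(20)
top.plot.barh(x='feature', y='importance', figsize=(8, 6), legend=False)
plt.gca().invert_yaxis()
plt.title('Top 20 features by importance')
plt.show()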

Prediction on test data:

Let us now predict the results on test data.
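
The averaged fold predictions are written out in the submission format shown earlier (the file name is arbitrary):

submission = pd.DataFrame({'ID_code': test_df['ID_code'],
                           'target': predictions})
submission.to_csv('submission_lgb_simple_features.csv', index=False)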

RESULT: Leaderboard score: AUC 0.90014 (rank 335, top 3.8%)

Solution 2: Feature Engineering & LightGBM model with additional features

Feature Engineering and Data Augmentation

1. Separating the Real/Synthetic Test Data and Magic Features

We use the count of unique values in a row to identify synthetic samples: if a row has at least one value that is unique within its feature column, it is real; otherwise it is synthetic. This technique was shared by YaG320 in the kernel "List of Fake Samples and Public/Private LB split" (https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split), and it successfully identifies the synthetic samples in the test set; in this way the unusual bumps on the distribution peaks of the test set features are accounted for. The magic features are then extracted from the combination of the training set and the real samples in the test set.
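
A sketch of the idea, closely following the referenced kernel (variable names and the count-feature formulation are illustrative):

import numpy as np
import pandas as pd

var_cols = ['var_' + str(i) for i in range(200)]
test_values = test_df[var_cols].values

# For each column, mark the rows whose value occurs exactly once in that column
unique_count = np.zeros_like(test_values)
for col in range(test_values.shape[1]):
    _, index_, count_ = np.unique(test_values[:, col],
                                  return_index=True, return_counts=True)
    unique_count[index_[count_ == 1], col] += 1

# Rows with at least one unique value are real; the rest are synthetic
real_idx = np.argwhere(np.sum(unique_count, axis=1) > 0)[:, 0]
synthetic_idx = np.argwhere(np.sum(unique_count, axis=1) == 0)[:, 0]
print(len(real_idx), len(synthetic_idx))  # about 100,000 each

# One common form of the "magic" features: per-column value counts,
# computed over the train set plus only the real test rows
full = pd.concat([train_df[var_cols], test_df.iloc[real_idx][var_cols]])
for col in var_cols:
    counts = full[col].value_counts()
    train_df[col + '_count'] = train_df[col].map(counts)
    test_df[col + '_count'] = test_df[col].map(counts)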

2. Data Augmentation

Data augmentation means increasing the number of data points. Because the data is imbalanced, oversampling it increases the CV and leaderboard scores significantly. This oversampling technique was shared by Jiwei Liu in this kernel: https://www.kaggle.com/jiweiliu/lgb-2-leaves-augment.
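
A sketch of the augmentation function, closely following that kernel: within each class, every feature column is shuffled independently across rows, which creates new plausible samples because the features are treated as (nearly) independent:

import numpy as np

def augment(x, y, t=2):
    # x: 2-D numpy array of features, y: 1-D numpy array of 0/1 targets
    xs, xn = [], []
    for _ in range(t):                 # oversample the rare positive class t times
        x1 = x[y == 1].copy()
        ids = np.arange(x1.shape[0])
        for c in range(x1.shape[1]):
            np.random.shuffle(ids)
            x1[:, c] = x1[ids, c]
        xs.append(x1)
    for _ in range(t // 2):            # oversample the negative class less aggressively
        x1 = x[y == 0].copy()
        ids = np.arange(x1.shape[0])
        for c in range(x1.shape[1]):
            np.random.shuffle(ids)
            x1[:, c] = x1[ids, c]
        xn.append(x1)
    xs = np.vstack(xs)
    xn = np.vstack(xn)
    x = np.vstack([x, xs, xn])
    y = np.concatenate([y, np.ones(len(xs)), np.zeros(len(xn))])
    return x, y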

Let us now implement these ideas with the LightGBM model.

Model — LightGBM Gradient Boosting Decision Tree

Hyperparameters of the model:

Implementation:
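
Folding the augmentation into the cross-validation loop from Solution 1 might look like this (it reuses param, augment, and the imports above; fold and stopping settings are again assumptions):

features = [c for c in train_df.columns if c not in ['ID_code', 'target']]
target = train_df['target']
X_test = test_df[features].values

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(train_df))
predictions = np.zeros(len(test_df))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_df[features], target)):
    X_trn = train_df.iloc[trn_idx][features].values
    y_trn = target.iloc[trn_idx].values
    X_val = train_df.iloc[val_idx][features].values
    y_val = target.iloc[val_idx].values

    # Augment only the training part of the fold, never the validation part
    X_trn, y_trn = augment(X_trn, y_trn)

    trn_data = lgb.Dataset(X_trn, label=y_trn)
    val_data = lgb.Dataset(X_val, label=y_val)
    clf = lgb.train(param, trn_data, num_boost_round=100000,
                    valid_sets=[trn_data, val_data],
                    callbacks=[lgb.early_stopping(3000), lgb.log_evaluation(1000)])
    oof[val_idx] = clf.predict(X_val, num_iteration=clf.best_iteration)
    predictions += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits

print('CV AUC: {:.5f}'.format(roc_auc_score(target, oof)))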

Feature Importance:

Prediction on test data and submission:

RESULT: Leaderboard score: AUC 0.92358 (rank 19, top 1%)

Conclusion:

Procedure Followed

  1. Performed exploratory data analysis on the train and test datasets.
  2. Performed checks for missing values and duplicate data.
  3. Observed various density plots and the distributions of mean, skewness, and kurtosis, for both the train and test sets and for both target values (0 and 1).
  4. Checked whether there was any correlation between the features.
  5. Performed feature engineering and created new features.
  6. Applied a Light Gradient Boosting Machine (LightGBM) model to these features and obtained an AUC of 0.90014.
  7. To further improve the model’s score, used the unique value count in a row to separate real from synthetic test samples, and created magic features for both the train and test sets using this information.
  8. As a data augmentation step, performed oversampling because the data is imbalanced.
  9. Applied the LightGBM model again with the newly created magic features, which improved the AUC score to 0.92358.

Model performance comparison

LightGBM with additional features improved the model score to 0.92358, which is quite close to the top scores on Kaggle’s leaderboard. There is certainly room for further improvement.

Thanks for reading!

REFERENCES:

https://www.kaggle.com/c/santander-customer-transaction-prediction/notebooks
