Banking on Ensembling in the Santander Kaggle Competition

Taraqur Rahman
The Biased Outliers
Aug 28, 2018

Santander

Santander Bank (formerly known as Sovereign Bank, until the Santander Group bought it in 2008) hosted a challenge on Kaggle to personalize banking for its clients. To achieve this, Santander wants to predict the value of customer transactions so that it can be ready to provide the services its clients might need.

Data Description

The data was anonymized, meaning the features provided were masked with arbitrary strings so that the actual features Santander collects remain confidential. The good thing is that all the features were numerical.

Quick Dirty Linear Regression

The first thing we did was run the raw data through a linear regression to set a baseline for the results. We applied K-fold cross-validation to see what the accuracy and variance would be. The RMSLE for the dirty linear regression was 1.7E15 with a standard deviation of 2.7E15. (A LOT of room for improvement.)
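A minimal sketch of that quick-and-dirty baseline, assuming the competition's train.csv with an ID column and a target column, and writing RMSLE out by hand as a scikit-learn scorer:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import KFold, cross_val_score

# Assumed file and column names: the competition's train.csv with an "ID"
# column and a "target" column holding the transaction value.
train = pd.read_csv("train.csv")
X = train.drop(columns=["ID", "target"])
y = train["target"]

def rmsle(y_true, y_pred):
    # Clip negative predictions so log1p stays defined.
    y_pred = np.clip(y_pred, 0, None)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

rmsle_scorer = make_scorer(rmsle, greater_is_better=False)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Quick and dirty: raw features straight into a linear regression.
scores = -cross_val_score(LinearRegression(), X, y, scoring=rmsle_scorer, cv=cv)
print("RMSLE: %.3g +/- %.3g" % (scores.mean(), scores.std()))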

Data Preprocessing

We knew the RMSLE would be bad, but we had no idea it would be THAT bad. Looking at the numeric values, the first thing to do was to scale the features. Scaling helps the algorithm for two reasons. First, features whose values are in the hundreds will no longer take precedence over features whose values are around 1 or smaller; scaling puts every feature in a similar range, making it an even playing field. Second, it makes gradient descent a lot more efficient, both in runtime and in calculating the gradients.
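A minimal sketch of the scaling step, assuming scikit-learn's StandardScaler and reusing X from the baseline sketch:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Zero mean, unit variance for every column, kept as a DataFrame so the
# original column names survive for the correlation filtering later.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

For the test set, only transform should be called, with the scaler already fitted on the training data.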

Also, there are almost 5,000 features, which is a lot to deal with. Our first reaction was to reduce them using principal component analysis (PCA). The benefit of PCA is that we can keep the components that capture the most variance and disregard those that barely have any. High-variance components typically carry a lot of information; low-variance components do not. Therefore, if we disregard the low-variance components, we can reduce the dimensions without losing too much information. Before running PCA, we had to remove highly correlated features (highly correlated features double-count the same variance). Running PCA with n_components set to None gives the graph below.
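A sketch of the correlation filter and the exploratory PCA run, assuming a 0.95 correlation cutoff (the exact threshold is a judgment call) and matplotlib for the plot, reusing X_scaled from above:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Drop one feature out of every pair whose absolute correlation exceeds the
# (assumed) 0.95 threshold, keeping only the upper triangle so each pair is
# checked once.
corr = X_scaled.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X_scaled.drop(columns=to_drop)

# n_components=None keeps every component so the full explained-variance
# curve can be inspected before choosing a cutoff.
pca = PCA(n_components=None)
pca.fit(X_reduced)

ratios = pca.explained_variance_ratio_
plt.bar(range(len(ratios)), ratios, label="individual explained variance")
plt.step(range(len(ratios)), np.cumsum(ratios), where="mid",
         label="cumulative explained variance")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.legend()
plt.show()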

The cumulative explained variance is shown by the line, and the individual explained variance by the bars. The bars are extremely small (you can see some near the zeroth tick).

The graph shows the cumulative variance starting from the highest-variance component. As you can see, it starts to plateau around 1,000–1,200 components, so we decided to continue with 1,000. (This decision is the art part of data science.)
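Keeping the first 1,000 components is then a small change to the same sketch, reusing X_reduced:

from sklearn.decomposition import PCA

# Keep the 1,000 highest-variance components chosen from the plot above.
pca_1000 = PCA(n_components=1000)
X_pca = pca_1000.fit_transform(X_reduced)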

Model

Linear Regression

Running a linear regression after scaling the features and transforming the data with PCA, we got an RMSLE of 3.8 with a standard deviation of 0.13. The results seem more reasonable now.
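A sketch of that run, reusing X_pca, y, cv and rmsle_scorer from the earlier snippets (strictly, the scaler and PCA should be refit inside each fold, for example with a Pipeline, to avoid leakage):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Same K-fold RMSLE estimate as the baseline, but on the scaled + PCA features.
scores = -cross_val_score(LinearRegression(), X_pca, y, scoring=rmsle_scorer, cv=cv)
print("RMSLE: %.3g +/- %.3g" % (scores.mean(), scores.std()))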

Ridge Regression

To continue with this regression problem, we thought of shrinking the coefficients using ridge regression. We tuned the alpha parameter and settled on 1.

Ridge(alpha=1, normalize=True)

Running the ridge gave us an RMSLE of 1.7 and a standard deviation of 0.0426.
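A sketch of how alpha could be tuned with a small grid search (the grid values here are illustrative; only alpha = 1 comes from the result above), reusing X_pca, y, cv and rmsle_scorer:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# normalize=True from the snippet above is omitted here; recent scikit-learn
# versions removed that argument, and the features were already scaled upstream.
param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(Ridge(), param_grid, scoring=rmsle_scorer, cv=cv)
search.fit(X_pca, y)
print(search.best_params_, -search.best_score_)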

Random Forest

Next up, we thought of running a random forest. Using grid search, we were able to find the best combination of parameters: min_samples_leaf=5, min_samples_split=6, n_estimators=200 (a sketch of such a search is shown below).
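A sketch of such a search, with an illustrative grid centred on those best values, reusing X_pca, y and rmsle_scorer from the earlier snippets:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative grid around the values reported as best above.
param_grid = {
    "n_estimators": [100, 200],
    "min_samples_leaf": [3, 5, 7],
    "min_samples_split": [4, 6, 8],
}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      scoring=rmsle_scorer, cv=3, n_jobs=-1)
search.fit(X_pca, y)
print(search.best_params_)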

RandomForestRegressor(n_estimators=200, min_samples_leaf=5, min_samples_split=6)

This gave us an RMSLE of 1.5 and a standard deviation of 0.0391.

Ensemble

To finish it off, we ensembled the two models, which gave an RMSLE of 1.57 with a standard deviation of 0.0434.
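A sketch of an equal-weight blend of the two models' out-of-fold predictions (one common way to ensemble; reusing X_pca, y and the rmsle helper from the baseline sketch):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Average the two models' validation predictions with equal weights in each fold.
ridge = Ridge(alpha=1)
forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=5,
                               min_samples_split=6, random_state=42)

fold_scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X_pca):
    X_tr, X_val = X_pca[train_idx], X_pca[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
    ridge.fit(X_tr, y_tr)
    forest.fit(X_tr, y_tr)
    blended = (ridge.predict(X_val) + forest.predict(X_val)) / 2
    fold_scores.append(rmsle(y_val, blended))

print(np.mean(fold_scores), np.std(fold_scores))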

Next Steps

The next step we want to take is to run a deep neural network and see if the results improve. With only about 4,000 examples, we thought deep learning was not necessary, but it is still worth a try.
