Customer Transaction Prediction

Vashist Narayan Singh
Published in Analytics Vidhya · Dec 12, 2019

As the title says, this blog is about a Kaggle competition: Santander Customer Transaction Prediction. For any finance company, one of the most crucial pieces of information is whether a customer is going to use their services in the future. Santander, an online bank, posed exactly this challenge: predict whether a customer will make a transaction in the future or not. I will walk you through how you can solve this challenge and the feature engineering I did to reach an AUC of 90% and a rank in the top 4%.

Problem Statement

The problem statement of this Kaggle challenge is to predict whether a customer will make a future transaction or not, irrespective of the amount of money transacted. The constraint is that we should give probability-based predictions.

Machine learning problem statement

So far we have seen what the problem is and what its constraints are, but all of that was from a business point of view. To solve this challenge with machine learning we have to transform it into a machine learning problem statement, and this is not limited to this particular challenge: before solving any real-world problem we first have to frame it as a machine learning problem. For this challenge, our ML problem statement is: “This is a classical binary classification problem where we have to predict whether the customer will make a future transaction or not, with AUC as the evaluation metric.”
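To make the metric concrete, here is a tiny, self-contained sketch (with made-up labels and probabilities, not the competition data) of how AUC is computed from predicted probabilities with scikit-learn:

import numpy as np
from sklearn.metrics import roc_auc_score
# Toy ground-truth labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.3, 0.9])
# AUC is computed from probabilities rather than hard 0/1 labels,
# which is why the challenge asks for probability-based predictions
print("AUC:", roc_auc_score(y_true, y_prob))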

Dataset

The dataset is anonymized, so we cannot know which feature is what. There are a total of 200 features in this dataset along with the ID_code and target columns. The target column contains the values 0 and 1, where 0 means the customer will not make a transaction and 1 means the customer will.
You can download the dataset from the following link:

https://www.kaggle.com/c/santander-customer-transaction-prediction/

Importing the libraries

By now you are fully aware of the problem statement and its constraints, so let's get our hands dirty and start with the coding part. First, we import all the necessary libraries.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
import lightgbm as lgb
import timeit
import time

Exploratory Data Analysis

Now that the libraries are loaded, let's start with the exploratory data analysis (EDA). EDA is the most important step in solving any machine learning problem: it helps you know your dataset more deeply and helps you derive new features so your model can learn better. We start our EDA by loading the data.

data = pd.read_csv('train.csv')  # train data
data_test = pd.read_csv('test.csv')  # test data
data.head()

After loading the train and test data I rearranged the columns, moved the target column to the end, and saved the result into a new CSV file named data_train.csv.
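The rearrangement code itself isn't shown in the post; a minimal sketch of what it might look like, assuming the column names from train.csv, is:

# Move 'target' to the end: ID_code, var_0 ... var_199, target
cols = [c for c in data.columns if c != 'target'] + ['target']
data_train = data[cols]
# Save the rearranged frame so it can be reloaded later as data_train.csv
data_train.to_csv('data_train.csv', index=False)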

Checking the balance of the dataset

data_train['target'].value_counts()
ax = sns.countplot('target', data=data_train)
print("percentage of data belonging to class 0:", data_train['target'].value_counts()[0]*100/200000, "%")
print("percentage of data belonging to class 1:", data_train['target'].value_counts()[1]*100/200000, "%")
ax.plot()

As we can see, around 90% of the data belongs to class 0 and only 10% to class 1, hence we can conclude that the dataset is imbalanced.

Checking for the null values

data_train.isnull().sum()

A good thing for us: as we can see, there are no null values in any feature, so no missing-value imputation is needed.
Now let's check how each feature is distributed for the target values 0 and 1. For that I have written the following function, and since we have 200 features we plot the distributions in two parts: the first 100 features, then the remaining 100.

def feature_distribution(data_1, data_2, target_0, target_1, features_list):
    # Set the style of the plot and its grid
    sns.set_style('whitegrid')
    plt.figure()  # Initialize the plt figure object
    # Create the subplot grid and set its figure size and row/col counts
    fig, ax = plt.subplots(10, 10, figsize=(18, 22))
    for plot_count, feature in enumerate(features_list):
        # One subplot per feature
        plt.subplot(10, 10, plot_count + 1)
        # Plot the pdf of the feature for each target value
        sns.distplot(data_1[feature], hist=False, label=target_0)
        sns.distplot(data_2[feature], hist=False, label=target_1)
        plt.xlabel(feature, fontsize=9)  # Set the x-axis label
        locs, labels = plt.xticks()
        # Set the tick parameters for the x and y axes
        plt.tick_params(axis='x', which='major', labelsize=6, pad=-6)
        plt.tick_params(axis='y', which='major', labelsize=6)
    plt.show()

# Distribution for the first 100 features
target_0_data = data_train.loc[data_train['target'] == 0]
target_1_data = data_train.loc[data_train['target'] == 1]
features = data_train.columns.values[1:101]
feature_distribution(target_0_data, target_1_data, '0', '1', features)
Sample distribution of the first 100 features towards the target classes

Similarly, we did this for the remaining 100 features and came to the following conclusions:
1) Looking at the distribution of each feature for the two target values, I found that most features have different distributions for the two classes.
2) We can also say that some features are quite close to a normal distribution, not completely, but nearly.
3) Hence I can say that some kind of preprocessing has been done on the data.

Let's check the distribution of the mean and std of the data
As we did with the individual features, let's check the distribution of the mean and standard deviation of the data.

plt.figure(figsize=(16,6))
sns.set_style('whitegrid')
features = data_train.columns.values[1:201]  # the 200 var_ feature columns (ID_code and target excluded)
plt.title("Distribution of mean values per row in the data")
plt.xlabel('mean value')
plt.ylabel('pdf value')
sns.distplot(data_train[features].mean(axis=1), color="green", kde=True, bins=120)
plt.show()
Distribution of mean values per row in data

The following observations can be made from the row-wise mean graph:
1) The graph shows the distribution of the per-row mean of the features, and it seems to roughly follow a Gaussian shape.
2) The distribution looks approximately Gaussian with a mean value of 6.7342.
3) From the graph we can say that around 80% of the rows have a mean between 6.5 and 7.0.

Similarly, we plot the distribution of mean values per column in the data and get the following graph (a sketch of the code is shown below).
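The column-wise plotting code isn't included in the post; a minimal sketch, assuming the same data_train frame and features list as above, would be:

plt.figure(figsize=(16, 6))
plt.title("Distribution of mean values per column in the data")
plt.xlabel('mean value')
plt.ylabel('pdf value')
# axis=0 gives one mean per feature column instead of one per row
sns.distplot(data_train[features].mean(axis=0), color="blue", kde=True, bins=120)
plt.show()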

Distribution of mean values per col in data

The following observations can be made from the column-wise mean graph:
1) The graph shows the distribution of the mean of each feature, computed column-wise.
2) The column-wise mean distribution is not Gaussian.
3) The majority of columns have a mean value between -10 and 20.

Similarly, I plot the distribution of the standard deviation along the rows and columns of the data and get the following plots and observations (a code sketch is shown below).
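Again, the code for these plots isn't shown; a minimal sketch covering both the row-wise and column-wise standard deviation, under the same assumptions as before, is:

for axis, label in [(1, "row"), (0, "col")]:
    plt.figure(figsize=(16, 6))
    plt.title("Distribution of std values per " + label + " in the data")
    plt.xlabel('std value')
    plt.ylabel('pdf value')
    # axis=1 gives one std per row, axis=0 one std per feature column
    sns.distplot(data_train[features].std(axis=axis), color="red", kde=True, bins=120)
    plt.show()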

Distribution of std values per row in the data

The observations were as follows:
1) We can see that the row-wise standard deviation distribution also roughly follows a Gaussian shape, not exactly, but judging by the shape of the curve.
2) Around 60% of the rows have a standard deviation in the range of 9.3 to 10.

Distribution of std values per col in the data

Observations
1) From the graph we can say that the column-wise standard deviation of the features does not look Gaussian; it seems to come from some other distribution.
2) A large number of features have a standard deviation in the range of 0 to 6.
3) The minimum standard deviation is around 0 and the maximum is around 21.

Distribution of the mean values in the dataset, grouped by the value of the target
Here I repeat the above EDA with a twist: the means are grouped by the value of the target (a code sketch follows).
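The grouping code isn't shown in the post; a minimal sketch of the row-wise version, assuming the same data_train and features as above, is:

t0 = data_train.loc[data_train['target'] == 0]
t1 = data_train.loc[data_train['target'] == 1]
plt.figure(figsize=(16, 6))
plt.title("Distribution of mean values per row, grouped by target")
plt.xlabel('mean value')
plt.ylabel('pdf value')
# One curve per target class, computed over the same feature columns
sns.distplot(t0[features].mean(axis=1), color="green", kde=True, bins=120, label='target = 0')
sns.distplot(t1[features].mean(axis=1), color="blue", kde=True, bins=120, label='target = 1')
plt.legend()
plt.show()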

Distribution of mean values per row in the dataset grouped by the value of the target

Observations:
1) The graph shows the row-wise mean distribution for each target class.
2) The distribution for each class looks quite similar.
3) Hence the features will do well in identifying the target class.

Distribution of mean values per col in the data set grouped by the value of the target

Observation:
1) The graph shows that both distributions look quite similar.
2) All features will do well in identifying the target class.
3) The majority of feature means lie in the range of -10 to 20.

Distribution of the min and max values in the data, both row-wise and column-wise
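The plotting code for these four graphs isn't shown either; a minimal sketch for the row-wise and column-wise min (the max plots are analogous, with .max() in place of .min()), under the same assumptions, is:

for axis, label in [(1, "row"), (0, "col")]:
    plt.figure(figsize=(16, 6))
    plt.title("Distribution of min values per " + label + " in the data")
    plt.xlabel('min value')
    plt.ylabel('pdf value')
    sns.distplot(data_train[features].min(axis=axis), color="orange", kde=True, bins=120)
    plt.show()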

Distribution of min values per row in the data

The observations are as follows:
1) The graph shows the distribution of the row-wise min values.
2) The plot looks skewed, with its bulk towards the right side.
3) The majority of the min values lie in the range of -40 to -20.

Distribution of min values per col in the data

Observations:
1) The graph shows the column-wise min value distribution of the features.
2) We observe values as low as about -80, since the long tail is on the lower side.

Distribution of max value per row in the data

Observations:
1) The graph shows the row-wise distribution of the max values.
2) We can observe max values of around 70, as the long tail is on the right of the graph.
3) The graph is skewed, with the long tail on the right side.

Distribution of max values per col in the data

Observation:
1) The graph shows the column-wise distribution of the max value of each feature.
2) We can observe max values of around 80, as the long tail of the graph is on the right side.

Now let’s check the correlation between the features
So far all of my EDA has been based on graphical distributions; since the data contains only numerical features, the correlation between the features should also be checked. So let's do it.

features = data_train.columns.values[1:201]  # the 200 var_ feature columns (ID_code and target excluded)
# Calculate the absolute correlation of every feature pair and sort it in ascending order
cor_data = data_train[features].corr().abs().unstack().sort_values(kind="quicksort").reset_index()
# Drop the self-correlations (each feature paired with itself)
cor_data = cor_data[cor_data['level_0'] != cor_data['level_1']]
The most correlated feature pairs
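The table came from inspecting cor_data; since each pair appears twice in the unstacked frame, a quick way to see the most correlated pairs is to look at the tail of the sorted frame, for example:

# The last rows hold the highest absolute correlations
print(cor_data.tail(10))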

Training a model without any feature engineering

With the correlation check my EDA is done, but you can certainly explore further and find more interesting information about the data. One way to establish a baseline for your model is to train a model on the raw data without doing any feature engineering, and in this section of the blog that is exactly what I do.
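The split that produces X_train, y_train, X_cv, and y_cv isn't shown in the post; a minimal sketch of how such a split could be created (a stratified 80/20 hold-out is my assumption, not necessarily what the author used) is:

# Separate features and target, then make a stratified hold-out split
features = [col for col in data_train.columns if col not in ['ID_code', 'target']]
X_train, X_cv, y_train, y_cv = train_test_split(
    data_train[features], data_train['target'],
    test_size=0.2, stratify=data_train['target'], random_state=42)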

start = timeit.default_timer()
lgb_model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', feature_fraction=0.05,
                               class_weight='balanced', num_leaves=8, n_estimators=2000, max_depth=5,
                               learning_rate=0.05, metric='auc', bagging_fraction=0.4, n_jobs=-1)
lgb_model.fit(X_train, y_train)
# batch_predict is a helper (defined by the author, not shown here) that returns prediction scores in batches
predict_y_train = batch_predict(lgb_model, X_train)
predict_y_cv = batch_predict(lgb_model, X_cv)
train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, predict_y_train)
test_fpr, test_tpr, te_thresholds = roc_curve(y_cv, predict_y_cv)
plt.plot(train_fpr, train_tpr, label="train AUC = " + str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC = " + str(auc(test_fpr, test_tpr)))
plt.legend()
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()
stop = timeit.default_timer()
print('Time in mins: ', (stop - start) / 60)
AUC plot of train and test

Observation
A test AUC of 89.52% is achieved without any feature engineering; I aim to increase it further using some interesting feature engineering.

Feature Engineering

From the EDA we can see that the mean, std, min, max, sum, skew, kurtosis, and median can be used as good features.

%%time
# https://www.youtube.com/watch?v=LEWpRlaEJO8
idx = data.columns.values[2:202]  # names of the 200 var_ columns from the original train frame
for dataFrame in [data_test, data_train]:
    dataFrame['sum'] = dataFrame[idx].sum(axis=1)
    dataFrame['min'] = dataFrame[idx].min(axis=1)
    dataFrame['max'] = dataFrame[idx].max(axis=1)
    dataFrame['mean'] = dataFrame[idx].mean(axis=1)
    dataFrame['std'] = dataFrame[idx].std(axis=1)
    dataFrame['skew'] = dataFrame[idx].skew(axis=1)
    dataFrame['kurt'] = dataFrame[idx].kurtosis(axis=1)
    dataFrame['med'] = dataFrame[idx].median(axis=1)
The newly engineered aggregate features

One more interesting set of features can be created by rounding the values of each feature.

%%time
# https://www.geeksforgeeks.org/numpy-round_-python/
features_value = [col for col in data_train.columns if col not in ['ID_code', 'target']]
# Round the value of each column to 2 and 1 decimal places and create new features from them
for feature in features_value:
    data_train['round_2' + feature] = np.round(data_train[feature], 2)
    data_test['round_2' + feature] = np.round(data_test[feature], 2)
    data_train['round_1' + feature] = np.round(data_train[feature], 1)
    data_test['round_1' + feature] = np.round(data_test[feature], 1)
Some of the rounded features

After this feature engineering the data grows to 626 columns: the 200 original features plus the 8 aggregates give 208 base features, and each of them also gets a 1-decimal and a 2-decimal rounded copy (208 × 3 = 624 feature columns), plus the ID_code and target columns.

Final Model Training

Once the new features are created it is time to train a model, check the AUC, and see whether I managed to push it above 89.52%.

parameter = {
    'bagging_freq': 5,
    'bagging_fraction': 0.4,
    'boost_from_average': 'false',
    'boost': 'gbdt',
    'feature_fraction': 0.05,
    'learning_rate': 0.01,
    'max_depth': -1,
    'metric': 'auc',
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'num_leaves': 13,
    'num_threads': 8,
    'tree_learner': 'serial',
    'objective': 'binary',
    'verbosity': 1
}
# https://www.kaggle.com/adrianlievano/light-gbm-with-stratified-kfold
# Getting all the feature names except ID_code and target
features = [col for col in data_train_final_.columns if col not in ['ID_code', 'target']]
# Initializing the K-Fold object (random_state has no effect when shuffle=False, so it is omitted)
K_folds = StratifiedKFold(n_splits=10, shuffle=False)
# Empty array of length x in which we store the prediction for every validation fold
val_pred = np.zeros(len(x))
# In this we keep the fold-averaged predictions for the test data
predictions_test = np.zeros(len(data_test_final_))
# In this loop we train and predict for each fold; trn_idx and val_idx select the train and validation rows
for n_fold, (trn_idx, val_idx) in enumerate(K_folds.split(x.values, y.values)):
    print("Fold {}".format(n_fold))
    # Build the LightGBM datasets for the train and validation parts of this fold
    train_data = lgb.Dataset(x.iloc[trn_idx][features], label=y.iloc[trn_idx])
    valid_data = lgb.Dataset(x.iloc[val_idx][features], label=y.iloc[val_idx])
    # Train the LightGBM model with early stopping on the validation set
    num_round = 1000000
    classifier = lgb.train(parameter, train_data, num_round,
                           valid_sets=[train_data, valid_data],
                           verbose_eval=1000, early_stopping_rounds=3000)
    # Predict on the validation data of this fold
    val_pred[val_idx] = classifier.predict(x.iloc[val_idx][features], num_iteration=classifier.best_iteration)
    # Predict on the test data and average across folds
    predictions_test += classifier.predict(data_test_final_[features],
                                           num_iteration=classifier.best_iteration) / K_folds.n_splits
print("CV score: {:<8.5f}".format(roc_auc_score(y, val_pred)))

As you can see, a CV AUC score of about 90% is achieved, so it's time to make a submission on Kaggle and check the score there.

Creating the submission CSV

sub_df = pd.DataFrame({"ID_code": data_test["ID_code"].values})
sub_df["target"] = predictions_test  # the fold-averaged test predictions from the loop above
sub_df.to_csv("submission.csv", index=False)

Results

After creating the submission file it's time to submit it on Kaggle and see the result, so keep your fingers crossed.

Result of kaggle submission

As you can see, I managed to achieve an AUC of 0.90041 on the Kaggle test data. And as we know, in machine learning no solution is ever perfect; there is always scope for improvement in every model, and my solution can be improved further as well.

Scope of improvement

You can improve this solution further with some additional feature engineering. One option is frequency encoding, and another is removing the synthetic test data, thanks to the awesome technique shared by YaG320 in his kernel https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split. You can also perform data augmentation and improve your AUC to around 92%.
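Frequency encoding isn't covered in the post itself; a minimal sketch of one common way to add it for this data (counting value occurrences over train and test combined is my assumption) would be:

def add_frequency_features(train_df, test_df, columns):
    # For each column, add a '<col>_freq' feature holding how often that value occurs
    combined = pd.concat([train_df[columns], test_df[columns]], axis=0)
    for col in columns:
        counts = combined[col].value_counts()
        train_df[col + '_freq'] = train_df[col].map(counts)
        test_df[col + '_freq'] = test_df[col].map(counts)
    return train_df, test_df

# Example usage on the 200 raw var_ columns (hypothetical)
var_cols = [c for c in data_train.columns if c.startswith('var_')]
data_train, data_test = add_frequency_features(data_train, data_test, var_cols)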

References

https://www.kaggle.com/c/santander-customer-transaction-prediction/overview/evaluation
https://www.youtube.com/watch?v=LEWpRlaEJO8
https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split

You can connect with me on LinkedIn
https://www.linkedin.com/in/vashistnarayan-singh/
