A machine learning approach to identifying customers of Bank Of Portugal who would subscribe to a term deposit.

Nahom Demessie · Published in Analytics Vidhya · 12 min read · Aug 29, 2020

Oil has reigned for centuries as one of society’s most valuable resources. Throughout history, those who have controlled oil have controlled the economy. However, in today’s “data economy,” it can be argued that data, due to the insight and knowledge that can be extracted from it, is potentially more valuable.¹

Photo by Damir Spanic on Unsplash

The finance industry is among the top industries exploiting the value of big data. With marketing managers increasingly interested in carefully tuning their directed campaigns through rigorous selection of contacts, the Bank of Portugal wants a model that can predict which future clients would subscribe to its term deposit. An effective predictive model would increase campaign efficiency: the bank could identify the customers who are likely to subscribe and direct its marketing efforts to them, making better use of its resources.

The goal of this project is to build such a predictive model using the data collected from customers of the Bank of Portugal.

Description of the Dataset

The Bank of Portugal has collected a huge amount of data that includes customer profiles of clients who subscribed to term deposits and of those who did not. The data includes the following columns.

1 — age (numeric)

2 — job: type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)

3 — marital: marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’; note: ‘divorced’ means divorced or widowed)

4 — education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’)

5 — default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)

6 — housing: has a housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)

7 — loan: has a personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)

The following are related to the last contact of the current campaign:

8 — contact: contact communication type (categorical: ‘cellular’, ‘telephone’)

9 — month: last contact month of the year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)

10 — day_of_week: last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ‘fri’)

11 — duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other attributes:

12 — campaign: number of contacts performed during this campaign and for this client (numeric, includes the last contact)

13 — pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means the client was not previously contacted)

14 — previous: number of contacts performed before this campaign and for this client (numeric)

15 — poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’)

The following are social and economic context attributes:

16 — emp.var.rate: employment variation rate — quarterly indicator (numeric)

17 — cons.price.idx: consumer price index — monthly indicator (numeric)

18 — cons.conf.idx: consumer confidence index — monthly indicator (numeric)

19 — euribor3m: Euribor 3 month rate — daily indicator (numeric)

20 — nr.employed: number of employees — quarterly indicator (numeric)

Output variable (desired target):

21 — y: has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’)

Description of the Classifiers to Be Used

XGBoost — an implementation of gradient-boosted decision trees designed for speed and performance.

Logistic Regression — a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable.

Random Forest — an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (or the mean prediction) of the individual trees.

Multi-Layer Perceptron — a class of feedforward artificial neural networks (ANNs).
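For reference, the four can be instantiated roughly as follows with scikit-learn and the xgboost package. The hyperparameters shown here are illustrative defaults, not necessarily the ones used in the project.

# a minimal sketch of the four classifiers (illustrative hyperparameters)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

candidate_models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Regressor': LogisticRegression(max_iter=1000),
    'XGBoost': XGBClassifier(n_estimators=100, random_state=42),
    'MLP': MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42),
}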

GitHub Link

https://github.com/nahomneg/Bank-Institution-Term-Deposit-Predictive-Model/

Data Processing Class

This class is concerned with all the preprocessing, data exploration (plotting), and feature extraction for the Bank of Portugal data.

Modelling Class

This class is concerned with creating classifiers, creating pipes, k-fold splitting, and comparison of the available classifiers.
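To make the rest of the post easier to follow, here is a rough outline of the public interface of the two classes as they are used below. The bodies are omitted and the exact signatures in the repository may differ slightly.

# rough outline of the two classes as used in this post (signatures approximate)
class PreProcessing:
    def __init__(self, data_frame): ...
    def plot_target_imbalance(self): ...
    def detect_outliers_boxplot(self, columns): ...
    def handle_outliers(self, columns): ...
    def plot_multiple_categorical_against_target(self, columns): ...
    def plot_single_categorical_against_target(self, column): ...
    def plot_distribution(self, columns): ...
    def assign_years(self): ...
    def get_column_transformer(self, categorical, numerical, drop): ...
    def train_test_split(self): ...

class Model:
    def add_classifier(self, classifier, name): ...
    def create_pipes(self, column_transformer): ...
    def get_cross_validation_splitter(self, is_strattified=False): ...
    def compare_models(self, kfold, X, y, scoring='roc_auc'): ...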

Exploratory Data Analysis

Import Necessary Modules and Libraries

# import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import islice
from scipy.stats import binom, norm
import statsmodels.stats.api as sms
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
# import the project's preprocessor and modelling classes
import data
from model import Model

Load the dataset and create instances of the pre-processing and modelling classes.

#load the data set
data_frame = pd.read_csv("dataset/bank-additional-full.csv",sep=';' , engine='python')
# initialize the modelling and preprocessing classes
processor = data.PreProcessing(data_frame)
model = Model()

Class Imbalance

Class imbalance is a major problem in machine learning classification: there is a disproportionate ratio of observations in each class, and many classification algorithms then predict the minority class poorly.

To find out if the data is a victim of class imbalance, I used the plot_target_imbalance() method from the PreProcessing class.

# use the plot_target_imbalance method from the PreProcessing class 
processor.plot_target_imbalance()

The above plot shows the data is suffering from class imbalance: the ‘no’ class appears roughly eight times as often as the ‘yes’ class.
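The same ratio can be double-checked numerically straight from the target column:

# quick numeric check of the imbalance (independent of the plotting helper)
counts = data_frame['y'].value_counts()
print(counts)
print('no/yes ratio: %.1f' % (counts['no'] / counts['yes']))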

Outliers

To identify columns with outliers, I used the detect_outliers_boxplot() method from the PreProcessing class and passed the numerical columns to it.

processor.detect_outliers_boxplot(['duration','age','cons.conf.idx','euribor3m'])

The box plots of the two classes overlap quite a lot for age. This shows age is not a very separating column.

The difference between the medians of ‘yes’ and ‘no’ for euribor3m shows that it is a separating feature.

The above plots also show that duration and age have a significant number of outliers. I used the handle_outliers() method from the PreProcessing class to replace the outliers with their respective column medians.

processor.handle_outliers(['duration','age','cons.conf.idx'])
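As a sketch of what such a method might do internally (an assumption on my part, not the repo’s exact code), the standard approach flags values outside 1.5 × IQR and swaps them for the column median:

# sketch of IQR-based outlier replacement (assumed implementation)
def replace_outliers_with_median(df, column):
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    median = df[column].median()
    # flag values outside the whiskers and overwrite them with the median
    mask = (df[column] < lower) | (df[column] > upper)
    df.loc[mask, column] = median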

Effect of Categorical Variables on the Target

Next, I visualized the effect of each categorical variable on the target variable (‘y’).

I passed some categorical variables to the plot_multiple_categorical_against_target() method of the PreProcessing class.

# use the plot_multiple_categorical_against_target method of the PreProcessing class
# to plot the count of each categorical variable with the target as a hue
processor.plot_multiple_categorical_against_target(['poutcome','marital','loan','housing'])
processor.plot_single_categorical_against_target('education')
processor.plot_single_categorical_against_target('day_of_week')

The above plot shows that the day of the week does not help much when it comes to predicting the target variable.

Effect of Numerical Variables on the Target Using Histograms

Looking at the age plot, the number of people who subscribed to the campaign rises between ages 25 and 40, but not as much as the number of those who didn’t subscribe.

Distribution Plots

Next, I studied the distribution of numerical columns by using the plot_distribution() method and passing it some numerical variables.

processor.plot_distribution(['age','duration','campaign','emp.var.rate','euribor3m'])

Correlation between Numerical Variables

The emp.var.rate, cons.price.idx, euribor3m, and nr.employed features have a very high correlation. The highest correlation is between euribor3m and nr.employed which is 0.95!
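These numbers come from the usual pandas/seaborn combination, roughly:

# correlation matrix of the social/economic columns, shown as a heatmap
numeric_cols = ['emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
corr = data_frame[numeric_cols].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()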

Feature Engineering

I extracted a new feature, ‘Year’, from the month column by making use of the assign_years() method from the PreProcessing class.

# create a new feature 'Year' using assign years method
processor.assign_years()
processor.data_frame.sample(10)
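One plausible implementation of assign_years(), assuming the rows are in chronological order (the underlying UCI data runs from May 2008 to November 2010), increments the year whenever the month index moves backwards; this is my reconstruction, not necessarily the repo’s code:

# assumed logic for assign_years(): rows are chronological, so a drop in the
# month index (e.g. 'dec' followed by 'mar') signals a calendar rollover
month_order = ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']

def assign_years(df, start_year=2008):
    month_idx = df['month'].map(month_order.index)
    year = start_year
    years = []
    prev = month_idx.iloc[0]
    for idx in month_idx:
        if idx < prev:  # month went backwards -> new year
            year += 1
        years.append(year)
        prev = idx
    df['Year'] = years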

One Hot Encoding and Standardization

One-hot encoding is the process of converting categorical variables into a binary, numerical form that ML algorithms can work with, helping them do a better job in prediction.

Data standardization is the process of rescaling numerical columns so that they have a mean value of 0 and a standard deviation of 1. This also helps ML algorithms to do a better job.

I chose standardization over normalization because normalization, which squeezes the data between 0 and 1, partially loses how far values deviate from the mean.
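The two options are easy to compare side by side with scikit-learn’s StandardScaler and MinMaxScaler:

# standardization (zero mean, unit variance) vs. 0-1 normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler

ages = data_frame[['age']]
standardized = StandardScaler().fit_transform(ages)  # mean 0, std 1
normalized = MinMaxScaler().fit_transform(ages)      # squeezed into [0, 1]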

Column Transformers and Pipelines

A recent addition to scikit-learn is the ColumnTransformer class, which allows specifying different transformations per column, making preprocessing a lot less tedious (and a lot easier to read!). The transformations to be applied in this case are one-hot encoding and standardization.

As the name suggests, pipelines allow stacking multiple processes (in this case, the column transformer and the machine learning model) into a single scikit-learn estimator. Pipelines have fit, predict, and score methods just like any other estimator.
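A bare-bones version of what get_column_transformer() builds, with abbreviated column lists (a sketch, not the repo’s exact code), looks like this:

# sketch of a column transformer plus pipeline (column lists abbreviated)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

transformer = ColumnTransformer([
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['job', 'marital', 'education']),
    ('numerical', StandardScaler(), ['age', 'campaign', 'euribor3m']),
])

pipe = Pipeline([
    ('transform', transformer),
    ('classifier', LogisticRegression(max_iter=1000)),
])
# the whole pipe behaves like one estimator:
# pipe.fit(X_train, y_train); pipe.score(X_test, y_test)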

I used the get_column_transformer() method of the PreProcessing class to get a column transformer, which in turn is passed to the pipeline. I dropped duration because it effectively determines the target variable (a duration of zero always means ‘no’), which is not good for training a realistic predictive model. day_of_week is also dropped because it doesn’t have a significant impact on the target.

#a column transformer that handles both one hot encoding and standardization
column_transformer = processor.get_column_transformer(categorical_columns,numerical_columns,['duration','day_of_week'])

Before creating a pipeline, I had to add the classifiers to be passed to the pipeline alongside the column transformer. To add classifiers, I used the add_classifier() method of the Model class.

# add 4 classifier models to be compared
model.add_classifier(model.get_random_forest_classifier(),'Random Forest')
model.add_classifier(model.get_logistic_classifier(),'Logistic Regressor')
model.add_classifier(model.get_xgboost_classifier(),'XGBoost')

model.add_classifier(model.get_multi_perceptron(),'MLP')

#create a pipeline using the transformer and the above classifiers
model.create_pipes(column_transformer)

Splitting the Data, K-Fold, and Cross-Validation

Cross-validation is an essential tool in the data scientist’s toolbox: it makes better use of the available data by rotating which part is held out. The training set is used to train the model, and the validation set is used to validate it on data it has never seen before. I used the following variations of K-fold to come up with training and validation sets.

K-fold involves randomly dividing the dataset into k groups, or folds, of approximately equal size. One fold is kept for validation and the model is trained on the other k-1 folds. The process is repeated k times, and each time a different fold is used for validation.

When splitting the data into folds, each fold should be a good representative of the whole data. This is important when considering data suffering from class imbalance, like the one under consideration here.

Stratified K-fold provides train/validation splits that preserve the percentage of samples for each class of the target.
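In plain scikit-learn terms, the two splitters that get_cross_validation_splitter() presumably wraps are:

# the two scikit-learn splitters behind the helper (5 folds, shuffled)
from sklearn.model_selection import KFold, StratifiedKFold

plain_kfold = KFold(n_splits=5, shuffle=True, random_state=42)
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)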

I split the data into training, testing, and cross-validation sets of 64%, 20%, and 16% respectively. The validation set is split from the training data (initially 80% of the whole) using K-fold. I used the Model class to accomplish this.

#train_test split in 80:20 ratio
X_train,X_test,y_train,y_test= processor.train_test_split()
#use K-fold to split the training data in 80:20 ratio
kfold = model.get_cross_validation_splitter(is_strattified = False)

Model Comparison

I used the compare_models() method of the Model class to compare the 4 classifiers, using 5 folds and taking their average scores on different evaluation metrics.
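Under the hood, a comparison like this boils down to cross_val_score per pipeline; here is a minimal sketch, assuming a dict of named pipelines (my notation, not necessarily the repo’s):

# sketch of compare_models(): average cross-validation score per pipeline
from sklearn.model_selection import cross_val_score

def compare_models_sketch(pipes, kfold, X, y, scoring='roc_auc'):
    results = {'classifiers': [], 'scores': []}
    for name, pipe in pipes.items():
        scores = cross_val_score(pipe, X, y, cv=kfold, scoring=scoring)
        results['classifiers'].append(name)
        results['scores'].append(scores.mean())
    return results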

I used LabelBinarizer to encode which class of the target variable (‘yes’ or ‘no’) is positive. The compare_models() method needs to know which class is positive and which is negative in order to compare the classifiers on recall, precision, and f1_score.

# Label Binarizer to identify which target class is positive
lb = preprocessing.LabelBinarizer()
y_train = np.array([number[0] for number in lb.fit_transform(y_train)])

Then I evaluated the models based on different evaluation metrics.

#compare the models based on roc_auc
models_k_fold = model.compare_models(kfold , X_train, y_train,scoring = 'roc_auc')
#compare the models based on accuracy
models_k_fold_accuracy = model.compare_models(kfold , X_train, y_train,scoring='accuracy')
#compare the models based on recall
models_k_fold_recall = model.compare_models(kfold , X_train, y_train,scoring='recall')
#compare the models based on precision
models_k_fold_precision = model.compare_models(kfold , X_train, y_train,scoring='precision')
#compare the models based on f1_score
models_k_fold_f1 = model.compare_models(kfold , X_train, y_train,scoring='f1')

The compare_models() method of the Model class returns the average scores (over the 5 folds) alongside the names of the classifiers. I used the returned data to create a table that shows the scores of the classifiers on the different metrics.

# build a performance table from the scores of each metric
classifiers = models_k_fold['classifiers']
scores_auc = models_k_fold['scores']
scores_accuracy = models_k_fold_accuracy['scores']
scores_recall = models_k_fold_recall['scores']
scores_precision = models_k_fold_precision['scores']
scores_f1 = models_k_fold_f1['scores']
data = {'Classifier': classifiers, 'Auc': scores_auc, 'Accuracy': scores_accuracy,
        'Recall': scores_recall, 'Precision': scores_precision, 'F1': scores_f1}
# turn the dictionary into a data frame indexed by classifier name
performance_df = pd.DataFrame(data)
performance_df = performance_df.set_index('Classifier')
performance_df
Performance Summary using normal K-Fold

# create a stratified K-fold splitter by passing True to is_strattified
kfold = model.get_cross_validation_splitter(is_strattified = True)

Performance Summary using stratified K-Fold

The above tables summarize the average performance of each classifier on each metric, per K-fold type. The scores of the two K-fold types are fairly similar, but it is safer to rely on the stratified K-fold technique for datasets with high class imbalance like the one under consideration here.

Each of the metrics above is better suited for different situations.

Evaluation Metric Selection

  1. Accuracy — the number of correct predictions made as a ratio of all predictions made.
  2. ROC_AUC — obtained by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings; the area under the resulting curve is the score.
  3. Recall — quantifies the proportion of actual positives the model finds.
  4. Precision — quantifies the proportion of positive-class predictions that actually belong to the positive class.
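All four can be computed directly with sklearn.metrics; a quick sketch, assuming pipe is any fitted pipeline from above and y_test has been binarized to 0/1 the same way as y_train:

# the four metrics via sklearn.metrics (assumes a fitted `pipe` and a 0/1 y_test)
from sklearn.metrics import accuracy_score, roc_auc_score, recall_score, precision_score

y_pred = pipe.predict(X_test)
y_proba = pipe.predict_proba(X_test)[:, 1]  # probability of the positive ('yes') class
print('accuracy :', accuracy_score(y_test, y_pred))
print('roc_auc  :', roc_auc_score(y_test, y_proba))  # ROC_AUC uses scores, not hard labels
print('recall   :', recall_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))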

I chose ROC_AUC as the most reliable metric because of the imbalanced nature of the dataset. Had I used accuracy, even a classifier that always predicts ‘no’ would have scored close to 89% (since ‘no’ outnumbers ‘yes’ roughly eight to one), while still being a useless model.

ROC analysis does not have any bias toward models that perform well on the majority class at the expense of the minority class — a property that is quite attractive when dealing with imbalanced data.

ROC analysis achieves this by looking at both the true positive rate (TPR) and the false positive rate (FPR). The ROC_AUC is high when the curve stays well above the random (diagonal) line, that is, when the TPR is high relative to the FPR across thresholds.

Looking at both tables, I observed that XGBoost is the best performer according to the ROC_AUC with a score of around 0.764.

Conclusion

The main goal of this project was to predict which customers would subscribe to a term deposit with the Bank of Portugal. The project uses data collected by the Bank of Portugal that includes customer profiles of clients who subscribed to term deposits and of those who did not.

The exploratory data analysis showed that the dataset suffers from high class imbalance. It also showed that day_of_week doesn’t help much in the prediction.

After the EDA, the dataset was preprocessed using a one-hot encoder for categorical variables and StandardScaler for numerical columns.

Then I created pipelines of the 4 classifiers to be compared using K-fold cross-validation.

I chose ROC_AUC as the most reliable evaluation metric because of its tolerance to class imbalance. Based on the ROC_AUC score, the XGBoost classifier came out on top with a score of around 0.76.

Gratitude

I put this together as part of Batch 3 of the 10academy.org training, and I would like to express my gratitude to all my fellow learners and staff for their support.

References

[1] Therese Fauerbach, “More Valuable than Oil, Data Reigns in Today’s Data Economy,” The Northridge Group, April 16, 2017. https://www.northridgegroup.com/blog/more-valuable-than-oil-data-reigns-in-todays-data-economy/
