Using Machine Learning to Predict Subscription to Bank Term Deposits for Clients with Python

Bank Marketing with Machine Learning using Scikit-Learn

Emeka Efidi
The Startup
May 15, 2020


“No great marketing decisions have ever been made on qualitative data.” — John Sculley (CEO of Apple Inc.)


Introduction

Marketing to potential clients has always been a crucial challenge for banking institutions. It is no surprise that banks deploy channels such as social media, customer service, digital media and strategic partnerships to reach out to customers. But how can banks market to a specific location, demographic, and segment of society with increased accuracy? With the inception of machine learning, reaching out to specific groups of people has been revolutionized: data and analytics can provide detailed strategies that tell banks which customers are more likely to subscribe to a financial product. In this project on bank marketing with machine learning, I will explain how a particular Portuguese bank can use predictive analytics to prioritize customers who are likely to subscribe to a bank term deposit.

In this project I will demonstrate how to build a model that predicts whether clients will subscribe to a term deposit, using the following steps:

  • Project definition
  • Data exploration
  • Feature engineering
  • Building training/validation/test samples
  • Model selection
  • Model evaluation

You can see my code in the Jupyter Notebook provided on my GitHub (https://github.com/emekaefidi/Bank-Marketing-with-Machine-Learning).

This project was inspired by Andrew Long (check him out: https://towardsdatascience.com/@awlong20).

Project Definition

Predict whether a client will subscribe (yes/no) to a term deposit. This is a binary classification problem.

Data Exploration

The data used in this project originally comes from the UCI Machine Learning Repository (link). The dataset contains over 40,000 records from direct marketing campaigns of a Portuguese banking institution, run between May 2008 and November 2010. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’).

In this project, we are going to use Python to develop a predictive machine learning model. Let’s begin by loading our data and exploring the columns.
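A minimal sketch of this step, assuming the UCI file has been downloaded locally as bank-additional-full.csv (the file name and the semicolon separator are assumptions about how the data is stored):

    import pandas as pd

    # Load the bank marketing data; the UCI file uses ';' as a separator
    df = pd.read_csv('bank-additional-full.csv', sep=';')

    # Inspect the shape and the available columns
    print(df.shape)
    print(df.columns.tolist())
    df.head()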

Looking briefly at the data columns, we can see that there is a mix of numerical and categorical columns. These columns are explained in more detail below:

The most important column here is y, the output variable (desired target): it tells us whether the client subscribed to a term deposit (binary: ‘yes’, ‘no’).

Now let’s define an output variable to use for our binary classification. We will try to predict if a client is likely to subscribe to a term deposit.
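One way to encode this, using the OUTPUT_LABEL column name that appears later in this article:

    # Convert the 'y' column into a binary numeric label: 1 = subscribed, 0 = did not
    df['OUTPUT_LABEL'] = (df['y'] == 'yes').astype(int)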

Let’s define a function in order to calculate the prevalence of population that subscribes to a term deposit.
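A sketch of such a function (the function name is my own):

    def calc_prevalence(y_actual):
        # Fraction of samples that are positive (subscribed to a term deposit)
        return sum(y_actual) / len(y_actual)

    print('prevalence of the positive class: %.3f'
          % calc_prevalence(df['OUTPUT_LABEL'].values))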

Here we see that around 11% of the population has a term deposit. This is known as an imbalanced classification problem so we will address that below.

Digging deeper into the columns, we see there is a mix of categorical (non-numeric) and numerical data. A few things to note:

  • All the data inputted are non-null values, meaning that we have a value for every column
  • age, duration, campaign, pdays, previous, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m and nr.employed are numerical variables
  • default, housing and loan have 3 values each (yes, no and unknown)
  • Output (y) has two values: “yes” and “no”
  • We are discarding duration. This attribute highly affects the output target (e.g., if duration=0 then y=’no’), yet the duration is not known before a call is performed. Also, after the end of the call, y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model

Feature Engineering

In this section, we are going to create features for our machine learning model. In each subsection, we will add new variables to the dataframe and keep track of which columns of the dataframe we are going to use as features for the predictive model. We will divide this section into numerical and categorical features.

Numerical Features

These are numeric data. The numerical columns that we will use can be seen below:
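A sketch of that list, built from the columns discussed above and leaving out duration for the reasons already given (the variable name cols_num is my own):

    cols_num = ['age', 'campaign', 'pdays', 'previous', 'emp.var.rate',
                'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']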

Now, let’s check if there are any missing values in the numerical data.
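A quick check with pandas could look like this:

    # Count missing values in each numerical column
    df[cols_num].isnull().sum()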

Categorical Features

Categorical variables are non-numeric data such as job and education. To turn these non-numeric values into model inputs, the simplest approach is a technique called one-hot encoding, which is explained below.

The first set of categorical data we will work on are these columns:

In one-hot encoding, we create a new column for each unique value in that column. The value of the new column is 1 if the sample has that unique value and 0 otherwise. For example, for the column job, we would create new columns (“job_blue-collar”, “job_entrepreneur”, etc.). If the client’s job is blue-collar, the client gets a 1 under “job_blue-collar” and 0 under the rest of the job columns. To create these one-hot encoded columns, we will use the get_dummies function provided by pandas.

A problem that arises is that by creating a column for each unique value, we end up with correlated columns. That is to say, the value in one column can be figured out by looking at the rest of the columns. For example, if marital is not “married”, “single”, or “divorced”, it must be “unknown”. To fix this, we can use the drop_first option, which drops the first categorical value for each column. Now we are ready to make all of our categorical features.

In order to add the one-hot encoded columns to the dataframe, we use the concat function. axis=1 is used to add the columns.
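A sketch of what the encoding and concatenation could look like; the exact set of categorical columns is my assumption based on the dataset description, and the variable names are my own:

    # Columns with categorical (non-numeric) values (assumed list)
    cols_cat = ['job', 'marital', 'education', 'default', 'housing', 'loan',
                'contact', 'month', 'day_of_week', 'poutcome']

    # One-hot encode, dropping the first level of each column to avoid correlated columns
    df_cat = pd.get_dummies(df[cols_cat], drop_first=True)

    # Add the new one-hot encoded columns to the dataframe
    df = pd.concat([df, df_cat], axis=1)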

Let’s now save the column names of the categorical data to keep track of them.
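For example:

    # Keep track of the engineered categorical feature names
    cols_all_cat = list(df_cat.columns)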

Feature Engineering: Summary

Through this process we created 62 features for the machine learning model. We separated the features into the following:

  • 9 numerical features
  • 53 categorical features

We will create a new dataframe that only has the features and the OUTPUT_LABEL
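A sketch, reusing the column lists defined above:

    # Combine the numerical and engineered categorical feature names
    cols_all = cols_num + cols_all_cat

    # Keep only the features and the label in a new dataframe
    df_data = df[cols_all + ['OUTPUT_LABEL']].copy()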

Building Training/Validation/Test Samples

Up to this point, we have explored our data and created features from the categorical data. It is now time to split our data. We split the data so that we can measure how well the model would do on unseen data. We split it into three parts:

  • Training samples: these samples are used to train the model
  • Validation samples: these samples are held out from the training data and are used to make decisions on how to improve the model
  • Test samples: these samples are held out from all decisions and are used to test (measure) the generalized performance of the model

In this project, we will split into 70% train, 15% validation, and 15% test!

Let’s shuffle the samples using sample in case there was some order (e.g. all positive samples on top). Here n is the number of samples. random_state is just specified so the project is reproducible.

We can use sample again to extract 30% (using frac) of the data to be used for the validation and test splits. An important note is that the validation and test sets should come from similar distributions, and this technique is one way to achieve that.

And now we can split into test and validation using 50% fraction.

The .drop function drops the rows that went into df_test, leaving the rows that were not part of that sample as the validation set. We can use this same idea to get the training data, as sketched below.
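Putting the shuffling and splitting together, a sketch could look like this (the dataframe names are my own):

    # Shuffle the samples; random_state makes the split reproducible
    df_data = df_data.sample(n=len(df_data), random_state=42)
    df_data = df_data.reset_index(drop=True)

    # Hold out 30% of the data for validation and test
    df_valid_test = df_data.sample(frac=0.30, random_state=42)

    # Split the held-out 30% in half: 15% validation, 15% test
    df_test = df_valid_test.sample(frac=0.50, random_state=42)
    df_valid = df_valid_test.drop(df_test.index)

    # The remaining 70% is the training data
    df_train_all = df_data.drop(df_valid_test.index)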

At this point, let’s check what percentage of each group is likely to subscribe to a term deposit. This is known as the prevalence. Ideally, all three groups would have similar prevalence.
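For example, reusing the calc_prevalence helper from earlier:

    for name, split_df in [('train', df_train_all), ('valid', df_valid), ('test', df_test)]:
        print('%s prevalence: %.3f'
              % (name, calc_prevalence(split_df['OUTPUT_LABEL'].values)))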

Now we can see that the prevalence is about the same for each group.

At this point, we might be tempted to drop the training data into a predictive model and see the outcome. However, if we do this, there’s a chance that we will get back a model that is 89% accurate but never catches any of the clients who will subscribe to a term deposit (recall = 0%). How is this possible?

What is happening is that we have an imbalanced dataset with many more negatives than positives, so the model might just assign all samples as negative.

It is best practice to balance the data in some way to give the positives more weight. Three techniques are commonly used:

  • sub-sample the more dominant class: using random subset of the negatives
  • over-sample the imbalanced class: using the same positive samples multiple times
  • create synthetic positive data

Usually, you will want to use the latter two methods if you only have a handful of positive cases. Since we have a few thousand positive cases, let’s use the sub-sampling approach. Here, we will create balanced training, validation and test data sets that have 50% positive and 50% negative samples. You can also try tweaking this ratio to see if you can get an improvement.
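A sketch of the sub-sampling approach; the helper name and the variable names are my own:

    def balance_df(df_in, random_state=42):
        # Sub-sample the negatives so the result is 50% positive / 50% negative
        df_pos = df_in[df_in['OUTPUT_LABEL'] == 1]
        df_neg = df_in[df_in['OUTPUT_LABEL'] == 0]
        df_neg_sub = df_neg.sample(n=len(df_pos), random_state=random_state)
        df_bal = pd.concat([df_pos, df_neg_sub], axis=0)
        return df_bal.sample(frac=1, random_state=random_state).reset_index(drop=True)

    df_train = balance_df(df_train_all)
    df_valid = balance_df(df_valid)
    df_test = balance_df(df_test)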

Most machine learning packages expect an input matrix X and an output vector y, so let’s create those:
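For example, using the splits and column list from the sketches above:

    X_train = df_train[cols_all].values
    X_train_all = df_train_all[cols_all].values
    X_valid = df_valid[cols_all].values
    X_test = df_test[cols_all].values

    y_train = df_train['OUTPUT_LABEL'].values
    y_valid = df_valid['OUTPUT_LABEL'].values
    y_test = df_test['OUTPUT_LABEL'].values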

Machine learning models can run into trouble when the variables are on very different scales (0–100 vs 0–1,000,000). To combat this, we can scale the data. Here we will use scikit-learn’s StandardScaler, which removes the mean and scales to unit variance. I will fit the scaler on all the training data, but you could also use the balanced set if you wanted.

We are going to need this scaler for the test data, so let’s save it using a package called pickle.

Now we can go ahead and transform our data matrices:
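Putting the three steps together (fit the scaler, save it, transform the matrices); the file name scaler.sav is an assumption:

    import pickle
    from sklearn.preprocessing import StandardScaler

    # Fit the scaler on all of the (unbalanced) training data
    scaler = StandardScaler()
    scaler.fit(X_train_all)

    # Save the fitted scaler so the same transformation can be applied to the test data later
    pickle.dump(scaler, open('scaler.sav', 'wb'))

    # Transform the training and validation matrices (the test matrix is left alone for now)
    X_train_tf = scaler.transform(X_train)
    X_valid_tf = scaler.transform(X_valid)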

We won’t transform the test matrix yet, to prevent us from being tempted to look at the performance until we are done with model selection.

Model Selection

Fantastic! We had to do a lot of work to prep the data, which is the norm in data science: you can spend up to 90% of your time cleaning and preparing data before analyzing it!

In this section, we train a few machine learning models and use a few techniques for optimizing them. We will then select the best model based on performance on the validation set.

We will use the following metrics to evaluate the performance of the models: AUC (Area Under the ROC Curve), accuracy, recall, precision, specificity and F1. Helper functions for these are sketched below.

Since we have balanced training data, let’s set our threshold at 0.5 to label a predicted sample as positive.
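A sketch of these helpers built on scikit-learn’s metrics (the function names are my own):

    from sklearn.metrics import (roc_auc_score, accuracy_score, recall_score,
                                 precision_score, f1_score)

    def calc_specificity(y_actual, y_pred, thresh):
        # Specificity: fraction of true negatives correctly predicted as negative
        return sum((y_pred < thresh) & (y_actual == 0)) / sum(y_actual == 0)

    def print_report(y_actual, y_pred, thresh):
        # y_pred holds predicted probabilities; labels are obtained with the threshold
        y_label = (y_pred > thresh).astype(int)
        auc = roc_auc_score(y_actual, y_pred)
        accuracy = accuracy_score(y_actual, y_label)
        recall = recall_score(y_actual, y_label)
        precision = precision_score(y_actual, y_label)
        specificity = calc_specificity(y_actual, y_pred, thresh)
        f1 = f1_score(y_actual, y_label)
        print('AUC:%.3f accuracy:%.3f recall:%.3f precision:%.3f specificity:%.3f F1:%.3f'
              % (auc, accuracy, recall, precision, specificity, f1))

    # With balanced training data, 0.5 is a natural probability threshold
    thresh = 0.5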

Model Selection: Baseline models

In this section, we will first compare the model performance of the following 7 machine learning models using default hyperparameters:

  • K-Nearest Neighbors
  • Logistic Regression
  • Stochastic Gradient Descent
  • Naive Bayes
  • Decision Tree
  • Random Forest
  • Gradient Boosting Classifier

K Nearest Neighbors (KNN)

KNN is one of the simplest machine learning models. It looks at the k closest data points and uses the fraction of those neighbors that are positive as the predicted probability. This model is very easy to understand and versatile, and it makes no assumptions about the structure of the data. KNN is also good for multivariate analysis. A caveat with this algorithm is that it is sensitive to the choice of k and takes a long time to evaluate when the number of training samples is large. We can fit KNN using the following code from scikit-learn:
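A minimal sketch with scikit-learn defaults (the variable name knn is my own):

    from sklearn.neighbors import KNeighborsClassifier

    knn = KNeighborsClassifier()
    knn.fit(X_train_tf, y_train)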

We can evaluate the model performance with the following code:
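For example, using the print_report helper defined earlier:

    # Predicted probability of subscribing for the training and validation sets
    y_train_preds = knn.predict_proba(X_train_tf)[:, 1]
    y_valid_preds = knn.predict_proba(X_valid_tf)[:, 1]

    print('KNN training:')
    print_report(y_train, y_train_preds, thresh)
    print('KNN validation:')
    print_report(y_valid, y_valid_preds, thresh)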

To be brief, we will exclude the evaluation from the remaining models and only show the aggregated results below.

Logistic Regression

Logistic regression is a traditional machine learning model that fits a linear decision boundary between the positive and negative samples. Logistic regression uses a sigmoid function, an “S”-shaped curve, to predict whether the dependent variable is true or false based on the independent variables. One advantage of logistic regression is that the model is interpretable: we know which features are important for predicting positive or negative. Note that the model is sensitive to the scaling of the features, which is why we scaled the features above. We can fit logistic regression using the following code from scikit-learn.
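A minimal sketch with defaults (the variable name lr is my own):

    from sklearn.linear_model import LogisticRegression

    lr = LogisticRegression(random_state=42)
    lr.fit(X_train_tf, y_train)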

Stochastic Gradient Descent

Stochastic gradient descent is similar to logistic regression, but instead of computing the update over the whole dataset at once, it updates the model using individual samples or small batches at a time. This makes it faster than plain logistic regression on large datasets, since it never needs to process the whole dataset in a single pass. We can fit stochastic gradient descent using the following code from scikit-learn.
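A minimal sketch (the variable name sgdc is my own; the loss is set explicitly because the default hinge loss does not produce probabilities):

    from sklearn.linear_model import SGDClassifier

    # loss='log_loss' gives a logistic-regression-style model fit with SGD
    # (older scikit-learn versions use loss='log')
    sgdc = SGDClassifier(loss='log_loss', random_state=42)
    sgdc.fit(X_train_tf, y_train)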

Naive Bayes

Naive Bayes is a traditional machine learning model. The algorithm uses Bayes’ rule, which calculates the probability of an event based on prior knowledge of the variables related to that event. The “naive” part is that the model assumes all features in the dataset are independent of each other given the class. This works well in areas such as robotics and computer vision, but we can also try it here! We can fit Naive Bayes with the following code.
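A minimal sketch using the Gaussian variant (the variable name nb is my own):

    from sklearn.naive_bayes import GaussianNB

    nb = GaussianNB()
    nb.fit(X_train_tf, y_train)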

Decision Tree

Another class of popular machine learning models is tree-based methods. The simplest tree-based method is the decision tree. The goal of a decision tree is to build a model that predicts the class or value of the target variable by learning simple decision rules inferred from the training data. To predict the class label for a record, we start from the root of the tree. One advantage of tree-based methods is that they make no assumptions about the structure of the data and are able to pick up non-linear effects given sufficient tree depth. We can fit a decision tree using the following code.
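A minimal sketch with defaults (the variable name tree is my own):

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(random_state=42)
    tree.fit(X_train_tf, y_train)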

Random forest

One disadvantage of decision trees is that they tend to overfit very easily by memorizing the training data. Overfitting occurs when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. Random forests were created to reduce this overfitting. In a random forest, multiple trees are created and their results are aggregated. The trees in a forest are decorrelated by training each one on a random set of samples and a random subset of features. In most cases, random forests work better than decision trees because they generalize more easily. To fit a random forest, we can use the following code.
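A minimal sketch with defaults (the variable name rf is my own):

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(random_state=42)
    rf.fit(X_train_tf, y_train)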

Gradient Boosting Classifier

Boosting is a technique in which new decision trees are built sequentially, with each tree focusing on correcting the errors of the trees that came before it, gradually improving the predictions of the model. A model that combines this technique with gradient descent (controlling the learning rate) is known as a gradient boosting classifier. As a testament to this approach, the XGBoost library, a gradient boosting implementation, has been a deciding factor in winning many Kaggle data science competitions! To fit the gradient boosting classifier, we can apply the following code.
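A minimal sketch with defaults (the variable name gbc is my own):

    from sklearn.ensemble import GradientBoostingClassifier

    gbc = GradientBoostingClassifier(random_state=42)
    gbc.fit(X_train_tf, y_train)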

Analysis of Baseline Models

The next step is to make a dataframe with the results of all the baseline models and plot the outcomes using a package called seaborn. We will use the AUC to pick the best model. This is a good performance metric for model selection since it captures the trade-off between the true positive and false positive rates and does not require selecting a threshold.
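A sketch of the aggregation and plot, computing the train and validation AUC directly from the models fitted in the sketches above (the variable names follow my earlier sketches):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_auc_score

    # Collect the train and validation AUC of every fitted baseline model
    models = {'KNN': knn, 'Logistic Regression': lr, 'SGD': sgdc, 'Naive Bayes': nb,
              'Decision Tree': tree, 'Random Forest': rf, 'Gradient Boosting': gbc}

    rows = []
    for name, model in models.items():
        for split, X, y in [('train', X_train_tf, y_train), ('valid', X_valid_tf, y_valid)]:
            auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
            rows.append({'classifier': name, 'data set': split, 'auc': auc})

    df_results = pd.DataFrame(rows)
    sns.barplot(x='classifier', y='auc', hue='data set', data=df_results)
    plt.xticks(rotation=45)
    plt.show()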

As we can see, most of the models (except the decision tree) have similar performance on the validation set. There is some overfitting, as indicated by the drop between training and validation. Let’s check if we can improve this performance using a few more techniques.

Model Selection: Learning Curve

In this section, we diagnose how our models are doing by plotting a learning curve. We will make use of the learning curve code from scikit-learn’s website, with one small change: plotting the AUC instead of accuracy.
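A condensed sketch of that idea using scikit-learn’s learning_curve helper with scoring='roc_auc' (the plotting details and function name are my own):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import learning_curve

    def plot_learning_curve(estimator, title, X, y, cv=None):
        # Compute training and cross-validation AUC for increasing training set sizes
        train_sizes, train_scores, valid_scores = learning_curve(
            estimator, X, y, cv=cv, scoring='roc_auc',
            train_sizes=np.linspace(0.1, 1.0, 5))
        plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training AUC')
        plt.plot(train_sizes, valid_scores.mean(axis=1), 'o-', label='Cross-validation AUC')
        plt.title(title)
        plt.xlabel('Training examples')
        plt.ylabel('AUC')
        plt.legend(loc='best')
        plt.show()

    plot_learning_curve(rf, 'Learning curve (random forest)', X_train_tf, y_train, cv=3)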

In the case of the random forest, we can see the model has high variance because there is a large gap between the training and cross-validation scores. High variance causes an algorithm to model the noise in the training set (overfitting).

Depending on the learning curve, there are a few strategies we can employ to improve the models:

High Variance:
- Reduce number of features
- Decrease model complexity
- Add regularization
- Add more samples

High Bias:
- Add new features
- Increase model complexity
- Reduce regularization
- Change model architecture

Model Selection: Feature Importance

A way of improving your models is to understand which features are important to them. This can usually only be investigated for simpler models such as logistic regression or random forests. This analysis can help in several areas:

  • inspire new feature ideas: assists with both high bias and high variance
  • obtain a list of the top features to be used for feature reduction: helps with high variance
  • point out errors in your pipeline: helps with the robustness of the model

We can get the feature importance from logistic regression using the below.
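A sketch, pairing the coefficients of the fitted logistic regression with the feature names from the earlier sketches:

    # Pair each feature name with its logistic regression coefficient
    feature_importances = pd.DataFrame({
        'feature': cols_all,
        'importance': lr.coef_[0],
    }).sort_values('importance', ascending=False)

    print(feature_importances.head(10))   # most positive coefficients
    print(feature_importances.tail(10))   # most negative coefficients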

We can take a look at the top 50 positive and top 50 negative coefficients to get some insight.

After reviewing these charts, I realized the features that have the most impact on the predictive outcomes of the model are cons.price.idx and euribor3m, due to their high importance scores. cons.price.idx is the consumer price index, which measures changes in the price level of a weighted average market basket of consumer goods and services purchased by households. A lower price index will encourage clients to subscribe to a term deposit. Similarly, euribor3m is the three-month Euribor (Euro Interbank Offered Rate), the average interest rate at which banks lend to one another over three months. This is a metric that reflects clients’ ability to pay off short-term loans.

In a high variance situation, one technique that can be used is to reduce the number of variables to minimize overfitting. After this analysis, you could keep only the top N positive and negative features, or the top N most important random forest features. You might need to adjust N so that your performance does not drop drastically; for example, using only the top feature will likely drop the performance by a lot.

Feature importance plots may also reveal errors in your predictive model. You may have some data leakage in the cleaning process. Data leakage is the process of accidentally including something in the training data that allows the machine learning algorithm to cheat artificially. Similar things can happen when you combine datasets: suppose that when you merged the datasets, one of the classes ended up with NaN for some of the variables.

Model Selection: Hyperparameter Tuning

Hyperparameter tuning is the process of searching for the ideal model architecture; hyperparameters are the parameters that define that architecture. We are only going to optimize the hyperparameters for stochastic gradient descent, random forest, and the gradient boosting classifier. We will not optimize KNN since it takes a while to evaluate, and we will not optimize logistic regression since it performs similarly to stochastic gradient descent. Similarly, we will not optimize decision trees since they tend to overfit and perform worse than random forests and gradient boosting classifiers.

A good tool for hyperparameter tuning is grid search, where all possible combinations of the grid values are tested. This is a computationally intensive method. Another option is to randomly sample combinations from the grid. This technique is called random search and is also implemented in scikit-learn.

Now, we can create a grid over the random forest hyperparameters.
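A sketch of such a grid; the specific candidate values are illustrative choices of mine, not necessarily the exact grid used in the notebook:

    # Candidate values for the random forest hyperparameters (illustrative choices)
    random_grid = {
        'n_estimators': [100, 200, 400, 600, 800, 1000],
        'max_features': ['sqrt', 'log2'],
        'max_depth': [5, 10, 20, 50, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'bootstrap': [True, False],
    }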

To implement the RandomizedSearchCV function, we need something to score or evaluate a set of hyperparameters. Here we will use the AUC.

The three important parameters of RandomizedSearchCV are

  • scoring = evaluation metric used to pick the best model
  • n_iter = number of different combinations
  • cv = number of cross-validation splits

Note that increasing the last two of these will increase the run time but decrease the chance of overfitting. The number of variables and the grid size also influence the run time. Cross-validation is a method of splitting the data multiple times to get a better estimate of the performance metric. For the purposes of this project, we will limit the cross-validation to 2 folds to reduce the time.

Let’s fit our Randomized Search random forest with the following code.
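A sketch of the randomized search, using the grid above and the settings just discussed (n_iter is an illustrative choice):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    rf_random = RandomizedSearchCV(
        estimator=RandomForestClassifier(random_state=42),
        param_distributions=random_grid,
        n_iter=20,            # number of hyperparameter combinations to try
        scoring='roc_auc',    # pick the best model by AUC
        cv=2,                 # 2 cross-validation folds to keep the runtime down
        random_state=42,
        verbose=1)
    rf_random.fit(X_train_tf, y_train)

    print(rf_random.best_params_)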

We can analyze the performance of the best model compared to the baseline model.

In the same way, we can optimize the performance of the stochastic gradient descent and gradient boosting classifiers.

We can then aggregate the results and compare them to the baseline models on the validation set.

Looking at the results, we can see that the hyperparameter tuning improved the models, but not by much. This is most likely due to the fact that we have a high variance situation.

Model Selection: Best Classifier

In this phase, we will choose the gradient boosting classifier since it has the best AUC on the validation set. You won’t want to retrain your best classifier every time you want to run new predictions, so we need to save the classifier. We will use the pickle package.
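A sketch, assuming the tuned gradient boosting search object from the previous step was stored as gbc_random (both that name and the output file name are assumptions):

    import pickle

    # Keep the best tuned gradient boosting model and write it to disk
    best_model = gbc_random.best_estimator_
    pickle.dump(best_model, open('best_classifier.pkl', 'wb'))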

Model Evaluation

Now that we have chosen our best model (the optimized gradient boosting classifier), let’s evaluate its performance on the test set.
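A sketch of the test-set evaluation, loading the scaler and classifier saved in the earlier sketches:

    import pickle

    # Load the saved scaler and best classifier (file names follow the earlier sketches)
    scaler = pickle.load(open('scaler.sav', 'rb'))
    best_model = pickle.load(open('best_classifier.pkl', 'rb'))

    # Only now do we transform the held-out test data and make predictions
    X_test_tf = scaler.transform(X_test)
    y_test_preds = best_model.predict_proba(X_test_tf)[:, 1]

    print('Test:')
    print_report(y_test, y_test_preds, thresh)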

Lastly, the final evaluation is shown below!

Additionally, we can create the ROC curves for the three datasets as shown below:
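A sketch of the ROC plot, reusing the predictions and data matrices from the sketches above:

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, roc_auc_score

    # Predicted probabilities from the best model for the train and validation splits
    y_train_preds = best_model.predict_proba(X_train_tf)[:, 1]
    y_valid_preds = best_model.predict_proba(X_valid_tf)[:, 1]

    for name, y_actual, y_pred in [('train', y_train, y_train_preds),
                                   ('valid', y_valid, y_valid_preds),
                                   ('test', y_test, y_test_preds)]:
        fpr, tpr, _ = roc_curve(y_actual, y_pred)
        plt.plot(fpr, tpr, label='%s AUC = %.3f' % (name, roc_auc_score(y_actual, y_pred)))

    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.legend(loc='lower right')
    plt.show()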

Conclusion

Through this project, we created a machine learning model that is able to predict how likely clients are to subscribe to a bank term deposit. The best model was the gradient boosting classifier with optimized hyperparameters. Our model’s test performance (AUC) is 79.5%. A precision of 0.82 divided by a prevalence of 0.50 gives a lift of about 1.6, which means the model is about 1.6 times better than random guessing. The model was able to catch 62% of customers who will subscribe to a term deposit. We should focus on targeting customers based on cons.price.idx (consumer price index) and euribor3m (the 3-month Euribor rate), as they are high-importance features for the model and the business. Knowing the characteristics of the clients we should market to saves time and money, and that will lead to increased growth and revenue.

References

S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22–31, June 2014

A. Long. Using Machine Learning to Predict Hospital Readmission for Patients with Diabetes with Scikit-Learn. October 2018


Emeka Efidi
The Startup

Eclectic mind. Interested in startups, data science and finance.