Bank Institution Term Deposit Predictive Model

Overview

Business Need

You have just completed a rigorous interview process and joined the Bank of Portugal as a machine learning researcher. The investment and portfolio department wants to identify customers who are likely to subscribe to its term deposits. Marketing managers are increasingly interested in tuning their directed campaigns through careful selection of contacts, so your employer's goal is a model that predicts which future clients will subscribe to a term deposit. An effective predictive model would increase campaign efficiency: the bank could identify likely subscribers, direct its marketing efforts at them, and better manage its resources (e.g. human effort, phone calls, time).

The Bank of Portugal has therefore collected a large amount of data, including the profiles of customers who subscribed to a term deposit and of those who did not. As its newly employed machine learning researcher, you are asked to build a robust predictive model that identifies which customers will or will not subscribe to a term deposit in the future.

Your main tasks as a machine learning researcher are data exploration, data cleaning, feature extraction, and the development of robust machine learning models to support the department.

Data and Features

The dataset can be downloaded from the UCI Machine Learning Repository website, which also provides more details about the data. The website gives access to four datasets:

  1. bank-additional-full.csv with all examples and 20 inputs
  2. bank-additional.csv with 10% of the examples and 20 inputs
  3. bank-full.csv with all examples and 17 inputs
  4. bank.csv with 10% of the examples and 17 inputs

Figure: the first five rows of the dataset, shown in two parts.

In the above dataset, the numerical variables are:

age, duration, campaign, pdays, previous, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed

And the categorical variables are:

job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome, y
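
The same split can be derived from the DataFrame itself rather than typed by hand. This is a minimal sketch, assuming numbers load as numeric dtypes and text as non-numeric dtypes; the three-row frame is invented toy data, not rows from the bank dataset.

```python
import pandas as pd

# Toy frame using a few of the dataset's column names; the values are invented.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "job": ["admin.", "services", "retired"],
    "euribor3m": [4.857, 1.313, 0.884],
    "y": ["no", "no", "yes"],
})

# Numeric columns: integer and float dtypes.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Categorical columns: everything that is not numeric.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

On the real data, the same two calls on bank_additional_full should reproduce the lists above, provided the columns load with the expected dtypes.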

Importing Libraries:

# Import the libraries needed for data exploration
import pandas as pd
import numpy as np
# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic: render plots inline in the notebook
%matplotlib inline
sns.set()

Data Description:

# bank client data:

1 - age (numeric)

2 - job: type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')

3 - marital: marital status (categorical: 'divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')

5 - default: has credit in default? (categorical: 'no', 'yes', 'unknown')

6 - housing: has housing loan? (categorical: 'no', 'yes', 'unknown')

7 - loan: has personal loan? (categorical: 'no', 'yes', 'unknown')

# related to the last contact of the current campaign:

8 - contact: contact communication type (categorical: 'cellular', 'telephone')

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', …, 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute strongly affects the output target (e.g., if duration=0 then y='no'). However, the duration is not known before a call is made, and once the call ends y is obviously known. This input should therefore be included only for benchmarking and discarded if the goal is a realistic predictive model.

# other attributes:

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed after the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success')

# social and economic context attributes:

16 - emp.var.rate: employment variation rate, quarterly indicator (numeric)

17 - cons.price.idx: consumer price index, monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index, monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate, daily indicator (numeric)

20 - nr.employed: number of employees, quarterly indicator (numeric)

Output variable (desired target):

21 - y: has the client subscribed to a term deposit? (binary: 'yes', 'no')
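
Two points in the description translate directly into preprocessing steps: duration should be dropped for a realistic model, and 999 in pdays is a placeholder rather than a real number of days. The following is a sketch under assumptions: the toy rows are invented, and the column name previously_contacted is a hypothetical helper, not part of the dataset.

```python
import pandas as pd

# Invented rows using three of the documented columns.
toy = pd.DataFrame({
    "duration": [261, 0, 149],
    "pdays": [999, 6, 999],
    "y": ["no", "no", "yes"],
})

# duration is only known after the call, so drop it for a realistic model.
features = toy.drop(columns=["duration"])

# 999 is a sentinel for "not previously contacted", not a real day count.
# previously_contacted is a hypothetical helper column.
features["previously_contacted"] = features["pdays"] != 999

print(features)
```

Keeping the flag separate lets a model distinguish "never contacted" from a genuinely long gap, which the raw value 999 would otherwise blur.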

Importing Dataset:

Let's import the dataset using the read_csv method and assign it to the variable bank_additional_full.

# Path to the downloaded file (assumed to be in the working directory)
data_file = "bank-additional-full.csv"

def read_csv_file(file, sep):
    # Read a delimited text file into a DataFrame
    csv_file = pd.read_csv(file, sep=sep)
    return csv_file

bank_additional_full = read_csv_file(data_file, ";")
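
The separator argument matters because the file is read with ";" rather than the default comma. A small sketch of what happens with and without it, using two invented rows held in memory instead of the real file:

```python
import io
import pandas as pd

# Two invented rows in a semicolon-delimited layout.
raw = "age;job;y\n56;housemaid;no\n37;services;yes\n"

# With the default comma separator, each line is read as one wide column.
wrong = pd.read_csv(io.StringIO(raw))
# Passing sep=";" splits the fields as intended.
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape)
print(right.shape)
```

A single-column result after loading is therefore a quick sign that the wrong separator was used.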

Identification of data types:

The .dtypes attribute identifies the data type of each variable in the dataset.

bank_additional_full.dtypes

Size of the dataset:

We can get the size of the dataset using the .shape attribute.

shape_bank_additional_full = bank_additional_full.shape
print(f"The data has (rows, columns): {shape_bank_additional_full}")

Statistical Summary of Numeric Variables:

The pandas describe() method shows basic statistics such as the count, mean, standard deviation, minimum, percentiles, and maximum of a DataFrame or Series. By default it summarizes only the numeric columns.

bank_additional_full.describe()
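
Text columns need a separate call, since the default summary covers numeric columns only. A minimal sketch on invented toy rows; excluding numeric dtypes is one way to select the categorical columns.

```python
import pandas as pd

# Invented rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "age": [30, 45, 52, 38],
    "marital": ["married", "single", "married", "married"],
    "y": ["no", "no", "yes", "no"],
})

# Default call: numeric columns only.
numeric_summary = toy.describe()
# Non-numeric columns: count, number of unique values, most frequent value, its frequency.
categorical_summary = toy.describe(exclude=["number"])

print(categorical_summary)
```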

To get the count of unique values:

The value_counts() method in Pandas returns a series containing the counts of all the unique values in a column. The output will be in descending order so that the first element is the most frequently-occurring element.

Let's apply value_counts() to the target column y:

bank_additional_full['y'].value_counts()
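
Raw counts are easier to interpret next to proportions, which show how balanced the two classes are. A short sketch with invented target values; the real class balance will differ.

```python
import pandas as pd

# Invented target values, not taken from the bank dataset.
y = pd.Series(["no", "no", "no", "yes"], name="y")

# Absolute counts, most frequent value first.
counts = y.value_counts()
# Same counts expressed as fractions of the total.
shares = y.value_counts(normalize=True)

print(counts)
print(shares)
```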

Finding null values:

When a dataset is imported from a CSV file, blank cells are read as null (NaN) values, which can later cause problems when operating on the DataFrame. The pandas isnull() method is used to detect these null values.

bank_additional_full.isnull().sum()

We can see that there are no null values in the dataset. Note, however, that several categorical columns use the label 'unknown', which isnull() does not count as missing.
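
The data description lists 'unknown' as a category for several variables, so a null check alone can understate how much information is missing. A sketch of counting those labels per column, on invented toy rows:

```python
import pandas as pd

# Invented rows; 'unknown' plays the role it has in the data description.
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services"],
    "default": ["unknown", "unknown", "no"],
    "age": [30, 45, 52],
})

# True nulls per column (none in this toy frame).
null_counts = toy.isnull().sum()
# 'unknown' labels per non-numeric column.
unknown_counts = toy.select_dtypes(exclude="number").eq("unknown").sum()

print(unknown_counts)
```

Whether to treat 'unknown' as its own category or as a missing value to impute is a modelling choice this sketch does not make.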

Graphical Univariate Analysis:

Histogram:

Histograms are one of the most common graphs used to display numeric data. There are two important things we can learn from a histogram:

  1. The distribution of the data: whether it is normally distributed or skewed (to the left or right).
  2. The presence of outliers: extremely low or high values that do not fall near any other data points.

Let's plot the distribution of the 'age' feature in our dataset. Because age takes whole-number values, a count plot with one bar per age works as a histogram with one-year bins.

plt.figure(figsize=(18,10))
sns.countplot(x='age', data=bank_additional_full)
plt.xlabel("Ages")
plt.ylabel("Counts")
plt.title("Counts of Ages in the Dataset")

Here, the distribution is skewed to the right.
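
The visual impression of right skew can be checked numerically: a right-skewed sample has positive skewness and a mean above its median. A minimal sketch on invented ages, not the real column:

```python
import pandas as pd

# Invented ages with a long right tail.
ages = pd.Series([25, 28, 30, 31, 33, 35, 38, 60, 85], name="age")

# Sample skewness: positive values indicate a longer right tail.
skewness = ages.skew()
mean, median = ages.mean(), ages.median()

print(f"skewness={skewness:.2f}, mean={mean:.1f}, median={median:.1f}")
```

On the real data, the same check would be bank_additional_full['age'].skew().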

Count Plots:

A count plot can be thought of as a histogram across a categorical, instead of numeric, variable. It is used to find the frequency of each category.

Count of education levels in the dataset:

plt.figure(figsize=(12,8))
sns.countplot(x='education', data=bank_additional_full)
plt.xlabel("Education Status")
plt.ylabel("Counts")
plt.title("Education Status Counts")

Here, we can see that university.degree is the most frequent category in the education column.

Marital status count in the dataset:
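
A minimal sketch of such a count plot, mirroring the education example above. The toy rows are invented so the block runs on its own; with the real data, pass data=bank_additional_full instead.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented rows; only the column name comes from the dataset.
toy = pd.DataFrame({"marital": ["married", "single", "married", "divorced"]})

plt.figure(figsize=(12, 8))
# One bar per marital status, with height equal to its frequency.
ax = sns.countplot(x="marital", data=toy)
ax.set_xlabel("Marital Status")
ax.set_ylabel("Counts")
ax.set_title("Marital Status Counts")
```

Replacing x="marital" with x="y" gives the corresponding plot for the target variable.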

Box Plots:

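A box plot is a visual representation of the five-number summary of a data set: the minimum, first quartile, median, third quartile, and maximum. Here we plot age by education level, split by the target y.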

plt.figure(figsize=(10,5))
plot = sns.boxplot(x = "education", y = "age", hue = "y", data= bank_additional_full)
plt.xticks( rotation=45, horizontalalignment='right' )
plot.set_title('Box plot of Education and Age')
plt.xlabel("Education")
plt.ylabel("Age")
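
The numbers behind a box plot can also be computed directly. This sketch assumes the default whisker rule of 1.5 times the interquartile range and uses invented ages rather than the real column.

```python
import pandas as pd

# Invented ages with two unusually high values.
ages = pd.Series([25, 28, 30, 31, 33, 35, 38, 60, 85], name="age")

# Quartiles define the box; the median is the line inside it.
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```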

