Predicting the success of bank telemarketing using logistic regression

Nafiu
7 min read · Sep 18, 2022



The goal of this article is to build, in a simple way, a binary classification model using logistic regression that predicts whether a client will subscribe to a bank term deposit (the variable y of the dataset). The dataset we are going to use is the Bank Marketing dataset, which is publicly available at: https://archive.ics.uci.edu/ml/datasets/bank+marketing

Let's get started.

Importing the required libraries:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

Let's load the data using pandas:

# loading the data
data=pd.read_csv('dataset/bank-additional.csv',sep=";")
data.info()

As we can see, this dataset contains 21 attributes, including the output/target variable. These attributes are described below:

bank client data:

  1. age (numeric)
  2. job : type of job (categorical: “admin.”,”blue-collar”,”entrepreneur”,”housemaid”,”management”,”retired”,”self-employed”,”services”,”student”,”technician”,”unemployed”,”unknown”)
  3. marital : marital status (categorical: “divorced”,”married”,”single”,”unknown”; note: “divorced” means divorced or widowed)
  4. education (categorical: “basic.4y”,”basic.6y”,”basic.9y”,”high.school”,”illiterate”,”professional.course”,”university.degree”,”unknown”)
  5. default: has credit in default?
  6. housing: has housing loan?
  7. loan: has personal loan?

related with the last contact of the current campaign:

  8. contact: contact communication type
  9. month: last contact month of year
  10. day_of_week: last contact day of the week
  11. duration: last contact duration, in seconds (numeric)

other attributes:

  12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14. previous: number of contacts performed before this campaign and for this client (numeric)
  15. poutcome: outcome of the previous marketing campaign

social and economic context attributes:

  16. emp.var.rate: employment variation rate
  17. cons.price.idx: consumer price index
  18. cons.conf.idx: consumer confidence index
  19. euribor3m: euribor 3 month rate
  20. nr.employed: number of employees

Output variable (desired target):

  21. y — has the client subscribed a term deposit? (binary: "yes","no")

Well, back to the code: let's check the number of unique values in the categorical columns. For this we are going to create a function that prints the value counts as a table, and pass all the categorical variables through it.

# creating a function to print the value counts as a table
def plot_value_count_table(data, field):
    print(field)
    fig, ax = plt.subplots()
    fig.tight_layout()
    fig.patch.set_visible(False)
    ax.axis('off')
    ax.axis('tight')
    df = pd.DataFrame({
        'index': data[field].value_counts().index,
        'values': data[field].value_counts().values,
    })
    table = ax.table(cellText=df.values, colLabels=['index', 'value'], loc='center')
    plt.show()

# calling the function for every categorical column
for i in data.select_dtypes('object').columns.to_list():
    plot_value_count_table(data, i)
# this will print the value counts as tables
screenshot of the output

Let's start the process of data analysis and cleaning, where we are going to perform various operations on the dataset to deal with some arbitrary values.

First let’s look for any missing values.

data.isna().sum().sum()

Here we can see the number of missing values is 0 for the whole dataset. But if we look carefully at the value count tables above, we can see some "unknown" values, which are actually missing values. This is mentioned in the dataset description as "several missing values in some categorical attributes, all coded with the 'unknown' label". So we are going to change all the 'unknown' values to NaN and evaluate again.

data.replace('unknown', np.nan, inplace=True)
data.isna().sum().sum()
# now we can see the number of NaN values
# removing the rows with missing values
data = data.dropna()
data.isna().sum().sum()

Check for any duplicated rows and drop them if there are any.

data.duplicated().sum()
# dataset doesn't have any duplicate values

For the education variable, which has 7 unique values, there is a value named 'illiterate' that appears only once, so we are going to remove that single row. We can also see some similar values such as basic.9y, basic.4y and basic.6y, which we are going to merge into a single 'basic.school' value, as sketched below. :)
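Here is a minimal sketch of that cleanup (one of several ways to do it):

# drop the single 'illiterate' row
data = data[data['education'] != 'illiterate'].copy()
# merge the three basic.* levels into a single 'basic.school' label
data['education'] = data['education'].replace(
    {'basic.4y': 'basic.school', 'basic.6y': 'basic.school', 'basic.9y': 'basic.school'})
data['education'].value_counts()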

The 'pdays' variable: according to the description, the value 999 means the client was not previously contacted, so we are going to replace all the 999 values with 0.

data.loc[data['pdays'] == 999, 'pdays'] = 0
data['pdays'].value_counts()

Important note in the description

duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model. So we are going to drop this column.

data = data.drop(["duration"], axis = 1)

Let's get into visualizations and continue with the process.

# plotting our target variable, the column 'y'
ax = sns.countplot(x=data["y"])
for p in ax.patches:
    ax.annotate(f"{p.get_height()}", (p.get_x() + 0.25, p.get_height() - 200))
plt.show()

Now let's visualize all the categorical variables against our target variable 'y' to get a clear understanding of the relationship between the two.

cat = data.select_dtypes('object').columns.to_list()
cat = cat[:-1]  # exclude the target column 'y'
fig, axis = plt.subplots(len(cat), figsize=(15, 50))
axs_cnt = 0
for i in cat:
    axis[axs_cnt].set_title(i, fontdict={'fontsize': 15})
    sns.countplot(x=data[i], hue=data["y"], ax=axis[axs_cnt])
    for p in axis[axs_cnt].patches:
        axis[axs_cnt].annotate(f"{p.get_height()}", (p.get_x(), p.get_height() + 12))
    axs_cnt = axs_cnt + 1
plt.subplots_adjust(hspace=0.5)
plt.show()
screenshot of the output

Looking at the 'default' plot, we can see that the 'default' variable has practically no impact on the target, so we will drop it from the dataset.

data  = data.drop(["default"], axis = 1)

Plot all the numerical variables against our target variable 'y', and also check the distribution of each variable.

cat = data.select_dtypes(['float64', 'int64']).columns.to_list()
fig, axis = plt.subplots(len(cat), 2, figsize=(20, 30))
axs_cnt = 0
for i in cat:
    axs = axis[axs_cnt]
    sns.boxplot(data=data, x="y", y=i, ax=axs[0])
    axs[0].set_title(i, fontdict={'fontsize': 15})
    sns.histplot(data[i], ax=axs[1], kde=True)
    axs_cnt = axs_cnt + 1
plt.subplots_adjust(hspace=0.5)
plt.show()

And check whether there are any correlated variables in the dataset.

plt.figure(figsize=(8, 6))
# numeric_only=True keeps only the numeric columns (required on newer pandas,
# since the data still contains object columns at this point)
corr = data.corr(numeric_only=True)
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True);
screenshot of the output

Here we can see that the variables emp.var.rate, euribor3m and nr.employed are very highly correlated. So we are going to keep only one of these variables and remove the other two. To decide which one to keep, we will calculate the VIF (Variance Inflation Factor) of these 3 correlated variables and remove the 2 variables with the highest VIF. (The VIF of a variable is 1 / (1 − R²), where R² comes from regressing that variable on the others, so a higher VIF means the variable is more redundant.)

Calculating VIF:

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_x = data[['emp.var.rate', 'euribor3m', 'nr.employed']]
# new dataframe to hold the results
vif_data = pd.DataFrame()
vif_data["Variable"] = vif_x.columns

# calculating the VIF of each variable
vif_data["VIF"] = [variance_inflation_factor(vif_x.values, i) for i in range(len(vif_x.columns))]

vif_data
screenshot of the output

The highest VIFs belong to the variables 'euribor3m' and 'nr.employed', so we will remove these two and keep the 'emp.var.rate' variable.

emp.var.rate — employment variation rate (change of the employment rate)

data = data.drop(['euribor3m', 'nr.employed'], axis = 1)

Data processing

We are gonna make this a simple 4 step process.

  1. Label encode all the categorical variables
  2. Split into X and Y
  3. Split into train and test
  4. Standardize the data

First we are going to label encode all the categorical variables; this will convert them into numeric form.

from sklearn.preprocessing import LabelEncoder

d_types = dict(data.dtypes)
for name, type_ in d_types.items():
    if str(type_) == 'object':
        Le = LabelEncoder()
        data[name] = Le.fit_transform(data[name])

Next we are going to split the encoded data into x and y, where y is our target variable and x holds all the other variables.

# x: all the feature columns, y: the target column
x = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

Next, split the data into train and test subsets.

from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(x, y, test_size=0.2, random_state=4)

Before proceeding further, let's check the shape of our data.
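Something like this quick check will do (just printing the shapes of the four arrays):

print(f"Xtrain: {Xtrain.shape}, Ytrain: {Ytrain.shape}")
print(f"Xtest: {Xtest.shape}, Ytest: {Ytest.shape}")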

screenshot of the output

Standardize the data using StandardScaler from sklearn. This is the last step of data processing before creating and training the model.

from sklearn.preprocessing import StandardScaler
Scaler = StandardScaler()
Xtrain = Scaler.fit_transform(Xtrain)
Xtest = Scaler.transform(Xtest)

Model

Let's create the model and train it :)

Import LogisticRegression from sklearn:

from sklearn.linear_model import LogisticRegression

Create and train the model:

model = LogisticRegression()
model.fit(Xtrain , Ytrain)

So far we have cleaned the dataset, processed it, and created and trained the model. The next step is to evaluate the model.

from sklearn.metrics import classification_report, accuracy_score

Run predictions on the test dataset:

predictions = model.predict(Xtest)
report = classification_report(Ytest, predictions, output_dict=True)
report = pd.DataFrame(report).transpose()
report

Also, let's print the accuracy score separately.

print(f"accuracy score: {accuracy_score(predictions, Ytest)*100}")
screenshot of the output

Here we can see we got an accuracy score of about 90.8%. Keep in mind that the target is imbalanced (far more 'no' than 'yes'), so the per-class precision and recall in the classification report above are worth checking alongside the accuracy.
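To make the per-class errors visible, here is a small sketch of a confusion matrix for the same predictions (it reuses the seaborn and matplotlib imports from above):

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Ytest, predictions)
# rows are the actual classes, columns are the predicted classes
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('predicted')
plt.ylabel('actual')
plt.show()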

Well this is the end of this practical guide.

link to github repo: https://github.com/nafiu-dev/Bank-Telemarketing-logistic-regression
