Predicting the success of bank telemarketing using logistic regression
The goal of this article is to build a simple binary classification model with logistic regression that predicts whether a client will subscribe to a bank’s term deposit (the variable y in the dataset). The dataset we are going to use is the Bank Marketing dataset, which is publicly available at: https://archive.ics.uci.edu/ml/datasets/bank+marketing
Let’s get started
Importing the required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
Let’s load the data using pandas
# loading the data
data=pd.read_csv('dataset/bank-additional.csv',sep=";")
data.info()
As we can see, this dataset contains 21 attributes, including the output/target variable. These attributes are described below:
Bank client data:
1. age (numeric)
2. job: type of job (categorical: "admin.", "blue-collar", "entrepreneur", "housemaid", "management", "retired", "self-employed", "services", "student", "technician", "unemployed", "unknown")
3. marital: marital status (categorical: "divorced", "married", "single", "unknown"; note: "divorced" means divorced or widowed)
4. education (categorical: "basic.4y", "basic.6y", "basic.9y", "high.school", "illiterate", "professional.course", "university.degree", "unknown")
5. default: has credit in default?
6. housing: has housing loan?
7. loan: has personal loan?
Related to the last contact of the current campaign:
8. contact: contact communication type
9. month: last contact month of year
10. day_of_week: last contact day of the week
11. duration: last contact duration, in seconds (numeric)
Other attributes:
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign
Social and economic context attributes:
16. emp.var.rate: employment variation rate
17. cons.price.idx: consumer price index
18. cons.conf.idx: consumer confidence index
19. euribor3m: euribor 3 month rate
20. nr.employed: number of employees
Output variable (desired target):
21. y: has the client subscribed a term deposit? (binary: "yes", "no")
Back to the code: let’s check the unique values in the categorical columns. For this we are going to create a function that renders the value counts as a table, and pass every categorical variable through it.
# creating the function to print value counts as a table
def plot_value_count_table(data, field):
    print(field)
    fig, ax = plt.subplots()
    fig.tight_layout()
    fig.patch.set_visible(False)
    ax.axis('off')
    ax.axis('tight')
    # compute the counts once and reuse them for both table columns
    counts = data[field].value_counts()
    df = pd.DataFrame({
        'index': counts.index,
        'values': counts.values,
    })
    ax.table(cellText=df.values, colLabels=['index', 'value'], loc='center')
    plt.show()

# calling the function; this will print the value counts as tables
for i in data.select_dtypes('object').columns.to_list():
    plot_value_count_table(data, i)
Let’s start with data analysis and cleaning, where we are going to apply a few operations to deal with some arbitrary values in the dataset.
First let’s look for any missing values.
data.isna().sum().sum()
Here we can see that the number of missing values in the whole dataset is 0. But if we look carefully at the value counts above, we can see some “unknown” values, which are actually missing values. The dataset description mentions this: “several missing values in some categorical attributes, all coded with the ‘unknown’ label”. So we are going to change every ‘unknown’ value to NaN and evaluate again.
data.replace('unknown', np.nan, inplace=True)
data.isna().sum().sum()
# now we can see the number of NaN values
# removing the rows that contain them
data = data.dropna()
data.isna().sum().sum()
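A single grand total hides which columns are affected. As a minimal sketch on a made-up frame (the `toy` DataFrame and its values are illustrative assumptions, not rows from the bank data), the per-column count and the number of rows that survive `dropna()` can be checked like this:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bank data, only for illustration.
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "technician"],
    "education": ["unknown", "high.school", "unknown", "basic.9y"],
    "age": [30, 41, 52, 38],
})

# Treat the 'unknown' label as a missing value.
toy = toy.replace("unknown", np.nan)

# Count missing values per column instead of one grand total.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows that would survive dropna().
rows_kept = len(toy.dropna())
print(f"rows kept: {rows_kept} of {len(toy)}")
```

Running the same two checks on `data` before dropping rows shows how much of the dataset the clean-up costs.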
Check for any duplicated rows and drop them if there are any
data.duplicated().sum()
# dataset doesn't have any duplicate values
The education variable has 7 unique values. One of them, ‘illiterate’, appears only once, so we are going to remove that single row. We can also see some similar values, basic.4y, basic.6y and basic.9y, which we are going to merge into a single ‘basic.school’ value.
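The two education clean-ups described here could look roughly like the sketch below. It runs on a small made-up frame (`toy` is a hypothetical stand-in, and the list name `basic_levels` is only for illustration); the same two statements should apply to `data` in this article.

```python
import pandas as pd

# Hypothetical stand-in for the 'education' column of the bank data.
toy = pd.DataFrame({
    "education": ["basic.4y", "basic.6y", "basic.9y", "illiterate",
                  "high.school", "university.degree"],
})

# Drop the rare 'illiterate' category.
toy = toy[toy["education"] != "illiterate"].copy()

# Collapse the three basic-school levels into one label.
basic_levels = ["basic.4y", "basic.6y", "basic.9y"]
toy["education"] = toy["education"].replace(basic_levels, "basic.school")

print(toy["education"].value_counts())
```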
The ‘pdays’ variable: according to the description, the value 999 means the client was not previously contacted. So we are going to replace all the 999 values with 0. Note that this makes “never contacted” indistinguishable from a pdays of 0, so treat it as a simplification.
data.loc[data['pdays'] == 999, 'pdays'] = 0
data['pdays'].value_counts()
Important note in the description:
duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
So we are going to drop this column.
data = data.drop(["duration"], axis = 1)
Let’s get into the visualizations and continue with the process
# plotting our target variable, which is the column 'y'
ax = sns.countplot(x = data["y"])
for p in ax.patches:
    ax.annotate(f"{p.get_height()}", (p.get_x()+0.25, p.get_height()-200))
plt.show()
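Alongside the count plot, the class shares can be read off as plain numbers. A minimal sketch, using a made-up target column (the 8/2 split in `y_toy` is an assumption for illustration, not the real distribution):

```python
import pandas as pd

# Hypothetical target column: 8 'no' and 2 'yes'.
y_toy = pd.Series(["no"] * 8 + ["yes"] * 2, name="y")

# normalize=True returns fractions instead of raw counts.
share = y_toy.value_counts(normalize=True)
print(share)
```

On the real data, `data["y"].value_counts(normalize=True)` gives the same kind of summary and makes any class imbalance easy to spot.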
Now let’s visualize all the categorical variables relative to our target variable ‘y’ to get a clear understanding of the relationship between each variable and the target
cat = data.select_dtypes('object').columns.to_list()
cat = cat[:-1]  # leave out the target 'y', which is the last object column
fig, axis = plt.subplots(len(cat), figsize=(15, 50))
axs_cnt = 0
for i in cat:
    axis[axs_cnt].set_title(i, fontdict={'fontsize':15})
    sns.countplot(x = data[i], hue=data["y"], ax=axis[axs_cnt])
    for p in axis[axs_cnt].patches:
        axis[axs_cnt].annotate(f"{p.get_height()}", (p.get_x()+0, p.get_height()+12))
    axs_cnt = axs_cnt+1
plt.subplots_adjust(hspace=0.5)
plt.show()
Looking at the ‘default’ plot, we can see that the ‘default’ variable shows no visible impact on the target, so we will drop it from the dataset
data = data.drop(["default"], axis = 1)
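A plot-based judgement like this can be backed by numbers with a cross-tabulation. The sketch below uses a made-up frame (`toy` and its values are assumptions for illustration) and shows the share of each outcome within every category of a feature:

```python
import pandas as pd

# Hypothetical feature/target pair, only for illustration.
toy = pd.DataFrame({
    "default": ["no", "no", "no", "no", "yes"],
    "y":       ["no", "yes", "no", "no", "no"],
})

# normalize="index" turns each row into shares that sum to 1.
rates = pd.crosstab(toy["default"], toy["y"], normalize="index")
print(rates)
```

If the rows of such a table look alike, or one category holds nearly all the rows, the feature is unlikely to help the model much.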
Next, plot all the numerical variables relative to our target variable ‘y’, and also check the distribution of each variable
cat = data.select_dtypes(['float64', 'int64']).columns.to_list()
fig, axis = plt.subplots(len(cat),2, figsize=(20,30))
axs_cnt = 0
for i in cat:
    axs = axis[axs_cnt]
    sns.boxplot(data=data, x="y", y=i, ax=axs[0])
    axs[0].set_title(i, fontdict={'fontsize':15})
    sns.histplot(data[i], ax=axs[1], kde=True)
    axs_cnt = axs_cnt+1
plt.subplots_adjust(hspace=0.5)
plt.show()
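And check whether any variables in the dataset are correlated
plt.figure(figsize=(8, 6))
# correlate the numeric columns only; the text columns are not encoded yet
corr = data.select_dtypes('number').corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True)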
Here we can see that the variables emp.var.rate, euribor3m and nr.employed are very highly correlated, so we are going to keep only one of them and remove the other two. To decide which one to keep, we are going to calculate the VIF (Variance Inflation Factor) of these 3 correlated variables and remove the 2 with the highest VIF.
Calculating VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_x = data[['emp.var.rate','euribor3m', 'nr.employed']]
# new dataframe
vif_data = pd.DataFrame()
vif_data["Variable"] = vif_x.columns
# calculating VIF
vif_data["VIF"] = [variance_inflation_factor(vif_x.values, i) for i in range(len(vif_x.columns))]
vif_data
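For intuition, the VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on the others. The sketch below computes it by hand with NumPy on synthetic data (the columns `a`, `b`, `c` and the helper `vif_with_intercept` are made up for illustration). One assumption worth flagging: this version adds an intercept to each regression, whereas `variance_inflation_factor` appears to use the columns exactly as passed, so the numbers need not match the table above.

```python
import numpy as np

def vif_with_intercept(X: np.ndarray, j: int) -> float:
    """VIF of column j: regress it on the other columns plus an intercept."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    ss_res = residuals @ residuals
    centered = target - target.mean()
    ss_tot = centered @ centered
    r_squared = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r_squared)

# Synthetic data: b is nearly a copy of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)
X = np.column_stack([a, b, c])

vifs = [vif_with_intercept(X, j) for j in range(X.shape[1])]
print(vifs)
```

The two near-duplicate columns get a large VIF while the independent one stays close to 1, which is the pattern to look for among correlated features.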
The highest VIFs belong to ‘euribor3m’ and ‘nr.employed’, so we will remove these two variables and keep ‘emp.var.rate’.
emp.var.rate: employment variation rate (change of the employment rate)
data = data.drop(['euribor3m', 'nr.employed'], axis = 1)
Data processing
We are going to make this a simple 4-step process.
- Label encode all the categorical variables
- Split into X and Y
- Split into train and test
- Standardize the data
First we are going to label encode all the categorical variables. This converts every text column, including the target, into numeric form.
from sklearn.preprocessing import LabelEncoder
d_types = dict(data.dtypes)
for name, type_ in d_types.items():
    if str(type_) == 'object':
        # a fresh encoder per column, so each column gets its own codes
        Le = LabelEncoder()
        data[name] = Le.fit_transform(data[name])
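To see what the encoder does to a single column, here is a minimal sketch on a made-up list of labels (the values are illustrative, not taken from the dataset). `LabelEncoder` sorts the distinct labels and numbers them from 0:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["yes", "no", "no", "yes", "no"])

# The learned classes are stored in sorted order.
print(list(encoder.classes_))
# Each label is replaced by its position in that sorted list.
print(list(codes))
# The mapping can be reversed when the original labels are needed.
print(list(encoder.inverse_transform([1, 0])))
```

For the target this means "no" becomes 0 and "yes" becomes 1. Keep in mind that the integer codes of multi-valued columns such as job imply an ordering that the categories do not really have.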
Next we are going to split the encoded data into x and y, where y is our target variable and x is all the other columns
x = data.iloc[:, :-1].values  # every column except the target
y = data.iloc[:, -1].values
Next, split the data into train and test subsets
from sklearn.model_selection import train_test_split
Xtrain , Xtest , Ytrain , Ytest = train_test_split(x, y, test_size = 0.2 , random_state = 4)
Before proceeding further, let’s check the shape of our data
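A shape check could look like the sketch below. It uses small made-up arrays (`x_toy` and `y_toy` are hypothetical) so that it runs on its own; in this article the same print would be applied to Xtrain, Xtest, Ytrain and Ytest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 rows, 2 features, binary labels.
x_toy = np.arange(100).reshape(50, 2)
y_toy = np.arange(50) % 2

Xtr, Xte, Ytr, Yte = train_test_split(x_toy, y_toy, test_size=0.2, random_state=4)

# With test_size=0.2, 10 of the 50 rows go to the test set.
print(Xtr.shape, Xte.shape, Ytr.shape, Yte.shape)
```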
Standardize the data using StandardScaler from sklearn. This is the last step of data processing before creating and training the model
from sklearn.preprocessing import StandardScaler
Scaler = StandardScaler()
Xtrain = Scaler.fit_transform(Xtrain)
Xtest = Scaler.transform(Xtest)
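The scaler is fitted on the training set only and then reused on the test set, so the test data never influences the mean and standard deviation. A minimal sketch with made-up numbers (the arrays `train` and `test` are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data.
train_scaled = scaler.fit_transform(train)
# transform reuses those training statistics on unseen data.
test_scaled = scaler.transform(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1, while the test value is expressed in the training set's units and can fall well outside that range.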
Model
Let’s create the model and train it.
Import logistic regression from sklearn
from sklearn.linear_model import LogisticRegression
Create and Train
model = LogisticRegression()
model.fit(Xtrain , Ytrain)
So far, we have cleaned the dataset, processed it, and created and trained the model. The next step is to evaluate our model
from sklearn.metrics import classification_report, accuracy_score
Run predictions on the testing dataset
predictions = model.predict(Xtest)
report = classification_report(Ytest, predictions, output_dict=True)
report = pd.DataFrame(report).transpose()
report
Also, let’s print the accuracy score separately
print(f"accuracy score: {accuracy_score(Ytest, predictions)*100}")
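Here we can see we got an accuracy score of about 90.77%, which looks very good. Keep in mind, though, that most clients in this dataset did not subscribe, so accuracy alone can be flattering; the per-class precision and recall in the report above give a fuller picture.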
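To put an accuracy figure in context, it helps to compare it with a model that always predicts the majority class. The sketch below uses made-up labels (9 "no" and 1 "yes", encoded as 0 and 1), not the bank data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: nine negatives and one positive.
y_true = np.array([0] * 9 + [1])
# A "model" that always predicts the majority class.
y_pred = np.zeros(10, dtype=int)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

print(f"accuracy: {acc:.2f}")
print(f"recall for the positive class: {rec:.2f}")
print(cm)
```

The baseline scores high on accuracy while finding none of the positives, which is why the recall of the "yes" class is worth checking next to the headline accuracy.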
Well, this is the end of this practical guide.
Link to GitHub repo: https://github.com/nafiu-dev/Bank-Telemarketing-logistic-regression