Robert Baker
33 min read · May 26, 2023

Data Analytics — A Bank Telemarketing Campaign Analysis

Robert Baker & Daniel Chapman

15 April 2023

The data presents the results of a telemarketing campaign run by a Portuguese bank, in which over 41,000 customers were contacted. The purpose of the campaign was to assess whether customers would subscribe to a new product (a bank term deposit account): ‘yes’ or ‘no’.

The goal of this report is to develop classification models that predict whether a customer will subscribe (yes/no) to a term deposit account (target variable ‘y’).

import numpy as np
import pandas as pd
import pandas_profiling as pp
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb
import graphviz
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics, tree
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from pandas_profiling import ProfileReport
from sklearn.experimental import enable_hist_gradient_boosting  # needed before the next import on older scikit-learn versions
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import neural_network as nn
from sklearn.metrics import roc_curve, auc
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from numpy import mean
from numpy import std
from sklearn.model_selection import GridSearchCV

Importing Dataframe

df = pd.read_csv("bank-additional-full.csv", sep = ';')
df.head()
png

Step 1 — Exploratory Data Analysis

A). Basic Information of the Data

# Produces a report of the different Data Types for each column.
print('Data Types of Columns (input Variables)')
df.dtypes
Data Types of Columns (input Variables)





age int64
job object
marital object
education object
default object
housing object
loan object
contact object
month object
day_of_week object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
emp.var.rate float64
cons.price.idx float64
cons.conf.idx float64
euribor3m float64
nr.employed float64
y object
dtype: object
# Outlines the general shape of the dataframe (number of rows and columns).
print(f'No. of rows in the Data Frame - {df.shape[0]}')
print(f'No. of columns in the Data Frame - {df.shape[1]}')
No. of rows in the Data Frame - 41188
No. of columns in the Data Frame - 21
# Produces a Data Quality report for Numeric variables.
df.describe().transpose()
png
# Produces a Data Quality Report of the Categorical variables.
df.describe(include=['object', 'category']).transpose()
png
# Pandas Profiling - an auto EDA to get a quick overview of the data
# incl. stats, skewness, correlations etc
report = ProfileReport(df)
report

B). Null-Values.

df.isnull().any()
age               False
job False
marital False
education False
default False
housing False
loan False
contact False
month False
day_of_week False
duration False
campaign False
pdays False
previous False
poutcome False
emp.var.rate False
cons.price.idx False
cons.conf.idx False
euribor3m False
nr.employed False
y False
dtype: bool

There are no null values present in the data set.

#Provides a count of Non-null entities in each column.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 41188 non-null int64
1 job 41188 non-null object
2 marital 41188 non-null object
3 education 41188 non-null object
4 default 41188 non-null object
5 housing 41188 non-null object
6 loan 41188 non-null object
7 contact 41188 non-null object
8 month 41188 non-null object
9 day_of_week 41188 non-null object
10 duration 41188 non-null int64
11 campaign 41188 non-null int64
12 pdays 41188 non-null int64
13 previous 41188 non-null int64
14 poutcome 41188 non-null object
15 emp.var.rate 41188 non-null float64
16 cons.price.idx 41188 non-null float64
17 cons.conf.idx 41188 non-null float64
18 euribor3m 41188 non-null float64
19 nr.employed 41188 non-null float64
20 y 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usage: 6.6+ MB

No null entries have been found in the DataFrame (df).

C). Unique Entities

for col in df:
    print(col)
    print(df[col].unique())
    print('\n')
age
[56 57 37 40 45 59 41 24 25 29 35 54 46 50 39 30 55 49 34 52 58 32 38 44
42 60 53 47 51 48 33 31 43 36 28 27 26 22 23 20 21 61 19 18 70 66 76 67
73 88 95 77 68 75 63 80 62 65 72 82 64 71 69 78 85 79 83 81 74 17 87 91
86 98 94 84 92 89]


job
['housemaid' 'services' 'admin.' 'blue-collar' 'technician' 'retired'
'management' 'unemployed' 'self-employed' 'unknown' 'entrepreneur'
'student']


marital
['married' 'single' 'divorced' 'unknown']


education
['basic.4y' 'high.school' 'basic.6y' 'basic.9y' 'professional.course'
'unknown' 'university.degree' 'illiterate']


default
['no' 'unknown' 'yes']


housing
['no' 'yes' 'unknown']


loan
['no' 'yes' 'unknown']


contact
['telephone' 'cellular']


month
['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'mar' 'apr' 'sep']


day_of_week
['mon' 'tue' 'wed' 'thu' 'fri']


duration
[ 261 149 226 ... 1246 1556 1868]


campaign
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 19 18 23 14 22 25 16 17 15 20 56
39 35 42 28 26 27 32 21 24 29 31 30 41 37 40 33 34 43]


pdays
[999 6 4 3 5 1 0 10 7 8 9 11 2 12 13 14 15 16
21 17 18 22 25 26 19 27 20]


previous
[0 1 2 3 4 5 6 7]


poutcome
['nonexistent' 'failure' 'success']


emp.var.rate
[ 1.1 1.4 -0.1 -0.2 -1.8 -2.9 -3.4 -3. -1.7 -1.1]


cons.price.idx
[93.994 94.465 93.918 93.444 93.798 93.2 92.756 92.843 93.075 92.893
92.963 92.469 92.201 92.379 92.431 92.649 92.713 93.369 93.749 93.876
94.055 94.215 94.027 94.199 94.601 94.767]


cons.conf.idx
[-36.4 -41.8 -42.7 -36.1 -40.4 -42. -45.9 -50. -47.1 -46.2 -40.8 -33.6
-31.4 -29.8 -26.9 -30.1 -33. -34.8 -34.6 -40. -39.8 -40.3 -38.3 -37.5
-49.5 -50.8]


euribor3m
[4.857 4.856 4.855 4.859 4.86 4.858 4.864 4.865 4.866 4.967 4.961 4.959
4.958 4.96 4.962 4.955 4.947 4.956 4.966 4.963 4.957 4.968 4.97 4.965
4.964 5.045 5. 4.936 4.921 4.918 4.912 4.827 4.794 4.76 4.733 4.7
4.663 4.592 4.474 4.406 4.343 4.286 4.245 4.223 4.191 4.153 4.12 4.076
4.021 3.901 3.879 3.853 3.816 3.743 3.669 3.563 3.488 3.428 3.329 3.282
3.053 1.811 1.799 1.778 1.757 1.726 1.703 1.687 1.663 1.65 1.64 1.629
1.614 1.602 1.584 1.574 1.56 1.556 1.548 1.538 1.531 1.52 1.51 1.498
1.483 1.479 1.466 1.453 1.445 1.435 1.423 1.415 1.41 1.405 1.406 1.4
1.392 1.384 1.372 1.365 1.354 1.344 1.334 1.327 1.313 1.299 1.291 1.281
1.266 1.25 1.244 1.259 1.264 1.27 1.262 1.26 1.268 1.286 1.252 1.235
1.224 1.215 1.206 1.099 1.085 1.072 1.059 1.048 1.044 1.029 1.018 1.007
0.996 0.979 0.969 0.944 0.937 0.933 0.927 0.921 0.914 0.908 0.903 0.899
0.884 0.883 0.881 0.879 0.873 0.869 0.861 0.859 0.854 0.851 0.849 0.843
0.838 0.834 0.829 0.825 0.821 0.819 0.813 0.809 0.803 0.797 0.788 0.781
0.778 0.773 0.771 0.77 0.768 0.766 0.762 0.755 0.749 0.743 0.741 0.739
0.75 0.753 0.754 0.752 0.744 0.74 0.742 0.737 0.735 0.733 0.73 0.731
0.728 0.724 0.722 0.72 0.719 0.716 0.715 0.714 0.718 0.721 0.717 0.712
0.71 0.709 0.708 0.706 0.707 0.7 0.655 0.654 0.653 0.652 0.651 0.65
0.649 0.646 0.644 0.643 0.639 0.637 0.635 0.636 0.634 0.638 0.64 0.642
0.645 0.659 0.663 0.668 0.672 0.677 0.682 0.683 0.684 0.685 0.688 0.69
0.692 0.695 0.697 0.699 0.701 0.702 0.704 0.711 0.713 0.723 0.727 0.729
0.732 0.748 0.761 0.767 0.782 0.79 0.793 0.802 0.81 0.822 0.827 0.835
0.84 0.846 0.87 0.876 0.885 0.889 0.893 0.896 0.898 0.9 0.904 0.905
0.895 0.894 0.891 0.89 0.888 0.886 0.882 0.88 0.878 0.877 0.942 0.953
0.956 0.959 0.965 0.972 0.977 0.982 0.985 0.987 0.993 1. 1.008 1.016
1.025 1.032 1.037 1.043 1.045 1.047 1.05 1.049 1.046 1.041 1.04 1.039
1.035 1.03 1.031 1.028]


nr.employed
[5191. 5228.1 5195.8 5176.3 5099.1 5076.2 5017.5 5023.5 5008.7 4991.6
4963.6]


y
['no' 'yes']
df.astype('object').describe(include='all').loc['unique', :]
age                  78
job 12
marital 4
education 8
default 3
housing 3
loan 3
contact 2
month 10
day_of_week 5
duration 1544
campaign 42
pdays 27
previous 8
poutcome 3
emp.var.rate 10.0
cons.price.idx 26.0
cons.conf.idx 26.0
euribor3m 316.0
nr.employed 11.0
y 2
Name: unique, dtype: object

Data Visualisation: Unique Client Data

fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(18, 12))

df['age'].plot(kind='hist', ax=axes[0, 0], title='Age of Client', bins=78)
df['job'].value_counts().plot(kind='bar', ax=axes[0, 1], title='Jobs of Client')
df['marital'].value_counts().plot(kind='bar', ax=axes[0, 2], title='Marital Status of Client')
df['education'].value_counts().plot(kind='bar', ax=axes[1, 0], title='Education of Client')
df['default'].value_counts().plot(kind='bar', ax=axes[1, 1], title='Client Credit in Default?')
df['housing'].value_counts().plot(kind='bar', ax=axes[1, 2], title='Client have Housing Loan?')
df['loan'].value_counts().plot(kind='bar', ax=axes[2, 0], title='Client have Personal Loan?')

fig.delaxes(axes[2, 1]) # position [2, 1]
fig.delaxes(axes[2, 2]) # position [2, 2]

plt.tight_layout()
png

Data Visualisation: Unique Last Contact of Current Campaign

day_order = ['mon','tue','wed','thu','fri']
month_order = ['mar','apr','may','jun','jul','aug','sep','oct','nov','dec']


fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(18, 12))

df['contact'].value_counts().plot(kind='bar', ax=axes[0, 0], title='Campaign Communication type with Client')
sb.countplot(x ='month', data=df, order=month_order, ax=axes[0, 1]).set(title='Month Client was Last Contacted')
sb.countplot(x = 'day_of_week', data=df, order=day_order, ax=axes[1, 0]).set(title = 'Day of the Week Client was Last Contacted')
df['duration'].plot(kind='hist', ax=axes[1, 1], title='Duration of Clients last Contact (in seconds)', bins=1544)

plt.tight_layout()
png

Data Visualisation: Other Attributes

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(18, 12))

df['campaign'].plot(kind='hist', ax=axes[0, 0], title='No. of Contacts performed per Client for this Campaign', bins=42)
df['pdays'].plot(kind='hist', ax=axes[0, 1], title='No. of Days since Client was Last Contacted for Previous Campaign', bins=27)
df['previous'].plot(kind='hist', ax=axes[1, 0], title='No. of Contacts performed per Client before this Campaign', bins=8)
df['poutcome'].value_counts().plot(kind='bar', ax=axes[1, 1], title='Outcome of Previous Marketing Campaign')

plt.tight_layout()
png

Data Visualisation: Unique Social and Economic Context Attributes

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 12))

df['emp.var.rate'].plot(kind='hist', ax=axes[0, 0], title='Employment Variation Rate (Quarterly Indicator)', bins=20)
df['cons.price.idx'].plot(kind='hist', ax=axes[0, 1], title='Consumer Price Index (Monthly Indicator)')
df['cons.conf.idx'].plot(kind='hist', ax=axes[0, 2], title='Consumer Confidence Index (Monthly Indicator)')
df['euribor3m'].plot(kind='hist', ax=axes[1, 0], title='Euribor 3 Month Rate (Daily Indicator)')
df['nr.employed'].plot(kind='hist', ax=axes[1, 1], title='Number of Employees (Quarterly Indicator)')

fig.delaxes(axes[1, 2])

plt.tight_layout()
png

Data Visualisation: Unique Target Variable

df['y'].value_counts().plot(kind='bar', title='Has Client Subscribed a Term Deposit?')
<Axes: title={'center': 'Has Client Subscribed a Term Deposit?'}>
png

D). Duplicate Entities

df.duplicated().sum()
12
df[df.duplicated(keep=False)]
png

There are 12 duplicate entries to be removed.

E). Outliers

for col in df:
    print(f'*{col}*')
    print(df[col].value_counts())
    print('\n')
*age*
31 1947
32 1846
33 1833
36 1780
35 1759
...
89 2
91 2
94 1
87 1
95 1
Name: age, Length: 78, dtype: int64


*job*
admin. 10422
blue-collar 9254
technician 6743
services 3969
management 2924
retired 1720
entrepreneur 1456
self-employed 1421
housemaid 1060
unemployed 1014
student 875
unknown 330
Name: job, dtype: int64


*marital*
married 24928
single 11568
divorced 4612
unknown 80
Name: marital, dtype: int64


*education*
university.degree 12168
high.school 9515
basic.9y 6045
professional.course 5243
basic.4y 4176
basic.6y 2292
unknown 1731
illiterate 18
Name: education, dtype: int64


*default*
no 32588
unknown 8597
yes 3
Name: default, dtype: int64


*housing*
yes 21576
no 18622
unknown 990
Name: housing, dtype: int64


*loan*
no 33950
yes 6248
unknown 990
Name: loan, dtype: int64


*contact*
cellular 26144
telephone 15044
Name: contact, dtype: int64


*month*
may 13769
jul 7174
aug 6178
jun 5318
nov 4101
apr 2632
oct 718
sep 570
mar 546
dec 182
Name: month, dtype: int64


*day_of_week*
thu 8623
mon 8514
wed 8134
tue 8090
fri 7827
Name: day_of_week, dtype: int64


*duration*
90 170
85 170
136 168
73 167
124 164
...
1569 1
1053 1
1263 1
1169 1
1868 1
Name: duration, Length: 1544, dtype: int64


*campaign*
1 17642
2 10570
3 5341
4 2651
5 1599
6 979
7 629
8 400
9 283
10 225
11 177
12 125
13 92
14 69
17 58
16 51
15 51
18 33
20 30
19 26
21 24
22 17
23 16
24 15
27 11
29 10
28 8
26 8
25 8
31 7
30 7
35 5
32 4
33 4
34 3
42 2
40 2
43 2
56 1
39 1
41 1
37 1
Name: campaign, dtype: int64


*pdays*
999 39673
3 439
6 412
4 118
9 64
2 61
7 60
12 58
10 52
5 46
13 36
11 28
1 26
15 24
14 20
8 18
0 15
16 11
17 8
18 7
22 3
19 3
21 2
25 1
26 1
27 1
20 1
Name: pdays, dtype: int64


*previous*
0 35563
1 4561
2 754
3 216
4 70
5 18
6 5
7 1
Name: previous, dtype: int64


*poutcome*
nonexistent 35563
failure 4252
success 1373
Name: poutcome, dtype: int64


*emp.var.rate*
1.4 16234
-1.8 9184
1.1 7763
-0.1 3683
-2.9 1663
-3.4 1071
-1.7 773
-1.1 635
-3.0 172
-0.2 10
Name: emp.var.rate, dtype: int64


*cons.price.idx*
93.994 7763
93.918 6685
92.893 5794
93.444 5175
94.465 4374
93.200 3616
93.075 2458
92.201 770
92.963 715
92.431 447
92.649 357
94.215 311
94.199 303
92.843 282
92.379 267
93.369 264
94.027 233
94.055 229
93.876 212
94.601 204
92.469 178
93.749 174
92.713 172
94.767 128
93.798 67
92.756 10
Name: cons.price.idx, dtype: int64


*cons.conf.idx*
-36.4 7763
-42.7 6685
-46.2 5794
-36.1 5175
-41.8 4374
-42.0 3616
-47.1 2458
-31.4 770
-40.8 715
-26.9 447
-30.1 357
-40.3 311
-37.5 303
-50.0 282
-29.8 267
-34.8 264
-38.3 233
-39.8 229
-40.0 212
-49.5 204
-33.6 178
-34.6 174
-33.0 172
-50.8 128
-40.4 67
-45.9 10
Name: cons.conf.idx, dtype: int64


*euribor3m*
4.857 2868
4.962 2613
4.963 2487
4.961 1902
4.856 1210
...
3.853 1
3.901 1
0.969 1
0.956 1
3.669 1
Name: euribor3m, Length: 316, dtype: int64


*nr.employed*
5228.1 16234
5099.1 8534
5191.0 7763
5195.8 3683
5076.2 1663
5017.5 1071
4991.6 773
5008.7 650
4963.6 635
5023.5 172
5176.3 10
Name: nr.employed, dtype: int64


*y*
no 36548
yes 4640
Name: y, dtype: int64

Data Visualisation: Outliers

fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(18, 12))

sb.boxplot(x='age', data=df, ax=axes[0, 0])
sb.boxplot(x='duration', data=df, ax=axes[0,1])
sb.boxplot(x='campaign', data=df, ax=axes[0,2])
sb.boxplot(x='pdays', data=df, ax=axes[1,0])
sb.boxplot(x='previous', data=df, ax=axes[1,1])
sb.boxplot(x='emp.var.rate', data=df, ax=axes[1,2])
sb.boxplot(x='cons.price.idx', data=df, ax=axes[2,0])
sb.boxplot(x='cons.conf.idx', data=df, ax=axes[2,1])
sb.boxplot(x='euribor3m', data=df, ax=axes[2,2])
sb.boxplot(x='nr.employed', data=df, ax=axes[3,0])

fig.delaxes(axes[3, 1])
fig.delaxes(axes[3, 2])

plt.tight_layout()
png

Significant outliers exist in the ‘age’ column, which will need to be removed.

Outliers also exist in ‘duration’; however, that column will be removed entirely before classification (see Step 2).

F). Correlations

corr_matrix_df=df.corr()
corr_matrix_df
png
fig_df = plt.subplots(figsize=(17,14))
sb.heatmap(corr_matrix_df, annot=True)
<Axes: >
png

Here we find a number of strongly correlated variables.

emp.var.rate: employment variation rate is positively correlated to nr.employed: number employed (0.91).

euribor3m: euribor 3 month rate* is positively correlated to nr.employed: number employed (0.95) and emp.var.rate: employment variation rate (0.97).

nr.employed: number employed is positively correlated to emp.var.rate: employment variation rate (0.91) and the euribor3m: euribor 3 month rate (0.95).

*the Euribor rate is the interest rate at which a selection of European banks lend to one another.

Data Insights From EDA:

  • There are three major data types. Two are numeric, consisting of integer (discrete) and float types. The third is object, which indicates the presence of categorical data.
  • The dataframe consists of 41,188 rows and 21 columns (excluding the header).
  • The numeric data are generally not normally distributed, with most variables displaying significant skewness (the mean sitting well above or below the 50th percentile).
  • A summary report of the categorical data depicts a profile of the bank’s clients, as some characteristics were more frequently recorded:
  • The bank’s clients were most frequently recorded as being employed in administration (‘admin.’), married, and holding a university degree (‘university.degree’).
  • The bank’s clients have largely not defaulted on their loans, nor taken out a personal loan; however, a high proportion of clients hold a housing loan.
  • The summary of the categorical data also suggests that clients were most frequently contacted in May, and on a Thursday, with the vast majority of clients not subscribing to a term deposit.
  • There do not appear to be any null values; however, there are some ‘unknown’ values registered among the categorical data.
  • A look at the unique values present illustrates the limitations of the dataset’s collection:
  • Clients were only contacted by either landline telephone or cellular (mobile) phone. There was no option for clients to respond by email or survey in their own time.
  • Data collection only occurred between March and December; no data was collected in January or February.
  • Clients were also only contacted on working days (Monday to Friday), not at the weekend. This might have reduced the number of potential responses, as clients may have been too busy to answer the call while working.
  • There are 12 duplicated rows in the dataset (24 records when both copies are shown), and the redundant copy of each can be removed. This might be indicative of data being registered twice, or of a client being contacted twice.
  • Outliers exist in the numeric data, particularly in the variables ‘age’ and ‘previous’. Here outliers are defined as values that lie beyond 3 standard deviations from the mean.
  • There are especially strong correlations in the dataset:
  • emp.var.rate: employment variation rate is positively correlated with nr.employed: number employed (0.91).
  • euribor3m: euribor 3 month rate* is positively correlated with nr.employed: number employed (0.95) and emp.var.rate: employment variation rate (0.97).
  • nr.employed: number employed is positively correlated with emp.var.rate: employment variation rate (0.91) and the euribor3m: euribor 3 month rate (0.95).

*the Euribor rate is the interest rate at which a selection of European banks lend to one another.

Step 2 — Data Transformation

A). Remove Null (NaN), Duplicate and Outlier entities in Dataframe

i) Removal of Null Entities

df.isnull().any()
age               False
job False
marital False
education False
default False
housing False
loan False
contact False
month False
day_of_week False
duration False
campaign False
pdays False
previous False
poutcome False
emp.var.rate False
cons.price.idx False
cons.conf.idx False
euribor3m False
nr.employed False
y False
dtype: bool

There are no null (NaN) entries in the dataframe; however, there are ‘unknown’ categorical values.

ii) Removal of Duplicated Entities

# Removal of Duplicated entities. We keep the first duplicate row.
df=df.drop_duplicates(keep='first')
# The number of rows has decreased by 12.
df.shape
(41176, 21)

iii) Removal of Outlier Entities

# Remove outliers for 'age' of 3 standard deviations from the mean.
standard_deviations = 3
df=df[((df['age'] - df['age'].mean()) / df['age'].std()).abs() < standard_deviations]

df.shape
(40807, 21)

iv) Removal of Unnecessary Column, ‘duration’.

As noted by the creator of the dataset, ‘duration’ highly affects the output target (e.g., if duration = 0 then y = ’no’), yet the duration of a call is not known before the call is made. The column ‘duration’ is therefore dropped from the dataframe, as keeping it would give the classification models information that would not be available when making a real prediction.

df = df.drop('duration', axis=1)
df.shape
(40807, 20)

B). Recode Categorical data to Numeric values.

Categorical data is recoded to numeric values so that the classification algorithms can process the data correctly.

df.dtypes
age                 int64
job object
marital object
education object
default object
housing object
loan object
contact object
month object
day_of_week object
campaign int64
pdays int64
previous int64
poutcome object
emp.var.rate float64
cons.price.idx float64
cons.conf.idx float64
euribor3m float64
nr.employed float64
y object
dtype: object
cat_subset = df.select_dtypes(include=['object']).copy()
print(cat_subset.shape)
cat_subset.head(10)
(40807, 11)
png
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(18, 12))

sb.countplot(x='job',data=cat_subset, palette='hls', ax=axes[0, 0])
sb.countplot(x='marital',data=cat_subset, palette='hls', ax=axes[0, 1])
sb.countplot(x='education',data=cat_subset, palette='hls', ax=axes[0, 2])
sb.countplot(x='default',data=cat_subset, palette='hls', ax=axes[1, 0])
sb.countplot(x='housing',data=cat_subset, palette='hls', ax=axes[1, 1])
sb.countplot(x='loan',data=cat_subset, palette='hls', ax=axes[1, 2])
sb.countplot(x='contact',data=cat_subset, palette='hls', ax=axes[2, 0])
sb.countplot(x='month',data=cat_subset, palette='hls', ax=axes[2, 1], order=month_order)
sb.countplot(x='day_of_week',data=cat_subset, palette='hls', ax=axes[2, 2])
sb.countplot(x='poutcome',data=cat_subset, palette='hls', ax=axes[3, 0])
sb.countplot(x='y',data=cat_subset, palette='hls', ax=axes[3, 1])

fig.delaxes(axes[3, 2])

plt.tight_layout()
png
# Create 'category' variable in cat_subset 
cat_subset["job_Category"] = cat_subset["job"].astype('category')
cat_subset["marital_Category"] = cat_subset["marital"].astype('category')
cat_subset["education_Category"] = cat_subset["education"].astype('category')
cat_subset["default_Category"] = cat_subset["default"].astype('category')
cat_subset["housing_Category"] = cat_subset["housing"].astype('category')
cat_subset["loan_Category"] = cat_subset["loan"].astype('category')
cat_subset["contact_Category"] = cat_subset["contact"].astype('category')
cat_subset["month_Category"] = cat_subset["month"].astype('category')
cat_subset["day_of_week_Category"] = cat_subset["day_of_week"].astype('category')
cat_subset["poutcome_Category"] = cat_subset["poutcome"].astype('category')
cat_subset["y_Category"] = cat_subset["y"].astype('category')
cat_subset.dtypes
job object
marital object
education object
default object
housing object
loan object
contact object
month object
day_of_week object
poutcome object
y object
job_Category category
marital_Category category
education_Category category
default_Category category
housing_Category category
loan_Category category
contact_Category category
month_Category category
day_of_week_Category category
poutcome_Category category
y_Category category
dtype: object
# Convert category values to numeric values
cat_subset["job_Category"] = cat_subset["job_Category"].cat.codes
cat_subset["marital_Category"] = cat_subset["marital_Category"].cat.codes
cat_subset["education_Category"] = cat_subset["education_Category"].cat.codes
cat_subset["default_Category"] = cat_subset["default_Category"].cat.codes
cat_subset["housing_Category"] = cat_subset["housing_Category"].cat.codes
cat_subset["loan_Category"] = cat_subset["loan_Category"].cat.codes
cat_subset["contact_Category"] = cat_subset["contact_Category"].cat.codes
cat_subset["month_Category"] = cat_subset["month_Category"].cat.codes
cat_subset["day_of_week_Category"] = cat_subset["day_of_week_Category"].cat.codes
cat_subset["poutcome_Category"] = cat_subset["poutcome_Category"].cat.codes
cat_subset["y_Category"] = cat_subset["y_Category"].cat.codes

cat_subset.head(10)
png
# Save cat_subset to csv file
cat_subset.to_csv('cat_subset.csv', sep=',')
# Map of Categorical Recoding
cat_recode_map_file = open("cat_recode_map.txt", "w")
for col in cat_subset:
    line = f'*{col}* \n {np.sort(cat_subset[col].unique())} \n\n'
    cat_recode_map_file.writelines(line)
cat_recode_map_file.close()

cat_recode_map_file = open("cat_recode_map.txt", "r")
for line in cat_recode_map_file:
    print(line)
*job*

['admin.' 'blue-collar' 'entrepreneur' 'housemaid' 'management' 'retired'

'self-employed' 'services' 'student' 'technician' 'unemployed' 'unknown']



*marital*

['divorced' 'married' 'single' 'unknown']



*education*

['basic.4y' 'basic.6y' 'basic.9y' 'high.school' 'illiterate'

'professional.course' 'university.degree' 'unknown']



*default*

['no' 'unknown' 'yes']



*housing*

['no' 'unknown' 'yes']



*loan*

['no' 'unknown' 'yes']



*contact*

['cellular' 'telephone']



*month*

['apr' 'aug' 'dec' 'jul' 'jun' 'mar' 'may' 'nov' 'oct' 'sep']



*day_of_week*

['fri' 'mon' 'thu' 'tue' 'wed']



*poutcome*

['failure' 'nonexistent' 'success']



*y*

['no' 'yes']



*job_Category*

[ 0 1 2 3 4 5 6 7 8 9 10 11]



*marital_Category*

[0 1 2 3]



*education_Category*

[0 1 2 3 4 5 6 7]



*default_Category*

[0 1 2]



*housing_Category*

[0 1 2]



*loan_Category*

[0 1 2]



*contact_Category*

[0 1]



*month_Category*

[0 1 2 3 4 5 6 7 8 9]



*day_of_week_Category*

[0 1 2 3 4]



*poutcome_Category*

[0 1 2]



*y_Category*

[0 1]
# Drop object variables from cat_subset
cat_subset = cat_subset.drop('job', axis=1)
cat_subset = cat_subset.drop('marital', axis=1)
cat_subset = cat_subset.drop('education', axis=1)
cat_subset = cat_subset.drop('default', axis=1)
cat_subset = cat_subset.drop('housing', axis=1)
cat_subset = cat_subset.drop('loan', axis=1)
cat_subset = cat_subset.drop('contact', axis=1)
cat_subset = cat_subset.drop('month', axis=1)
cat_subset = cat_subset.drop('day_of_week', axis=1)
cat_subset = cat_subset.drop('poutcome', axis=1)
cat_subset = cat_subset.drop('y', axis=1)
cat_subset.head(10)
png

C). Scale Numeric Data

Numeric data is scaled so that features measured on larger ranges do not dominate those measured on smaller ranges, which helps the classification algorithms (particularly distance-based ones such as k Nearest Neighbour) process the data correctly.
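
MinMaxScaler applies x' = (x − min) / (max − min) to each column, mapping every feature onto the range [0, 1]. A minimal sketch of that formula on a small, hypothetical column of ages (not the full dataset):

import numpy as np

age = np.array([17, 25, 40, 60, 88])                       # hypothetical age values
age_scaled = (age - age.min()) / (age.max() - age.min())   # x' = (x - min) / (max - min)
print(age_scaled)                                          # approx. [0.  0.113  0.324  0.606  1.]

Note that min-max scaling only rescales the range; it does not change the shape of each distribution, so the histograms below look the same as before scaling, just on a 0 to 1 axis.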

num_subset = df.select_dtypes(include=['float64','int64']).copy()
print(num_subset.shape)
num_subset.head(10)
(40807, 9)
png
num_subset.dtypes
age                 int64
campaign int64
pdays int64
previous int64
emp.var.rate float64
cons.price.idx float64
cons.conf.idx float64
euribor3m float64
nr.employed float64
dtype: object
scaler = MinMaxScaler()

#Apply the scaler to the numerical data
# and save the data back to the dataframe
# overwriting the original values
num_subset[['age','campaign','pdays','previous','emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m','nr.employed']] = scaler.fit_transform(num_subset[['age','campaign','pdays','previous','emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m','nr.employed']])
num_subset.head(10)
png
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(18, 12))

sb.histplot(x='age', data=num_subset, ax=axes[0,0])
sb.histplot(x='campaign', data=num_subset, ax=axes[0,1])
sb.histplot(x='pdays', data=num_subset, ax=axes[0,2])
sb.histplot(x='previous', data=num_subset, ax=axes[1,0])
sb.histplot(x='emp.var.rate', data=num_subset, ax=axes[1,1])
sb.histplot(x='cons.price.idx', data=num_subset, ax=axes[1,2])
sb.histplot(x='euribor3m', data=num_subset, ax=axes[2,0])
sb.histplot(x='nr.employed', data=num_subset, ax=axes[2,1])
sb.histplot(x='cons.conf.idx', data=num_subset, ax=axes[2,2])

plt.tight_layout()
png
# Save num_subset to csv file
num_subset.to_csv('num_subset.csv', sep=',')
# Concatenate cat_subset with num_subset
df_merged = pd.concat([cat_subset, num_subset], axis=1)
df_merged.head(10)
png
# Write df_merged to csv file
df_merged.to_csv('df_merged.csv', sep=',')

D). Removal of Correlated variables

As euribor3m carries the strongest correlations with the other two variables (0.95 with nr.employed and 0.97 with emp.var.rate), it is perhaps best to retain euribor3m and drop emp.var.rate and nr.employed.

# Drop emp.var.rate and nr.employed
df_merged = df_merged.drop('emp.var.rate', axis=1)
df_merged = df_merged.drop('nr.employed', axis=1)
df_merged.head(10)
png

E). Sample Data

# Use SMOTE to over-sample the minority class and remove the imbalance in the 'y' variable.
# The resampled data is split further below into 70% training and 30% testing dataframes.
print(df_merged['y_Category'].value_counts())
X = df_merged.drop('y_Category', axis=1)
Y = df_merged['y_Category']

sm = SMOTE(random_state=42)
X_res, Y_res = sm.fit_resample(X, Y)

df_smote_over = pd.concat([pd.DataFrame(X_res), pd.DataFrame(Y_res, columns=['y_Category'])], axis=1)

print('SMOTE over-sampling:')
print(df_smote_over['y_Category'].value_counts())

df_smote_over['y_Category'].value_counts().plot(kind='bar', title='Count (target)');
0 36349
1 4458
Name: y_Category, dtype: int64
SMOTE over-sampling:
0 36349
1 36349
Name: y_Category, dtype: int64
png
# Move y_Category column to the end of the DataFrame
s = df_smote_over.pop('y_Category')
ordered_df_smote_over = pd.concat([df_smote_over, s], axis=1)
ordered_df_smote_over.head()
png
ordered_df_smote_over.head(10)
ordered_df_smote_over.to_csv('ordered_df_smote_over.csv', sep=',')

The data was over-sampled using SMOTE to remove the imbalance within the target variable (y_Category).
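
For reference, SMOTE creates each synthetic minority-class row by interpolating between an existing minority example and one of its k nearest minority-class neighbours. A minimal sketch of that interpolation step (hypothetical numbers, not imblearn's exact implementation):

import numpy as np

rng = np.random.default_rng(42)

x = np.array([0.30, 0.10, 0.55])           # an existing minority-class row (hypothetical)
neighbour = np.array([0.40, 0.05, 0.60])   # one of its nearest minority-class neighbours

lam = rng.random()                         # random interpolation factor in [0, 1)
synthetic = x + lam * (neighbour - x)      # the synthetic sample lies on the line between the two
print(synthetic)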

# Here the data is split into training (70%) and testing (30%) data sets.
training_data, testing_data = train_test_split(ordered_df_smote_over, test_size=0.3, random_state=25)
training_data.head()
png
testing_data.head()
png
print(f'Size of ordered_df_smote_over {ordered_df_smote_over.shape}')
print(f'Size of training_data {training_data.shape}')
print(f'Size of testing_data {testing_data.shape}')
Size of ordered_df_smote_over (72698, 18)
Size of training_data (50888, 18)
Size of testing_data (21810, 18)
training_data.to_csv('training_data.csv', sep=',')
testing_data.to_csv('testing_data.csv', sep=',')

The over-sampled dataframe was split into a training dataset (training_data) containing 70% of the data and a test dataset (testing_data) containing the remaining 30%. Both are then saved to CSV files.

Step 3 — Modelling the Data

# Separate dependent variable (y_Category) from independent variables in training_data and testing_data.
X_train = training_data.iloc[:, 0:-1]
y_train = training_data.iloc[:, -1]
X_test = testing_data.iloc[:, 0:-1]
y_test = testing_data.iloc[:, -1]
X_train.head()
png
y_train.head()
26865    0
54182 1
61479 1
32949 0
64114 1
Name: y_Category, dtype: int8

A). Application of Models

Seven baseline classification models were applied to the training data. The purpose of these models is to predict whether the bank’s clients are more or less likely to subscribe to a term deposit (no = 0, yes = 1).

These seven classification models include:

  • Naive Bayes
  • Decision Trees
  • Neural Network
  • K Nearest Neighbour
  • HistGradientBooster
  • Support Vector Machine
  • Random Forests

i). Naive Bayes

Naive Bayes classification calculates the conditional probability of class A given the observed events (features) B1, …, Bn, assuming the features are conditionally independent given the class. In its basic form this is simple multiplication and division using the formula below:

P(A | B1, …, Bn) = P(A, B1, …, Bn) / P(B1, …, Bn)
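
As a minimal sketch of the idea behind the Gaussian variant used below (with hypothetical class statistics, not values estimated from the campaign data), the posterior for each class is proportional to the class prior multiplied by a Gaussian likelihood for every feature:

import numpy as np

def gaussian_pdf(x, mean, std):
    # Gaussian likelihood of each feature value under the class's mean and standard deviation.
    return np.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * np.sqrt(2 * np.pi))

# Hypothetical per-class statistics for two scaled features (purely illustrative).
priors = {'no': 0.88, 'yes': 0.12}
means  = {'no': np.array([0.45, 0.08]), 'yes': np.array([0.40, 0.05])}
stds   = {'no': np.array([0.15, 0.06]), 'yes': np.array([0.18, 0.04])}

x = np.array([0.35, 0.03])   # one hypothetical client

# P(class | x) is proportional to P(class) * product over features of P(x_i | class).
scores = {c: priors[c] * np.prod(gaussian_pdf(x, means[c], stds[c])) for c in priors}
total = sum(scores.values())
print({c: s / total for c, s in scores.items()})   # normalised posteriors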

# Assign values to Bayes variables to allow for reuse.
bayes_X_test = X_test
bayes_y_test = y_test
bayes_X_train = X_train
bayes_y_train = y_train
#Define the model
gnb = GaussianNB() # assigned to variable.
#Train the model
gnb.fit(bayes_X_train, bayes_y_train) # apply the training sets to the model using '.fit(X,y)'
GaussianNB()

bayes_y_pred = gnb.predict(bayes_X_test)
bayes_accuracy = metrics.accuracy_score(bayes_y_test, bayes_y_pred)*100
bayes_accuracy
72.51719394773039
print(f"Gaussian Naive Bayes model accuracy(in %): {bayes_accuracy}")
Gaussian Naive Bayes model accuracy(in %): 72.51719394773039
# Crosstab columns to display a basic confusion matrix
pd.crosstab(bayes_y_pred, bayes_y_test)
png
bayes_Y_predict_prob = gnb.predict_proba(bayes_X_test)[:,1]
bayes_Y_predict_prob
array([7.15379996e-01, 2.01597225e-07, 5.11831612e-01, ...,
1.22447302e-08, 7.30006819e-01, 2.71216279e-03])
# Print Classifier report
bayes_report = classification_report(bayes_y_test, bayes_y_pred)
print(bayes_report)
precision recall f1-score support

0 0.71 0.78 0.74 11021
1 0.75 0.67 0.71 10789

accuracy 0.73 21810
macro avg 0.73 0.72 0.72 21810
weighted avg 0.73 0.73 0.72 21810
# passing actual and predicted values
bayes_cm = confusion_matrix(bayes_y_test, bayes_y_pred)

# write the data values into each cell of the matrix
sb.heatmap(bayes_cm, annot=True)
plt.title('Confusion Matrix', fontsize = 15)
plt.xlabel('Predicted', fontsize = 13)
plt.ylabel('Actuals', fontsize = 13)

plt.show()
png

ii). Decision Trees

Decision Trees work by splitting the data through a series of binary decisions (true or false), starting from a root node and passing through decision nodes until a predicted classification is reached at a leaf node.
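
As an illustration only, a fitted tree amounts to a nested set of true/false tests; the thresholds below are hypothetical, not the ones the model learns later in this section:

def predict_subscription(client):
    """Hand-written traversal of a small, hypothetical decision tree (0 = 'no', 1 = 'yes')."""
    if client['euribor3m'] <= 0.77:            # root node: a binary test on one feature
        if client['cons.conf.idx'] <= 0.19:    # decision node
            return 1                           # leaf node: predicted class
        return 0
    else:
        if client['campaign'] <= 0.02:         # decision node on the other branch
            return 1
        return 0

print(predict_subscription({'euribor3m': 0.50, 'cons.conf.idx': 0.10, 'campaign': 0.00}))   # -> 1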

# Assign values to Decision Tree variables to allow for reuse.
tree_X_test = X_test
tree_y_test = y_test
tree_X_train = X_train
tree_y_train = y_train
# fit the decision tree model, using default parameters
dt = DecisionTreeClassifier(max_depth = 3, random_state = 45)
dt.fit(tree_X_train,tree_y_train)
DecisionTreeClassifier(max_depth=3, random_state=45)

# plotting the decision tree
tree.plot_tree(dt)
[Text(0.5, 0.875, 'x[16] <= 0.768\ngini = 0.5\nsamples = 50888\nvalue = [25328, 25560]'),
Text(0.25, 0.625, 'x[15] <= 0.192\ngini = 0.393\nsamples = 25972\nvalue = [6967, 19005]'),
Text(0.125, 0.375, 'x[15] <= 0.192\ngini = 0.498\nsamples = 11205\nvalue = [5269, 5936]'),
Text(0.0625, 0.125, 'gini = 0.43\nsamples = 5105\nvalue = [1597, 3508]'),
Text(0.1875, 0.125, 'gini = 0.479\nsamples = 6100\nvalue = [3672, 2428]'),
Text(0.375, 0.375, 'x[6] <= 0.5\ngini = 0.204\nsamples = 14767\nvalue = [1698, 13069]'),
Text(0.3125, 0.125, 'gini = 0.178\nsamples = 13986\nvalue = [1380, 12606]'),
Text(0.4375, 0.125, 'gini = 0.483\nsamples = 781\nvalue = [318, 463]'),
Text(0.75, 0.625, 'x[11] <= 0.0\ngini = 0.388\nsamples = 24916\nvalue = [18361, 6555]'),
Text(0.625, 0.375, 'x[7] <= 7.5\ngini = 0.254\nsamples = 8552\nvalue = [7273, 1279]'),
Text(0.5625, 0.125, 'gini = 0.242\nsamples = 8444\nvalue = [7255, 1189]'),
Text(0.6875, 0.125, 'gini = 0.278\nsamples = 108\nvalue = [18, 90]'),
Text(0.875, 0.375, 'x[11] <= 0.018\ngini = 0.437\nsamples = 16364\nvalue = [11088, 5276]'),
Text(0.8125, 0.125, 'gini = 0.0\nsamples = 1586\nvalue = [0, 1586]'),
Text(0.9375, 0.125, 'gini = 0.375\nsamples = 14778\nvalue = [11088, 3690]')]
png
# the plot above is hard to read, so re-plot it with feature names, class names and proportions
target_names=['0','1']
feature_names=['job_Category', 'marital_Category', 'education_Category', 'default_Category', 'housing_Category', 'loan_Category', 'contact_Category',
'month_Category', 'day_of_week_Category', 'poutcome_Category', 'age', 'campaign', 'pdays', 'previous', 'cons.price.idx', 'cons.conf.idx',
'euribor3m']
fig = plt.figure(figsize=(20,18))
tree.plot_tree(dt, filled=True, proportion=True, fontsize=10, feature_names=feature_names, class_names=target_names)
[Text(0.5, 0.875, 'euribor3m <= 0.768\ngini = 0.5\nsamples = 100.0%\nvalue = [0.498, 0.502]\nclass = 1'),
Text(0.25, 0.625, 'cons.conf.idx <= 0.192\ngini = 0.393\nsamples = 51.0%\nvalue = [0.268, 0.732]\nclass = 1'),
Text(0.125, 0.375, 'cons.conf.idx <= 0.192\ngini = 0.498\nsamples = 22.0%\nvalue = [0.47, 0.53]\nclass = 1'),
Text(0.0625, 0.125, 'gini = 0.43\nsamples = 10.0%\nvalue = [0.313, 0.687]\nclass = 1'),
Text(0.1875, 0.125, 'gini = 0.479\nsamples = 12.0%\nvalue = [0.602, 0.398]\nclass = 0'),
Text(0.375, 0.375, 'contact_Category <= 0.5\ngini = 0.204\nsamples = 29.0%\nvalue = [0.115, 0.885]\nclass = 1'),
Text(0.3125, 0.125, 'gini = 0.178\nsamples = 27.5%\nvalue = [0.099, 0.901]\nclass = 1'),
Text(0.4375, 0.125, 'gini = 0.483\nsamples = 1.5%\nvalue = [0.407, 0.593]\nclass = 1'),
Text(0.75, 0.625, 'campaign <= 0.0\ngini = 0.388\nsamples = 49.0%\nvalue = [0.737, 0.263]\nclass = 0'),
Text(0.625, 0.375, 'month_Category <= 7.5\ngini = 0.254\nsamples = 16.8%\nvalue = [0.85, 0.15]\nclass = 0'),
Text(0.5625, 0.125, 'gini = 0.242\nsamples = 16.6%\nvalue = [0.859, 0.141]\nclass = 0'),
Text(0.6875, 0.125, 'gini = 0.278\nsamples = 0.2%\nvalue = [0.167, 0.833]\nclass = 1'),
Text(0.875, 0.375, 'campaign <= 0.018\ngini = 0.437\nsamples = 32.2%\nvalue = [0.678, 0.322]\nclass = 0'),
Text(0.8125, 0.125, 'gini = 0.0\nsamples = 3.1%\nvalue = [0.0, 1.0]\nclass = 1'),
Text(0.9375, 0.125, 'gini = 0.375\nsamples = 29.0%\nvalue = [0.75, 0.25]\nclass = 0')]
png
# Check versus test data
tree_y_pred = dt.predict(tree_X_test)

# Print Classifier report
tree_report = classification_report(tree_y_test, tree_y_pred)
print(tree_report)
precision recall f1-score support

0 0.76 0.87 0.81 11021
1 0.84 0.72 0.77 10789

accuracy 0.79 21810
macro avg 0.80 0.79 0.79 21810
weighted avg 0.80 0.79 0.79 21810
tree_accuracy = metrics.accuracy_score(tree_y_test, tree_y_pred)*100
tree_accuracy
79.26639156350298
tree_Y_predict_prob = dt.predict_proba(tree_X_test)[:,1]
tree_Y_predict_prob
array([0.9013299 , 0.14081004, 0.39803279, ..., 0.24969549, 0.39803279,
0.14081004])

iii). Neural Networks

Neural Networks consist of layers of interconnected neurons. Each neuron computes a weighted sum of its inputs (the weights are initialised randomly and adjusted during training), applies an activation function, and passes the result on to the next layer of neurons. The output of the final layer gives the prediction.
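
A minimal sketch of a single forward pass through one hidden layer, using random, untrained weights purely for illustration (during training, MLPClassifier adjusts these weights by backpropagation rather than leaving them random):

import numpy as np

rng = np.random.default_rng(0)

x  = rng.random(17)              # one client's 17 scaled input features (hypothetical values)
W1 = rng.normal(size=(17, 8))    # input -> hidden layer weights (randomly initialised)
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1))     # hidden -> output layer weights
b2 = np.zeros(1)

hidden   = np.maximum(0, x @ W1 + b1)    # weighted sums passed through a ReLU activation
logit    = hidden @ W2 + b2              # output neuron's weighted sum
prob_yes = 1 / (1 + np.exp(-logit))      # sigmoid turns the output into P(y = 1)
print(prob_yes)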

nn_X_test = X_test
nn_y_test = y_test
nn_X_train = X_train
nn_y_train = y_train
# default max_iter=200
nn_model = nn.MLPClassifier()
#fit the model
nn_model.fit(nn_X_train,nn_y_train)
/usr/local/lib/python3.9/dist-packages/sklearn/neural_network/_multilayer_perceptron.py:686: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
warnings.warn(
MLPClassifier()

# making predictions
nn_y_pred = nn_model.predict(X_test)

# Printing classifier report after prediction
print(classification_report(nn_y_test,nn_y_pred))
precision recall f1-score support

0 0.80 0.82 0.81 11021
1 0.81 0.79 0.80 10789

accuracy 0.81 21810
macro avg 0.81 0.81 0.81 21810
weighted avg 0.81 0.81 0.81 21810
nn_accuracy = metrics.accuracy_score(nn_y_test, nn_y_pred)*100
nn_accuracy
80.70609812012837
nn_Y_predict_prob = nn_model.predict_proba(nn_X_test)[:,1]
nn_Y_predict_prob
array([0.88516379, 0.1424067 , 0.30154708, ..., 0.02698015, 0.69696304,
0.37357176])

iv). K Nearest Neighbour

k Nearest Neighbour is a classification model that predicts the class of a new observation from the classes of its nearest neighbours in the training data, with ‘k’ being the number of neighbours considered in the majority vote.
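
A minimal sketch of the neighbour vote on a tiny, hypothetical two-feature dataset (the scaling performed in Step 2 matters here, because the vote is based on distances):

import numpy as np
from collections import Counter

def knn_predict(x_new, X_train, y_train, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbours."""
    distances = np.linalg.norm(X_train - x_new, axis=1)    # Euclidean distance to every training row
    nearest = np.argsort(distances)[:k]                    # indices of the k closest rows
    return Counter(y_train[nearest]).most_common(1)[0][0]  # the most common label among them

X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])   # hypothetical training points
y = np.array([0, 0, 1, 1])
print(knn_predict(np.array([0.15, 0.15]), X, y, k=3))            # -> 0 (two of its three neighbours are class 0)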

knn_X_test = X_test
knn_y_test = y_test
knn_X_train = X_train
knn_y_train = y_train
# set k=3
knn_model = KNeighborsClassifier(n_neighbors=3)
# fit the model
knn_model.fit(knn_X_train,knn_y_train)
KNeighborsClassifier(n_neighbors=3)

# Make Predictions
knn_y_pred = knn_model.predict(knn_X_test)
# Printing classifier report
knn_report = classification_report(knn_y_test,knn_y_pred)
print(knn_report)
precision recall f1-score support

0 0.90 0.78 0.84 11021
1 0.80 0.91 0.85 10789

accuracy 0.85 21810
macro avg 0.85 0.85 0.85 21810
weighted avg 0.85 0.85 0.84 21810
knn_accuracy = metrics.accuracy_score(knn_y_test, knn_y_pred)*100
knn_accuracy
84.54837230628152
knn_Y_predict_prob = knn_model.predict_proba(knn_X_test)[:,1]
knn_Y_predict_prob
array([1. , 0. , 1. , ..., 0. , 0.66666667,
1. ])

v). Histogram-Gradient Booster

Histogram-Gradient Boosting is an ensemble learning technique which combines many weak predictors to produce a much stronger predictor. In gradient boosting, each successive weak predictor is fitted to the errors made by the ensemble built before it; the ‘histogram’ variant also bins the input features, which makes training much faster on large datasets.
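
A minimal sketch of the boosting idea on toy data, using squared error for simplicity: each new weak tree is fitted to the residual errors of the ensemble built so far. HistGradientBoostingClassifier follows the same principle but bins the features into histograms and optimises the log loss.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(float)   # toy binary target, not the campaign data

learning_rate = 0.1
prediction = np.full_like(y, y.mean())              # start from a constant prediction
weak_learners = []

for _ in range(50):
    residual = y - prediction                        # what the current ensemble still gets wrong
    weak = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * weak.predict(X)    # each weak tree nudges the prediction towards y
    weak_learners.append(weak)

print('training accuracy of the boosted ensemble:', np.mean((prediction > 0.5) == y))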

# Assign values variables to allow for reuse
HGB_X_test = X_test
HGB_y_test = y_test
HGB_X_train = X_train
HGB_y_train = y_train
# Creating the model
HGB_model = HistGradientBoostingClassifier()
# Fitting the model
HGB_model.fit(HGB_X_train, HGB_y_train)
HistGradientBoostingClassifier()

# Making predictions on the test data
HGB_y_pred = HGB_model.predict(HGB_X_test)
# Printing classifier report
HGB_report = classification_report(HGB_y_test,HGB_y_pred)
print(HGB_report)
precision recall f1-score support

0 0.90 0.95 0.92 11021
1 0.95 0.89 0.92 10789

accuracy 0.92 21810
macro avg 0.92 0.92 0.92 21810
weighted avg 0.92 0.92 0.92 21810
HGB_accuracy = metrics.accuracy_score(HGB_y_test, HGB_y_pred)*100
HGB_accuracy
92.03576341127922
HGB_Y_predict_prob = HGB_model.predict_proba(HGB_X_test)[:,1]
HGB_Y_predict_prob
array([0.95319952, 0.03381954, 0.95841725, ..., 0.03160219, 0.99932284,
0.79383067])

vi). Support Vector Machines (SVM)

Support Vector Machines (SVMs) separate the classes with a hyperplane, chosen so that the margin between the hyperplane and the closest training points (the support vectors) is as large as possible.
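
A minimal sketch on toy, linearly separable 2-D data (not the campaign data, and using a linear kernel rather than the default RBF kernel used below): a linear SVM finds the hyperplane w·x + b = 0, and the support vectors are the training points that sit on or inside the margin.

import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # class 0 clustered near the origin
               rng.normal(2, 0.3, (20, 2))])  # class 1 shifted away
y = np.array([0] * 20 + [1] * 20)

clf = svm.SVC(kernel='linear').fit(X, y)

print('hyperplane w:', clf.coef_[0], 'b:', clf.intercept_[0])
print('support vectors per class:', clf.n_support_)
print('signed distance of [1, 1] from the hyperplane:', clf.decision_function([[1.0, 1.0]]))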

svm_X_test = X_test
svm_y_test = y_test
svm_X_train = X_train
svm_y_train = y_train
svm_model = svm.SVC(probability=True)
#fit the model
svm_model.fit(svm_X_train,svm_y_train)
SVC(probability=True)

# making predictions
svm_y_pred = svm_model.predict(svm_X_test)

# Printing classifier report after prediction
svm_report = classification_report(svm_y_test,svm_y_pred)
print(svm_report)
precision recall f1-score support

0 0.73 0.78 0.75 11021
1 0.76 0.70 0.73 10789

accuracy 0.74 21810
macro avg 0.74 0.74 0.74 21810
weighted avg 0.74 0.74 0.74 21810
svm_accuracy = metrics.accuracy_score(svm_y_test, svm_y_pred)*100
svm_accuracy
74.3145346171481
# Model Probabilities
svm_Y_predict_prob = svm_model.predict_proba(svm_X_test)[:,1]
svm_Y_predict_prob
array([0.83780026, 0.1420343 , 0.46511029, ..., 0.17896118, 0.715812 ,
0.26952522])

vii). Random Forests

Random Forests are an ensemble classification technique that builds many decision trees on random subsets of the data and features, and returns the class predicted by the majority of the trees.
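
A minimal sketch of the voting idea on toy data: each tree is trained on a bootstrap sample and the forest returns the majority vote (RandomForestClassifier additionally considers a random subset of features at each split).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # toy binary target, not the campaign data

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))    # bootstrap sample: rows drawn with replacement
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

votes = np.array([t.predict(X) for t in trees])       # one row of predictions per tree
majority = (votes.mean(axis=0) >= 0.5).astype(int)    # majority vote across the trees
print('training accuracy of the hand-rolled forest:', np.mean(majority == y))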

forest_X_test = X_test
forest_y_test = y_test
forest_X_train = X_train
forest_y_train = y_train
# Applying Random Forest model
forest_model = RandomForestClassifier(max_depth=5)
forest_model.fit(forest_X_train,forest_y_train)
RandomForestClassifier(max_depth=5)

# Making Random Forest prediction
forest_y_pred = forest_model.predict(forest_X_test)
# Printing classifier report
forest_report = classification_report(forest_y_test,forest_y_pred)
print(forest_report)
precision recall f1-score support

0 0.77 0.82 0.80 11021
1 0.80 0.76 0.78 10789

accuracy 0.79 21810
macro avg 0.79 0.79 0.79 21810
weighted avg 0.79 0.79 0.79 21810
# Printing the model accuracy
forest_accuracy = metrics.accuracy_score(forest_y_test, forest_y_pred)*100
forest_accuracy
78.72994039431454
forest_Y_predict_prob = forest_model.predict_proba(forest_X_test)[:,1]
forest_Y_predict_prob
array([0.77282383, 0.12004477, 0.55445956, ..., 0.14645238, 0.57616158,
0.44912768])

B). Model Evaluation

i). Model Accuracy

A model’s accuracy score is defined as:

the number of correct predictions divided by the total number of predictions made.
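
A minimal sketch with hypothetical labels (not the report's results):

import numpy as np

y_actual  = np.array([0, 0, 1, 1, 1, 0, 1, 0])   # hypothetical true labels
y_predict = np.array([0, 1, 1, 1, 0, 0, 1, 0])   # hypothetical model predictions

accuracy = np.mean(y_actual == y_predict)        # correct predictions / total predictions
print(accuracy)                                  # 0.75 -> 6 of the 8 predictions were correct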

# Accuracy Data
print(f'Bayes accuracy - {bayes_accuracy}')
print(f'Decision Tree accuracy - {tree_accuracy}')
print(f'Neural Network accuracy - {nn_accuracy}')
print(f'KNN accuracy - {knn_accuracy}')
print(f'HistGradientBoost accuracy - {HGB_accuracy}')
print(f'Support Vector Machine accuracy - {svm_accuracy}')
print(f'Random Forest - {forest_accuracy}')

accuracy_data = [['Naive Bayes', bayes_accuracy],['Decision Tree', tree_accuracy],
['Neural Network', nn_accuracy], ['KNN',knn_accuracy],
['HGBoost',HGB_accuracy],['SVM',svm_accuracy],['Random Forest', forest_accuracy]]
accuracy_df = pd.DataFrame(accuracy_data, columns=['Model','Accuracy'])
Bayes accuracy - 72.51719394773039
Decision Tree accuracy - 79.26639156350298
Neural Network accuracy - 80.70609812012837
KNN accuracy - 84.54837230628152
HistGradientBoost accuracy - 92.03576341127922
Support Vector Machine accuracy - 74.3145346171481
Random Forest - 78.72994039431454
fig = plt.subplots(figsize=(10,8))
sb.barplot(data=accuracy_df,x='Model', y='Accuracy')
<Axes: xlabel='Model', ylabel='Accuracy'>
png

The accuracy scores presented above show that:

  • the Histogram-Gradient Booster is the most accurate classification technique, with an accuracy score of 92.04%. Bearing in mind this is an ensemble technique that combines many weak predictors into a much stronger predictor, this is not a surprising result.
  • k Nearest Neighbours is the second most accurate classification technique, with an accuracy score of 84.55%.
  • Naive Bayes is the least accurate classification technique at 72.52%. This is as expected, given the simple nature of the technique.

ii). Model Confusion Matrix

A confusion matrix breaks down exactly how many predictions the model got correct (true positives and true negatives) and how many it got wrong (false positives and false negatives). This evaluation is particularly useful for imbalanced datasets.
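
As a sketch of how the four cells relate to precision and recall, using scikit-learn's layout (rows = actual, columns = predicted) and hypothetical labels:

from sklearn.metrics import confusion_matrix

y_actual  = [0, 0, 1, 1, 1, 0, 1, 0]   # hypothetical true labels
y_predict = [0, 1, 1, 1, 0, 0, 1, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_actual, y_predict).ravel()
print('TP:', tp, 'TN:', tn, 'FP:', fp, 'FN:', fn)

precision = tp / (tp + fp)   # of the clients predicted to subscribe, how many actually did
recall    = tp / (tp + fn)   # of the clients who actually subscribed, how many the model found
print('precision:', precision, 'recall:', recall)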

# Confusion Matrix
# passing actual and predicted values
bayes_cm = confusion_matrix(bayes_y_test, bayes_y_pred)
tree_cm = confusion_matrix(tree_y_test, tree_y_pred)
nn_cm = confusion_matrix(nn_y_test, nn_y_pred)
knn_cm = confusion_matrix(knn_y_test, knn_y_pred)
HGB_cm = confusion_matrix(HGB_y_test, HGB_y_pred)
svm_cm = confusion_matrix(svm_y_test, svm_y_pred)
forest_cm = confusion_matrix(forest_y_test, forest_y_pred)

#PLot of Confusion Matrix
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(18, 12))
(
sb.heatmap(bayes_cm, annot=True, ax=axes[0,0], fmt="d")
.set(title='Naive Bayes Confusion Matrix', xlabel='Predicted', ylabel='Actuals')
)

(
sb.heatmap(tree_cm, annot=True, ax=axes[0,1], fmt="d")
.set(title='Decision Tree Confusion Matrix', xlabel='Predicted', ylabel='Actuals')
)
(
sb.heatmap(nn_cm, annot=True, ax=axes[0,2], fmt="d")
.set(title='Neural Network Confusion Matrix', xlabel='Predicted', ylabel='Actuals')
)
(
sb.heatmap(knn_cm, annot=True, ax=axes[1,0], fmt="d")
.set(title='K Nearest Neighbour Confusion Matrix', xlabel='Predicted', ylabel='Actuals')
)
(
sb.heatmap(HGB_cm, annot=True, ax=axes[1,1], fmt="d")
.set(title='HistGradientBooster Confusion Matrix', xlabel='Predicted', ylabel='Actuals')
)
(
sb.heatmap(svm_cm, annot=True, ax=axes[1,2], fmt="d")
.set(title='Support Vector Machine Confusion Matrix', xlabel='Predicted', ylabel='Actuals')
)
(
sb.heatmap(forest_cm, annot=True, ax=axes[2,0], fmt="d")
.set(title='Random Forest Confusion Matrix', xlabel='Predicted', ylabel='Actuals')
)

fig.delaxes(axes[2, 1])
fig.delaxes(axes[2, 2])

plt.tight_layout()
png

Some information can be gleaned from the confusion matrices above:

  • k Nearest Neighbour correctly predicted 9,800 true positives; in other words, kNN predicted the greatest number of term deposit sales for the bank.
  • k Nearest Neighbour was the 4th best performing classification technique at correctly predicting true negatives.
  • Histogram-Gradient Booster was the second best classification technique for true positives, predicting 9,501.
  • Histogram-Gradient Booster was the best classification technique at predicting true negatives, with 10,449 correctly predicted.
  • Neural Network predicted a large number of false negatives (2,579). This can be compared with the Histogram-Gradient Booster’s 572 false positives.
  • Naive Bayes predicted the greatest number of false negatives (3,538). This would indicate a substantial loss of sales if Naive Bayes classification were adopted by the bank.

iii). ROC Curve

Receiver Operating Characteristic (ROC) curves plot a classification model’s true positive rate against its false positive rate across all classification thresholds, using the model’s predicted probabilities.

The Area Under the Curve (AUC) summarises the curve into a single measure of how well the model separates the two classes; the closer to 1, the better.
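
A minimal sketch of how one point on the curve is produced (hypothetical probabilities, not the models' outputs): pick a threshold, turn the predicted probabilities into hard 0/1 labels, and compute the true and false positive rates; sweeping the threshold traces out the ROC curve.

import numpy as np

y_actual = np.array([0, 0, 1, 1, 1, 0, 1, 0])                     # hypothetical true labels
y_prob   = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])    # hypothetical predicted P(y = 1)

for threshold in [0.3, 0.5, 0.7]:
    y_pred = (y_prob >= threshold).astype(int)
    tpr = np.sum((y_pred == 1) & (y_actual == 1)) / np.sum(y_actual == 1)   # true positive rate
    fpr = np.sum((y_pred == 1) & (y_actual == 0)) / np.sum(y_actual == 0)   # false positive rate
    print(f'threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}')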

#Model Probabilities
bayes_Y_predict_prob
bayes_fpr , bayes_tpr, thresholds = roc_curve(bayes_y_test, bayes_Y_predict_prob)
bayes_roc_auc = auc(bayes_fpr, bayes_tpr)
print('AUC = ', bayes_roc_auc)
print('tpr = ', bayes_tpr)
print('fpr = ', bayes_fpr)
AUC = 0.7787347243592939
tpr = [0. 0.15339698 0.15450922 ... 0.99990731 1. 1. ]
fpr = [0. 0.01488068 0.01488068 ... 0.9916523 0.9916523 1. ]
tree_Y_predict_prob
tree_fpr , tree_tpr, thresholds = roc_curve(tree_y_test, tree_Y_predict_prob)
tree_roc_auc = auc(tree_fpr, tree_tpr)
print('AUC = ', tree_roc_auc)
print('tpr = ', tree_tpr)
print('fpr = ', tree_fpr)
AUC = 0.8420156670710688
tpr = [0. 0.06182223 0.55584392 0.55908796 0.6984892 0.71693391
0.81008435 0.95263695 1. ]
fpr = [0. 0. 0.05444152 0.05507667 0.12013429 0.13320025
0.27756102 0.70783051 1. ]
nn_Y_predict_prob
nn_fpr , nn_tpr, thresholds = roc_curve(nn_y_test, nn_Y_predict_prob)
nn_roc_auc = auc(nn_fpr, nn_tpr)
print('AUC = ', nn_roc_auc)
print('tpr = ', nn_tpr)
print('fpr = ', nn_fpr)
AUC = 0.8863422620684822
tpr = [0.00000000e+00 9.26869960e-05 5.98757994e-02 ... 9.99907313e-01
1.00000000e+00 1.00000000e+00]
fpr = [0. 0. 0. ... 0.99274113 0.99274113 1. ]
knn_Y_predict_prob
knn_fpr , knn_tpr, thresholds = roc_curve(knn_y_test, knn_Y_predict_prob)
knn_roc_auc = auc(knn_fpr, knn_tpr)
print('AUC = ', knn_roc_auc)
print('tpr = ', knn_tpr)
print('fpr = ', knn_fpr)
AUC = 0.90331035294066
tpr = [0. 0.74446195 0.90833256 0.97599407 1. ]
fpr = [0. 0.09218764 0.2160421 0.38853099 1. ]
HGB_Y_predict_prob
HGB_fpr , HGB_tpr, thresholds = roc_curve(HGB_y_test, HGB_Y_predict_prob)
HGB_roc_auc = auc(HGB_fpr, HGB_tpr)
print('AUC = ', HGB_roc_auc)
print('tpr = ', HGB_tpr)
print('fpr = ', HGB_fpr)
AUC = 0.967964763702531
tpr = [0.00000000e+00 9.26869960e-05 5.56121976e-04 ... 1.00000000e+00
1.00000000e+00 1.00000000e+00]
fpr = [0. 0. 0. ... 0.99891117 0.99909264 1. ]
svm_Y_predict_prob
svm_fpr , svm_tpr, thresholds = roc_curve(svm_y_test, svm_Y_predict_prob)
svm_roc_auc = auc(svm_fpr, svm_tpr)
print('AUC = ', svm_roc_auc)
print('tpr = ', svm_tpr)
print('fpr = ', svm_fpr)
AUC = 0.8243841968411084
tpr = [0.00000000e+00 9.26869960e-05 4.63434980e-04 ... 9.99907313e-01
1.00000000e+00 1.00000000e+00]
fpr = [0. 0. 0. ... 0.99655204 0.99655204 1. ]
forest_Y_predict_prob
forest_fpr , forest_tpr, thresholds = roc_curve(forest_y_test, forest_Y_predict_prob)
forest_roc_auc = auc(forest_fpr, forest_tpr)
print('AUC = ', forest_roc_auc)
print('tpr = ', forest_tpr)
print('fpr = ', forest_fpr)
AUC = 0.8748642042157
tpr = [0.00000000e+00 9.26869960e-05 7.41495968e-04 ... 1.00000000e+00
1.00000000e+00 1.00000000e+00]
fpr = [0. 0. 0. ... 0.99809455 0.99827602 1. ]
fig = plt.subplots(figsize=(10,8))
plt.plot(bayes_fpr, bayes_tpr, label= '%s ROC (area = %0.2f)' % ('Naive Bayes', bayes_roc_auc))
plt.plot(tree_fpr, tree_tpr, label= '%s ROC (area = %0.2f)' % ('Decision Tree', tree_roc_auc))
plt.plot(nn_fpr, nn_tpr, label= '%s ROC (area = %0.2f)' % ('Neural Network', nn_roc_auc))
plt.plot(knn_fpr, knn_tpr, label= '%s ROC (area = %0.2f)' % ('KNN', knn_roc_auc))
plt.plot(HGB_fpr, HGB_tpr, label= '%s ROC (area = %0.2f)' % ('HistGradientBooster', HGB_roc_auc))
plt.plot(svm_fpr, svm_tpr, label= '%s ROC (area = %0.2f)' % ('Support Vector Machine', svm_roc_auc))
plt.plot(forest_fpr, forest_tpr, label= '%s ROC (area = %0.2f)' % ('Random Forest', forest_roc_auc))
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('ROC Curve')
plt.plot([0, 1], [0, 1], 'k--')
plt.show()
png

The AUC scores mark the Histogram-Gradient Booster as the best-performing classification model at 0.97, with k Nearest Neighbours second at 0.90.

Unlike the accuracy score mentioned above, the AUC takes into account the model’s performance across all classification thresholds, balancing sensitivity (recall of the positive class) against the false positive rate.

iv). K-Fold Cross Validation

log_cols=["Classifier", "Accuracy"]
log = pd.DataFrame(columns=log_cols)

classifiers = [
('KNN', knn_model),
('Naive Bayes', gnb),
('Decision Tree', dt),
('Random Forest', forest_model),
('Neural Network', nn_model),
('HGBoost', HGB_model)]
# Applies K-Fold cross-validation to 6 of the 7 classification models used in this report.
# The number of folds (K) increases iteratively from 2 to 5 to test the consistency of the models' accuracy.
# Support Vector Machine was not included in a bid to reduce the run time of the report.
# The results are then saved to a text file (KFold_verification_report.txt).
KFold_verification_file = open("KFold_verification_report.txt", "w")
for name, model in classifiers:
    counter = 2
    title = f'#########\n{name}\n#########\n'
    KFold_verification_file.writelines(title)
    while counter < 6:
        scores = cross_val_score(model, X_train, y_train, scoring='accuracy',
                                 cv=KFold(n_splits=counter, shuffle=True, random_state=42),
                                 n_jobs=-1)
        line = f'Classification:{model}\n KFold splits {counter}\n Accuracy: {mean(scores)} {std(scores)}\n'
        counter = counter + 1
        KFold_verification_file.writelines(line)
        print(f'Completed report for {name}.')

KFold_verification_file.close()
Completed report for KNN.
Completed report for KNN.
Completed report for KNN.
Completed report for KNN.
Completed report for Naive Bayes.
Completed report for Naive Bayes.
Completed report for Naive Bayes.
Completed report for Naive Bayes.
Completed report for Decision Tree.
Completed report for Decision Tree.
Completed report for Decision Tree.
Completed report for Decision Tree.
Completed report for Random Forest.
Completed report for Random Forest.
Completed report for Random Forest.
Completed report for Random Forest.
Completed report for Neural Network.
Completed report for Neural Network.
Completed report for Neural Network.
Completed report for Neural Network.
Completed report for HGBoost.
Completed report for HGBoost.
Completed report for HGBoost.
Completed report for HGBoost.
# Reads the saved K-Fold report back in and prints it.
KFold_verification_file = open("KFold_verification_report.txt", "r")
for line in KFold_verification_file:
    print(line)
KFold_verification_file.close()
#########

KNN

#########

Classification:KNeighborsClassifier(n_neighbors=3)

KFold splits 2

Accuracy: 0.8104857726772521 0.0005895299481213923

Classification:KNeighborsClassifier(n_neighbors=3)

KFold splits 3

Accuracy: 0.8258331226412822 0.003009656019646588

Classification:KNeighborsClassifier(n_neighbors=3)

KFold splits 4

Accuracy: 0.8331433736833831 0.004535380583104739

Classification:KNeighborsClassifier(n_neighbors=3)

KFold splits 5

Accuracy: 0.8350690344278252 0.004298405622793742

#########

Naive Bayes

#########

Classification:GaussianNB()

KFold splits 2

Accuracy: 0.7232156893570193 0.00029476497406066837

Classification:GaussianNB()

KFold splits 3

Accuracy: 0.7230977530323256 0.0014240380650607956

Classification:GaussianNB()

KFold splits 4

Accuracy: 0.7232156893570193 0.0018670516768723347

Classification:GaussianNB()

KFold splits 5

Accuracy: 0.723097692748356 0.003540038077199142

#########

Decision Tree

#########

Classification:DecisionTreeClassifier(max_depth=3, random_state=45)

KFold splits 2

Accuracy: 0.7911884923754127 0.0003144159723313944

Classification:DecisionTreeClassifier(max_depth=3, random_state=45)

KFold splits 3

Accuracy: 0.7912670726509544 0.0010726608077781356

Classification:DecisionTreeClassifier(max_depth=3, random_state=45)

KFold splits 4

Accuracy: 0.7912081433736834 0.0012358275179633579

Classification:DecisionTreeClassifier(max_depth=3, random_state=45)

KFold splits 5

Accuracy: 0.791267074259376 0.001276471336819312

#########

Random Forest

#########

Classification:RandomForestClassifier(max_depth=5)

KFold splits 2

Accuracy: 0.781814966200283 0.0034782266939160644

Classification:RandomForestClassifier(max_depth=5)

KFold splits 3

Accuracy: 0.7813629799414379 0.0005158087421734511

Classification:RandomForestClassifier(max_depth=5)

KFold splits 4

Accuracy: 0.7824634491432164 0.001851370179738072

Classification:RandomForestClassifier(max_depth=5)

KFold splits 5

Accuracy: 0.7855485128783511 0.00579122730070771

#########

Neural Network

#########

Classification:MLPClassifier()

KFold splits 2

Accuracy: 0.7845857569564534 0.004519729602263767

Classification:MLPClassifier()

KFold splits 3

Accuracy: 0.7972212672061806 0.002972989640945583

Classification:MLPClassifier()

KFold splits 4

Accuracy: 0.7993043546612167 0.003223422676408872

Classification:MLPClassifier()

KFold splits 5

Accuracy: 0.7992061903405807 0.007026778543617887

#########

HGBoost

#########

Classification:HistGradientBoostingClassifier()

KFold splits 2

Accuracy: 0.9149701304826285 0.003124508725043218

Classification:HistGradientBoostingClassifier()

KFold splits 3

Accuracy: 0.9152256049797529 0.0018754718477014165

Classification:HistGradientBoostingClassifier()

KFold splits 4

Accuracy: 0.9191557931142902 0.002271709796670077

Classification:HistGradientBoostingClassifier()

KFold splits 5

Accuracy: 0.9175443075716625 0.002521078083626973

K-Fold Cross Validation was applied to 6 of the 7 classification techniques used in this report as a means of testing the consistency of each technique's accuracy.

There was very little variation in the accuracy scores across the different numbers of splits (k = 2 to 5); the scores tended to stay within one percentage point of each model's baseline accuracy.
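The log DataFrame created at the start of this step was never populated. As an illustrative sketch only (not part of the original analysis), the cross-validated accuracies could be gathered into it to give a single side-by-side comparison, reusing the classifiers list and the X_train/y_train arrays defined earlier:

cv_rows = []
for name, model in classifiers:
    # 5-fold cross-validated accuracy for each classifier.
    scores = cross_val_score(model, X_train, y_train, scoring='accuracy',
                             cv=KFold(n_splits=5, shuffle=True, random_state=42),
                             n_jobs=-1)
    cv_rows.append({"Classifier": name, "Accuracy": mean(scores)})

log = pd.DataFrame(cv_rows, columns=log_cols)
print(log.sort_values("Accuracy", ascending=False))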

v). Model Tuning

# Model Tuning for the top performing model, HGBoost.
param_grid = {
    'min_samples_leaf': [5, 15, 20, 25],  # minimum no. of samples present in the leaf node after splitting a node
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10]  # used to control over-fitting
}
grid_search = GridSearchCV(HGB_model, param_grid, cv=10, scoring='accuracy')
grid_search.fit(X_train, y_train)
GridSearchCV(cv=10, estimator=HistGradientBoostingClassifier(),
             param_grid={'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
                         'min_samples_leaf': [5, 15, 20, 25]},
             scoring='accuracy')
print(f'HGBooster model accuracy before Model Tuning: {HGB_accuracy}')
print('Best hyperparameters are: '+str(grid_search.best_params_))
print('Best score is: '+str(grid_search.best_score_*100))
print('Best estimator is: '+str(grid_search.best_estimator_))
HGBooster model accuracy before Model Tuning: 92.03576341127922
Best hyperparameters are: {'max_depth': 10, 'min_samples_leaf': 25}
Best score is: 91.74460136303362
Best estimator is: HistGradientBoostingClassifier(max_depth=10, min_samples_leaf=25)
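As a follow-up sketch (not output from the original run), the tuned estimator returned by GridSearchCV, which is already refitted on the full training data when refit=True (the default), could be scored on the hold-out split to check whether the selected hyperparameters actually improve test accuracy. The names HGB_X_test and HGB_y_test below are assumed to refer to the test split used for the HGB model earlier in this report; substitute the actual variable names if they differ.

# Best estimator found by the grid search (already refitted on X_train/y_train).
best_HGB = grid_search.best_estimator_

# HGB_X_test / HGB_y_test are assumed names for the hold-out split used earlier.
tuned_accuracy = best_HGB.score(HGB_X_test, HGB_y_test) * 100
print(f'HGBooster model accuracy after Model Tuning: {tuned_accuracy}')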

Findings & Recommendations

We ran 7 different models to explore what might work best for the telemarketing dataset. This report recommends that the bank adopt the Histogram-Gradient Boosting model, as it may be the most effective at predicting the presence and absence of future sales among its clients. In terms of Accuracy it achieved a score of 91.8%, and it also achieved high scores for Precision, Recall and F1-Score. We also noted that this accuracy score was higher than that achieved by the researchers in the study "A Data-Driven Approach to Predict the Success of Telemarketing" (Moro et al., 2014). That study found that the random forest model performed best in predicting the success of the telemarketing campaign; however, it must be noted that that study used different input variables than were used for this report.

Summary

For this Assignment (Problem Set 1 - Portuguese Bank Telemarketing Campaign), we began by downloading the dataset from the UCI ML Repository and reviewing the metadata and variable descriptions it provided. We then read and discussed the study "A Data-Driven Approach to Predict the Success of Telemarketing" (Moro et al., 2014) to see how its authors approached their analysis of the dataset. After importing the dataset we began our EDA, which included:
  • Getting a basic overview of the dataset (datatypes, size, profile information, Unique Entities etc)
  • Looking at what cleaning would be required (NaNs, Outliers, Duplicates etc)
  • Correlation Analysis
  • Some Visualisation of the Data using Matplotlib
We then noted some of our initial insights into the data. Step 2 involved preprocessing and transforming the data. This included:
  • Removal of NaNs, Duplicates, some Outliers, an unnecessary column & some correlated variables
  • Recoding Categorical Data to Numeric Data using MinMaxScaler
  • Using SMOTE to over-sample the minority class and address the imbalance in the ‘y’ variable
The next step we undertook was modelling the data. We chose seven classification models:
  • Naive Bayes
  • Decision Trees
  • Neural Network
  • K Nearest Neighbour
  • Histogram-Gradient Booster
  • Support Vector Machine
  • Random Forests
We then evaluated the effectiveness of the 7 models using:
  • Accuracy scores
  • Confusion Matrices
  • ROC Curves
  • K-Fold Validation
Following this, we explored some model tuning for the top-performing model (HGB) and summarised our Findings and Recommendations based on our analysis of the dataset.