Building a Term Deposit Predictive Model

Lawal Adewale Ogunfowora
3 min read · Aug 29, 2020


Well, it’s such an interesting time we are living in, with all the uncertainty the world over. There has been so much languishing given the derailing of plans that accompanied COVID. Amidst all this, I have managed, albeit excruciatingly, to cultivate a new passion: playing with data! In this post, I will walk you through a recent machine learning model I built that predicts whether or not customers will purchase a bank’s term deposit based on several factors.

Background of the Project

The Bank of Portugal has a huge amount of data that includes the profiles of customers who subscribed to term deposits and those who did not. As their newly employed machine learning researcher, they want me to come up with a robust predictive model that would help them identify customers who would or would not subscribe to their term deposit in the future. The dataset can be accessed here; I worked with the CSV file named bank_additional_full.csv.

Exploratory Data Analysis

To understand the dataset, I used a combination of Tableau, Seaborn, and Matplotlib. A story (comprising two dashboards) that I produced from the analysis on Tableau can be found here. The helper functions I wrote for the Seaborn and Matplotlib plots are shown below:

import matplotlib.pyplot as plt
import seaborn as sns

def catplot(x, data):
    # Count plot of a single categorical column
    plot = sns.catplot(x=x, kind="count", data=data, palette="Set1")
    plt.xticks(rotation=45, horizontalalignment='right')
    plt.title("counts of " + x)
    return plot

def boxplot(x, y, data, hue="y"):
    # Boxplot of a numerical column against a categorical one,
    # split by the target variable
    plot = sns.boxplot(x=x, y=y, hue=hue, data=data)
    plt.xticks(rotation=45, horizontalalignment='right')
    plt.title("Boxplot of " + x.upper() + " and " + y.upper())
    return plot

The several plots I obtained are available in the project notebook, which I will share a link to at the end of this article. However, a noteworthy plot is shown below:

[Count plot of the target variable y, showing the class imbalance. Image by author]

The plot above shows the imbalanced nature of the dataset: there are far more “no” responses than “yes”, which would definitely skew the model toward predicting “no”. A way to deal with this is to balance the dataset using SMOTE, and that is exactly the approach I took.
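The imbalance is easy to quantify before plotting. Here is a quick sketch on a tiny stand-in frame (not the real bank data, whose actual ratio is more skewed than this toy example):

```python
import pandas as pd

# Tiny stand-in frame; the real target column comes from bank_additional_full.csv
data = pd.DataFrame({"y": ["no"] * 8 + ["yes"] * 2})

# Proportion of each class; a large gap here is what motivates SMOTE
proportions = data["y"].value_counts(normalize=True)
print(proportions["no"])  # 0.8
```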

Data Pre-Processing

As it is with almost all data endeavors, so it is with ML: most of the time is spent getting the data ready for actual analysis. Since this project is a classification problem and the data has to be in a form the model can understand, I encoded all the categorical columns using one-hot encoding via the pandas get_dummies method, depicted below:

def createDummies(data):
    # One-hot encode every categorical column in the bank dataset
    df = pd.get_dummies(data=data, columns=['job', 'marital', 'education',
                                            'default', 'housing', 'loan',
                                            'contact', 'month', 'poutcome',
                                            'day_of_week'])
    return df

df = createDummies(data)
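One step the snippet above leaves implicit is the target column y itself, which this dataset stores as “yes”/“no” strings. A minimal sketch of mapping it to 1/0 so scikit-learn models can consume it, shown on a tiny stand-in frame rather than the real data:

```python
import pandas as pd

# Tiny stand-in for the bank data; the real frame comes from bank_additional_full.csv
df = pd.DataFrame({"age": [30, 45], "y": ["yes", "no"]})

# Map the 'yes'/'no' target to 1/0 before modeling
df["y"] = df["y"].map({"yes": 1, "no": 0})
print(df["y"].tolist())  # [1, 0]
```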

After the encoding, I scaled the numerical columns using StandardScaler, fitting and transforming in one step:

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
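On a tiny stand-in array, this is what fit_transform does: each column ends up with mean 0 and unit variance, so no single numeric feature dominates the model by sheer scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in numeric columns; the real ones come from the encoded frame
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0, std 1
print(X_scaled.mean(axis=0))  # approximately [0. 0.]
```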

Then I split the dataset into train and test sets, after which I did the balancing I mentioned above using SMOTE and plotted a graph to show the balanced dataset:

from imblearn.over_sampling import SMOTE

# Oversample only the training set so the test set keeps the real class balance
sm = SMOTE(random_state=33)
X_train_new, y_train_new = sm.fit_resample(X_train, y_train.ravel())
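The split itself is not reproduced above; here is a sketch on synthetic data (the shapes and the stratify=y choice are my own, not necessarily the notebook’s). Stratifying keeps the class ratio identical in both splits, which matters for a target this imbalanced:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded, scaled feature matrix and target
X = np.random.rand(100, 5)
y = np.array([0] * 90 + [1] * 10)

# stratify=y preserves the 90/10 class ratio in both train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=33)
print(int(y_test.sum()))  # 2 positives out of 20 test samples
```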

Now, the data is ready to be loaded into the models to make the necessary predictions.

I used Logistic Regression and Decision Trees for the modeling, and the test accuracy exceeds 80% for both; the full results are available in the GitHub repo below. Unarguably, there is still a lot of work to be done to improve the quality of the model, which I will update as I get better in the field. And I would be remiss if I did not note my excitement about the value machine learning will let me bring to teams.
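The modeling code is not reproduced in this post, but a minimal sketch of fitting both classifiers and scoring accuracy looks like this. It runs on synthetic stand-in features rather than the bank data, so the numbers it prints say nothing about the real model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the encoded bank features and target
rng = np.random.default_rng(33)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # learnable signal for the toy example

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

accs = {}
for model in (LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=33)):
    model.fit(X_train, y_train)
    accs[type(model).__name__] = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, round(accs[type(model).__name__], 3))
```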

Cheers to growth!

Here is the notebook as promised. You can also reach me at ogunfoworalawal@gmail.com.


Lawal Adewale Ogunfowora

A Data Science enthusiast, intent on driving positive change using data