Well, these are interesting times, with so much uncertainty the world over. Many of us have been languishing since COVID derailed our plans. Amidst all this, I have managed, albeit painfully, to cultivate a new passion: playing with data! In this post, I will walk you through a machine learning model I recently built that predicts whether or not a customer will subscribe to a bank’s term deposit based on several factors.

Background of the Project

The Bank of Portugal has a huge amount of data that includes the profiles of customers who subscribed to a term deposit and of those who did not. As their newly employed machine learning researcher, I have been asked to come up with a robust predictive model that would help them identify which customers would or would not subscribe to a term deposit in the future. The dataset can be accessed here; I worked with the CSV file named bank_additional_full.csv.

Exploratory Data Analysis

To understand the dataset, I used a combination of Tableau, Seaborn, and Matplotlib. A story (comprising two dashboards) that I produced from the analysis in Tableau can be found here. The helper functions I used for the Seaborn and Matplotlib plots are shown below:

import matplotlib.pyplot as plt
import seaborn as sns

def catplot(x, data):
    # Count plot of a single categorical column
    plot = sns.catplot(x=x, kind="count", data=data, palette="Set1")
    plt.xticks(rotation=45, horizontalalignment="right")
    plt.title("Counts of " + x)
    return plot

def boxplot(x, y, data=data, hue="y"):
    # Box plot of y against x, split by the target column "y";
    # the default for data is the DataFrame already loaded as `data`
    plot = sns.boxplot(x=x, y=y, hue=hue, data=data)
    plt.xticks(rotation=45, horizontalalignment="right")
    plt.title("Boxplot of " + x.upper() + " and " + y.upper())
    return plot

The plots I obtained are available in the project notebook, which I link to at the end of this article. However, one noteworthy plot is shown below:

Image by author

The plot above shows the imbalanced nature of the dataset: there are far more “no” than “yes” labels, which would skew the model toward predicting “no”. One way to deal with this is to balance the dataset using SMOTE (Synthetic Minority Over-sampling Technique), and that is exactly the approach I took.
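The imbalance can also be checked numerically. Here is a minimal sketch, assuming the target column is named "y" (as the hue argument in the plotting helpers suggests); the toy DataFrame is made up for illustration and is not the real data:

```python
import pandas as pd

# Toy stand-in for the target column "y"; the real values come from the CSV.
toy = pd.DataFrame({"y": ["no", "no", "no", "no", "yes"]})

# Absolute counts and relative shares per class
counts = toy["y"].value_counts()
shares = toy["y"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real dataset, the same two calls on data["y"] would show how heavily “no” dominates.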

Data Pre-Processing

As with almost all data endeavors, so it is with ML: most of the time is spent getting the data ready for the actual analysis. Since this is a classification problem and the data has to be in a form that the model can understand, I one-hot encoded all the categorical columns using the pandas get_dummies function, as shown below:

import pandas as pd

def createDummies(data):
    # One-hot encode every categorical feature column
    df = pd.get_dummies(data=data, columns=['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'day_of_week'])
    return df

df = createDummies(data)
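To see what get_dummies does to a single column, here is a small sketch on a made-up DataFrame (the values are illustrative, not taken from the bank data):

```python
import pandas as pd

# Made-up rows: one numerical column and one categorical column
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["single", "married", "single"],
})

# Each category of "marital" becomes its own indicator column
encoded = pd.get_dummies(data=toy, columns=["marital"])
print(encoded.columns.tolist())
# ['age', 'marital_married', 'marital_single']
```

The original categorical column is dropped and replaced by one indicator column per category, while the numerical column is left untouched.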

After the encoding, I scaled the numerical columns using StandardScaler, fitting the scaler and transforming the columns in a single step:

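from sklearn.preprocessing import StandardScaler
# num_cols is the list of numerical column names (defined in the notebook)
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])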
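As a quick illustration of what the scaling does, the sketch below applies StandardScaler to a made-up DataFrame. The column names and values here are assumptions for the example, not the notebook's actual num_cols:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up numerical columns on very different scales
toy = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "duration": [100.0, 200.0, 600.0],
})
num_cols = ["age", "duration"]  # hypothetical list for this example

scaler = StandardScaler()
toy[num_cols] = scaler.fit_transform(toy[num_cols])

# After scaling, each column has mean ~0 and (population) standard deviation ~1
print(toy[num_cols].mean().round(6))
print(toy[num_cols].std(ddof=0).round(6))
```

Putting the columns on a common scale keeps features with large raw values from dominating models that are sensitive to feature magnitude, such as Logistic Regression.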

Then I split the dataset into train and test parts, after which I balanced the training set using SMOTE, as mentioned above, and plotted a graph to show the balanced dataset:

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=33)
# fit_resample replaces fit_sample, which newer imbalanced-learn releases removed
X_train_new, y_train_new = sm.fit_resample(X_train, y_train.ravel())
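For intuition about what SMOTE does under the hood, here is a simplified NumPy sketch of its core idea: a synthetic minority sample is placed at a random position on the line between a minority point and one of its nearest minority neighbours. This is not the imbalanced-learn implementation; the function name synthetic_point and the toy points are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(33)

# Toy minority-class points in two dimensions (illustrative values only)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])

def synthetic_point(points, index, k, rng):
    """Interpolate between points[index] and one of its k nearest neighbours."""
    base = points[index]
    distances = np.linalg.norm(points - base, axis=1)
    # The closest point is the base itself (distance 0), so skip it
    neighbour_ids = np.argsort(distances)[1:k + 1]
    neighbour = points[rng.choice(neighbour_ids)]
    gap = rng.random()  # uniform in [0, 1)
    return base + gap * (neighbour - base)

new_point = synthetic_point(minority, index=0, k=2, rng=rng)
print(new_point)
```

Because every synthetic point lies between two existing minority points, the new samples stay within the region the minority class already occupies rather than being exact duplicates.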

Now, the data is ready to be loaded into the models to make the necessary predictions.

I used Logistic Regression and Decision Trees for the modeling, and the test accuracy is over 80% for both; the details are available in the GitHub repo below. Undoubtedly, there is still a lot of work to be done to improve the quality of the model, which I will update as I get better in the field. And I would be remiss if I did not note my excitement about the value machine learning will allow me to bring to teams.
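The modeling step can be sketched roughly as follows. This is a minimal, self-contained example on synthetic data rather than the notebook's actual code: the make_classification stand-in, the 80/20 split, and the hyperparameters are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (roughly 90% negative class)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=33
)

# Stratify so both splits keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=33
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=33),
}

# Fit each model on the training split and score it on the held-out test split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

In the pipeline described in this post, the models would be fitted on the SMOTE-balanced X_train_new and y_train_new and evaluated on the untouched test split. Since the classes are imbalanced, precision and recall for the “yes” class are worth checking alongside accuracy.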

Cheers to growth!

Here is the notebook as promised.



