Bank marketing campaign Machine Learning model in Scala

Vivek Sasikumar
Aug 29, 2018 · 5 min read


For this project on machine learning in Scala, I used the Databricks Community Edition platform with a marketing campaign dataset from a Portuguese bank. It is a multivariate dataset with 45,211 instances and 17 attributes.

Data Set Information: The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’).

The input variables are as follows:
1 — age (numeric)
2 — job : type of job (categorical: ‘admin.’,’blue-collar’,’entrepreneur’,’housemaid’,’management’,’retired’,’self-employed’,’services’,’student’,’technician’,’unemployed’,’unknown’)
3 — marital : marital status (categorical: ‘divorced’,’married’,’single’,’unknown’; note: ‘divorced’ means divorced or widowed)
4 — education (categorical: ‘basic.4y’,’basic.6y’,’basic.9y’,’high.school’,’illiterate’,’professional.course’,’university.degree’,’unknown’)
5 — default: has credit in default? (categorical: ‘no’,’yes’,’unknown’)
6 — housing: has housing loan? (categorical: ‘no’,’yes’,’unknown’)
7 — loan: has personal loan? (categorical: ‘no’,’yes’,’unknown’)
# related with the last contact of the current campaign:
8 — contact: contact communication type (categorical: ‘cellular’,’telephone’)
9 — month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
10 — day_of_week: last contact day of the week (categorical: ‘mon’,’tue’,’wed’,’thu’,’fri’)
11 — duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 — campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 — pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 — previous: number of contacts performed before this campaign and for this client (numeric)
15 — poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,’nonexistent’,’success’)
# social and economic context attributes
16 — emp.var.rate: employment variation rate — quarterly indicator (numeric)
17 — cons.price.idx: consumer price index — monthly indicator (numeric)
18 — cons.conf.idx: consumer confidence index — monthly indicator (numeric)
19 — euribor3m: euribor 3 month rate — daily indicator (numeric)
20 — nr.employed: number of employees — quarterly indicator (numeric)

I used Spark with Scala to wrangle and pre-process the data, and then to evaluate and select the best machine learning model to use for future marketing campaigns for similar products.

The first step was to analyze the data for anomalies, clear out null values, and check for errors or missing information in the dataset. After verifying the data, no null values were found. Next was to determine the percentage of campaign instances in which the customer signed up for the offer. It was found to be 11.69%, which indicates an imbalanced dataset.
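A minimal sketch of this step, assuming the semicolon-delimited UCI CSV has been uploaded to Databricks (the file path here is hypothetical):

```scala
import org.apache.spark.sql.functions._

// Read the semicolon-delimited bank marketing CSV (path is hypothetical)
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ";")
  .csv("/FileStore/tables/bank_full.csv")

// Count nulls per column; all counts came back zero for this dataset
df.select(df.columns.map(c => sum(col(c).isNull.cast("int")).alias(c)): _*).show()

// Percentage of contacts that ended in a subscription ("yes")
val positiveRate = 100.0 * df.filter(col("y") === "yes").count / df.count
println(f"Positive outcome rate: $positiveRate%.2f%%")
```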

The features job, marital, education, default, housing, loan, contact, month and poutcome are string-based categoricals. To convert them to a machine-readable format, we use a string indexer and a one-hot encoder to expand each feature into its own set of binary (0/1) columns. StringIndexer(), OneHotEncoder() and VectorAssembler() were used to create the new dataframe, as sketched below.
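A hedged sketch of that feature preparation; the indexed and encoded column names are my own choices, and the numeric columns shown are an illustrative subset of the dataset's numeric fields:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

val categoricalCols = Array("job", "marital", "education", "default",
  "housing", "loan", "contact", "month", "poutcome")

// Index each string column, then expand the index into binary (0/1) vector columns
val indexers: Array[PipelineStage] = categoricalCols.map(c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"))
val encoders: Array[PipelineStage] = categoricalCols.map(c =>
  new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_vec"))

// Index the target y ("yes"/"no") into a numeric 0/1 label
val labelIndexer = new StringIndexer().setInputCol("y").setOutputCol("label")

// Numeric columns pass through as-is (an illustrative subset)
val numericCols = Array("age", "duration", "campaign", "pdays", "previous")
val assembler = new VectorAssembler()
  .setInputCols(categoricalCols.map(c => s"${c}_vec") ++ numericCols)
  .setOutputCol("features")

// Run all feature stages together to produce the new dataframe
val stages: Array[PipelineStage] = indexers ++ encoders :+ labelIndexer :+ assembler
val encoded = new Pipeline().setStages(stages).fit(df).transform(df)
```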

The new dataset was then split into training and testing data in an 80/20 proportion, respectively.
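In Spark this is a one-liner; `encoded` refers to the assembled dataframe from the previous step, and the seed is just for reproducibility:

```scala
// 80/20 train/test split with a fixed seed
val Array(training, testing) = encoded.randomSplit(Array(0.8, 0.2), seed = 42)
```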

Pipelines were used to set up various machine learning models: Logistic Regression, Random Forest Classifier, Linear Support Vector Machine and K-Means.
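A sketch of one of these, logistic regression, fitted on the training split; RandomForestClassifier, LinearSVC and KMeans slot into the same pattern (the hyperparameters shown are assumptions, not the exact values used in the original notebook):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Fit logistic regression on the training split
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(100)
val lrModel = lr.fit(training)

// Score the held-out test split and report area under the ROC curve
val predictions = lrModel.transform(testing)
val auc = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")
  .evaluate(predictions)
println(s"Test AUC = $auc")
```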

From the above table, we can see that the true positives and true negatives in the confusion matrix are imbalanced. This would affect predictions on future data.
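The confusion matrix itself can be read straight off the predictions dataframe, for example:

```scala
// Rows are actual labels, columns are predicted labels
predictions.groupBy("label").pivot("prediction").count().show()
```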

To get a better prediction model, we need to balance the positive and negative campaign outcomes in the data. Multiple datasets were created with equal numbers of positive and negative outcomes. Since negative outcomes make up roughly 88% of the data, there are about seven negative outcomes for every positive one, so I created three different random balanced datasets by undersampling the negatives.
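A sketch of that undersampling step, assuming StringIndexer assigned 0.0 to the majority "no" class and 1.0 to "yes" (its default frequency ordering):

```scala
// Undersample the majority "no" class to match the size of the "yes" class
val positives = encoded.filter(col("label") === 1.0)
val negatives = encoded.filter(col("label") === 0.0)
val fraction  = positives.count.toDouble / negatives.count

// Three random balanced datasets, one per seed
val balancedSets = Seq(1L, 2L, 3L).map { seed =>
  positives.union(negatives.sample(withReplacement = false, fraction, seed))
}
```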

The logistic regression model was then run again on the balanced data. The results were much sharper, and the confusion matrix looks much better, with higher precision and recall. Find the confusion matrix below:

From this, the model trained on the balanced datasets fit the test data well.

Therefore, we can use this model for future campaigns and deploy it into production to predict the probability of a campaign outcome for new customers.

Once the model has been serialized and rigorously tested, it can be deployed and embedded for real-time use in our application.
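Spark ML models can be persisted and reloaded directly; a minimal sketch, with a hypothetical save path:

```scala
import org.apache.spark.ml.classification.LogisticRegressionModel

// Persist the fitted model (the path is hypothetical)
lrModel.write.overwrite().save("/models/bank_campaign_lr")

// Reload it in the serving application for real-time scoring
val restored = LogisticRegressionModel.load("/models/bank_campaign_lr")
```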

I am a student of data science and new to Scala. I would love to hear your opinions on how I can improve my code and my skills.

