First Solo Machine Learning Model: Group B

Week 12: AI6 Ilorin

The Scenic Route
ai6-ilorin
7 min read · Mar 14, 2020


Several weeks ago, the class was randomly distributed into groups of three for group projects.

By class, I certainly do not mean “output categories” or “labeled output”; I’m referring to the entire AI Saturdays class that has converged for the past 12 weeks at the Mustapha Akanbi Library every Saturday to study AI.

We were permitted to collect our very own datasets to work on; whether classification or regression, each group was free to decide.

My team lead, in no time, laid hands on a rather large dataset for us to “wrangle”. The initial plan was to build and train a model that could predict whether a customer or client would default on a loan, depending on a number of factors (features). It turned out to be quite difficult to find any dataset that remotely matched that problem description.

Eventually, we went for another problem statement: a model that predicts the outcome of a marketing campaign to customers.

A Portuguese banking institution ran a marketing campaign targeting 41,188 customers, advertising a subscription offer to each of them. Clearly, this is binary classification, as each customer’s decision would be either to accept the subscription offer (True) or to decline it (False). A total of 21 independent variables are attached to the dataset, among which are some really great predictors, and also some really discouraging inputs that shouldn’t have appeared in the first place.

Head of DataFrame
Tail of DataFrame

Because it’s a classification problem and not a regression problem, our minds were pretty much set on using Logistic Regression as the ultimate model, after effective feature engineering of course. Later on, I implemented other classification algorithms such as Naive Bayes and Decision Trees to see if they could do better than the traditional sigmoid model we were used to. It ended in tears.

It was such a huge relief going through the team lead’s notebook and seeing code on other algorithms, from SVMs to Random Forests. He had comprehensively solved for each, comparing their efficacies and not neglecting the metrics peculiar to each (e.g. accuracy_score, f1_score).

I thought of those other algorithms because I had heard that real-life problems, with their inherently dirty datasets, require thinking outside the box and implementing more than one solution to see which works best. Unfortunately, the ego-dampening TypeErrors forced me back to the status quo: Logistic Regression.

Logistic Regression is a machine learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is usually a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X. Logistic regression can also handle multiclass variables, as in the case of cancers in patients (for example: Type 1, Type 2, Type 3).
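
To make that P(Y=1) idea concrete, here is a minimal sketch of the sigmoid function that gives the model its probability output (plain NumPy; the toy scores are mine):

    import numpy as np

    def sigmoid(z):
        # Squash any real-valued score z into (0, 1), so the output
        # can be read as P(Y = 1 | X)
        return 1 / (1 + np.exp(-z))

    # A large positive score maps close to 1, a large negative one close to 0
    print(sigmoid(4.0))   # ~0.982
    print(sigmoid(-4.0))  # ~0.018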

Our goal was to train the model to optimally predict outcomes for our test data. Optimally, because the fate of the model rests on our shoulders. Everything we did, from df.head() to visualization, contributes heavily to the predicting ability of our model.

The Process

1. Importing Dependencies

Without dependencies, nothing would take shape. This involves invoking the spirits of Python’s powerful libraries like NumPy, scikit-learn, pandas and all.
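
A typical opening cell looks something like this (a sketch of the usual suspects, not our exact cell):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, f1_score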

2. Description

Before our team lead sent the description of each feature, I was 50% convinced the whole dataset had been written in Aramaic. Most of the input variables were abbreviated, and consequently most values didn’t mean anything. Personally, I was frustrated the whole time trying to work it out on my own, until we all got the link and read up thoroughly. I felt the need to integrate the same description into the notebook for clarity’s sake.

3. Data Display

The data was fed into the notebook through pandas, and we were able to explore many parts of it. We got a good look at the head, the tail, and the info on columns. Using describe(), we were able to study the properties of each numerical variable, such as age or duration of the campaign call in seconds. From there, we could ascertain the mean, the standard deviation, etc.
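
In code, that exploration goes roughly like this (a sketch; the file name assumes the UCI bank-additional-full.csv, which is semicolon-separated):

    import pandas as pd

    df = pd.read_csv('bank-additional-full.csv', sep=';')

    df.head()      # first five rows
    df.tail()      # last five rows
    df.info()      # column names, dtypes, non-null counts
    df.describe()  # mean, std, quartiles for numerical columns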

4. Categorical Features Sorted

In this dataset, we encountered quite a lot of categorical (non-numerical) values. Take the marital feature: it contains just divorced, married, single and unknown. To a model, little sense would come out of this, so we made the decision to convert the categorical variables into numerical variables. Thanks to the magic of Python.

Spot the difference:

Categorical variables with their non-numerical values
All columns, except the output variable, are now numerical features

Some of these features were dropped from the dataset, and some were “ordinalized” (mapped to ordered integer codes).
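
A rough sketch of both moves in pandas (the column names come from the dataset; the exact ordinal mapping here is illustrative, not necessarily our notebook’s):

    import pandas as pd

    # One-hot encode a nominal feature like 'marital'
    df = pd.get_dummies(df, columns=['marital'], drop_first=True)

    # "Ordinalize" an ordered feature like 'education' with an explicit mapping
    # ('unknown' is left out here and would need its own treatment,
    # e.g. dropping or imputing; unmapped values become NaN)
    education_order = {'illiterate': 0, 'basic.4y': 1, 'basic.6y': 2,
                       'basic.9y': 3, 'high.school': 4,
                       'professional.course': 5, 'university.degree': 6}
    df['education'] = df['education'].map(education_order)

    # Encode the binary target
    df['y'] = df['y'].map({'no': 0, 'yes': 1})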

5. Visualization

Before visualization, we needed to choose which inputs had the most correlation with the output. For instance, the day of the week when the campaign call was made isn’t really as relevant as its duration in seconds. The input variables with the obvious correlations were picked and set against the dependent variable, y, with contingency tables and then actual graphs (sketched after the figures below). Visualization helps the data really come alive and speak volumes to whoever wasn’t listening before.

Education plotted against ‘y’
Marital status plotted against ‘y’
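
Plots like those come from a pattern along these lines (a sketch using pandas crosstab and matplotlib; it works on raw labels or encoded codes alike):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Contingency table: counts of each education level by outcome
    table = pd.crosstab(df['education'], df['y'])
    print(table)

    # Then the actual graph: a stacked bar chart per education level
    table.plot(kind='bar', stacked=True)
    plt.xlabel('education')
    plt.ylabel('number of customers')
    plt.show()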

6. The Outcome Variable

Before the other visualizations appeared, we started by visualizing the output variable. A scary truth jumped out: our classes were imbalanced. About 88% of the 41,188 customers declined the offer, and the remaining ~12% agreed to subscribe, thus creating a great divide between an over-represented class (the stingy folk who didn’t subscribe) and an under-represented class (the spendthrifts who did).

An imbalance? Or good luck?
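
The check that exposed this is essentially one line (a sketch):

    import matplotlib.pyplot as plt

    # Share of each outcome class; roughly 0.88 'no' to 0.12 'yes' here
    print(df['y'].value_counts(normalize=True))

    df['y'].value_counts().plot(kind='bar')
    plt.show()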

In my opinion, it’s not a colossal tragedy, as most of the datasets we can access are engineered or simulated, as our instructor would put it. They are not realistic, and it is a little difficult to get one’s hands on real-life, properly filthy data, especially in a place like Nigeria, where everyone actually has PPD (paranoid personality disorder).

I was at an event last year solely to collect the attendees’ data and enter it into Excel, and more than one person threatened to go back home if I didn’t stop hassling them for “personal information”.

7. Training the model

More than one algorithm was considered, but switching algorithms alone couldn’t fix the problem of imbalanced classes, which had earlier deceived us into thinking we were doing well.

Imbalanced classes make for biased or skewed models, because the classes are not represented equally.

Oversampling and undersampling are techniques in data analysis used to adjust the class distribution of a dataset, and there are many methods for resampling the data.

We employed the SMOTE method for oversampling the data. Far too many people were represented in the “False” class compared to the “True” class. We could either overpopulate (oversample) the under-represented class (True) with more instances, or take away (undersample) from the over-represented class (False).

88:12

The algorithm we used is SMOTE, which stands for Synthetic Minority Oversampling Technique. What it does is create synthetic samples from the minority class, instead of merely creating copies.

SMOTE Technique
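
With the imbalanced-learn library, the resampling step looks roughly like this (a sketch; splitting first so the synthetic rows never leak into the test set):

    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import train_test_split

    X = df.drop('y', axis=1)
    y = df['y']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # SMOTE synthesizes new minority-class rows from nearest neighbours,
    # applied to the training set only
    smote = SMOTE(random_state=42)
    X_train_res, y_train_res = smote.fit_resample(X_train, y_train)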

After resampling the data, the Logistic Regression model was trained and evaluated.
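
In scikit-learn, that step is mercifully short (a sketch continuing from the resampled split above):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, f1_score

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_res, y_train_res)

    # Evaluate on the untouched test set
    y_pred = model.predict(X_test)
    print('accuracy:', accuracy_score(y_test, y_pred))
    print('f1:', f1_score(y_test, y_pred))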

Other algorithms were used as well:

Gaussian Naive Bayes
Support Vector Machines
k-nearest neighbours
Decision Trees

The above and several more algorithms were tested, and an evaluation was conducted to ascertain how well each model was doing. Random Forest ranked highest, while the Perceptron ranked lowest.

Simple and Sweet
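
One way to run that comparison is a simple loop (a sketch continuing from the earlier snippets; the exact model list and settings are assumptions, not our notebook’s):

    from sklearn.linear_model import LogisticRegression, Perceptron
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score

    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Gaussian Naive Bayes': GaussianNB(),
        'SVM': SVC(),
        'k-Nearest Neighbours': KNeighborsClassifier(),
        'Decision Tree': DecisionTreeClassifier(),
        'Random Forest': RandomForestClassifier(),
        'Perceptron': Perceptron(),
    }

    # Train each on the resampled data, score each on the untouched test set
    for name, clf in models.items():
        clf.fit(X_train_res, y_train_res)
        print(name, f1_score(y_test, clf.predict(X_test)))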

Parameters were also optimized using the GridSearchCV method, all for the purpose of getting the best possible outcomes.
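
For the logistic model, such a grid search might look like this (a sketch; the grid values are illustrative):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'C': [0.01, 0.1, 1, 10],  # inverse regularization strength
        'penalty': ['l2'],
    }
    grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        param_grid, scoring='f1', cv=5)
    grid.fit(X_train_res, y_train_res)
    print(grid.best_params_, grid.best_score_)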
