Understanding Classification and Regression Machine Learning Algorithms with Practical Case Studies (Part 1)

Kelechi
5 min read · Jul 9, 2018



Introduction

There is a lot of buzz around machine learning algorithms these days, and rightly so. Machine learning is arguably the most glorified of the current wave of technologies, alongside cloud computing, IoT, AR, VR and blockchain. To harness the full potential of machine learning algorithms, it is important to know the scenarios and case studies in which each is best applied. I set out to understand these algorithms in detail: I conducted classification and regression tasks and compared the results the algorithms obtained. I will explain carefully how I went about this and what I discovered along the way.

Classification Algorithms

Classification problems are those that require the prediction of states. Essentially, a classification problem requires you to predict which group, class or state an entity falls into: for example, predicting whether or not a credit card transaction is fraudulent, whether or not a tweet is hate speech, and so on.

The simplest classification problems have just two classes; these are binary classification problems. However, classification problems can have more than two classes, and these are referred to as multi-class classification problems.

There are several algorithms that can be used for classification problems. We will explore them in detail by using a practical case study.

Case Study: The Dream Housing Finance Loan Problem (Link)

About Company

Dream Housing Finance Company deals in all kinds of home loans and has a presence across urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility.

Problem

The company wants to automate the loan eligibility process in real time, based on the customer details provided in the online application form: gender, marital status, education, number of dependents, income, loan amount, credit history and others. To automate this process, the task is to identify the customer segments that are eligible for a loan, so that the company can specifically target these customers. A partial data set has been provided.

Approach

My approach was simple. I set out to understand the data and put myself in the shoes of the company manager. This way I would be able to think beyond the technical aspects and apply some domain knowledge when manipulating my data. I spent quality time exploring the data, cleaning it and creating new features before modelling.

Solution (GitHub Link)

I first detected extreme outliers using the Tukey method and dropped them (two in total), as sketched below. Thereafter I performed some exploratory data analysis, beginning with the statistical properties of the numerical variables in the data set. These descriptions suggested the presence of further outliers.
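The Tukey method flags points that fall outside fences placed at a multiple of the interquartile range from the quartiles. Here is a minimal sketch, assuming a pandas DataFrame named train, a hypothetical file path, an assumed column name and a 3 × IQR fence for extreme outliers; the exact choices in my notebook may differ:

```python
import pandas as pd

# Hypothetical path to the competition's training file
train = pd.read_csv("train.csv")

def tukey_extreme_outlier_index(df, column, k=3.0):
    """Index of rows outside Tukey's extreme fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = df[column].quantile(0.25), df[column].quantile(0.75)
    iqr = q3 - q1
    mask = (df[column] < q1 - k * iqr) | (df[column] > q3 + k * iqr)
    return df.index[mask]

# Assumed column name; drop the flagged rows
train = train.drop(tukey_extreme_outlier_index(train, "ApplicantIncome"))
```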

I then proceeded to visualizations. I started with strip plots of the target variable (loan status) against the numerical variables and made several observations, particularly about the applicant and co-applicant incomes. I then used hues to visualize three variables at once, two categorical and one numerical. This revealed how loan eligibility status varies across the numerical variables by gender, and so on.
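For illustration, a hue-based strip plot might look like the following; the column names (Loan_Status, ApplicantIncome, Gender) are assumptions about the data set, not confirmed from the notebook:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Strip plot of a numerical variable against the target, split by a hue
sns.stripplot(x="Loan_Status", y="ApplicantIncome", hue="Gender",
              data=train, jitter=True, dodge=True)
plt.show()
```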

Next, I used count plots to compare categorical variables. I noticed that applicants with credit histories seemed to have higher chances of obtaining loans than those without. I also noticed relationships between other categorical variables, all of which are well explained in my notebook.
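A count plot comparing credit history against loan status might look like this, again with assumed column names:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Counts of approved vs. rejected loans for each credit-history value
sns.countplot(x="Credit_History", hue="Loan_Status", data=train)
plt.show()
```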

I detected and filled in missing values where appropriate. I then converted the categorical variables to numerical ones, since some of our classification algorithms require that. I also detected some new outliers and log-transformed the affected variables.
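A sketch of these steps, with assumed column names and fill strategies (mode for categorical, median for numerical); the notebook's exact choices may differ:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fill missing values: mode for categorical, median for numerical (assumed strategy)
train["Gender"] = train["Gender"].fillna(train["Gender"].mode()[0])
train["LoanAmount"] = train["LoanAmount"].fillna(train["LoanAmount"].median())

# Encode the remaining categorical (object) columns as integers
for col in train.select_dtypes(include="object").columns:
    train[col] = LabelEncoder().fit_transform(train[col].astype(str))

# Log-transform a skewed variable; log1p handles zero values safely
train["LoanAmount_log"] = np.log1p(train["LoanAmount"])
```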

I created some new features based on observations from my exploratory data analysis. I first created family income, the sum of the applicant and co-applicant incomes. Thereafter, I created five ratio features: loan amount/loan amount term, family income/loan amount, family income/loan amount term, applicant income/loan amount and applicant income/loan amount term.

The first shows how much the applicant has to pay per month to settle the loan. The second and fourth tell us how large the loan amount is relative to the family's and the applicant's income. The third and fifth tell us how much the family and the applicant have to pay monthly over the course of the loan term. There is no information on whether the reported income is the total income over the loan term or a yearly figure, but I assumed the former. A sketch of these features follows.
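This is how the engineered features might be built, assuming the column names used above (Loan_Amount_Term and friends); the new feature names are my own:

```python
# Family income: applicant plus co-applicant income
train["Family_Income"] = train["ApplicantIncome"] + train["CoapplicantIncome"]

# The five ratio features described above
train["Loan_Per_Term"] = train["LoanAmount"] / train["Loan_Amount_Term"]
train["Family_Income_To_Loan"] = train["Family_Income"] / train["LoanAmount"]
train["Family_Income_Per_Term"] = train["Family_Income"] / train["Loan_Amount_Term"]
train["Income_To_Loan"] = train["ApplicantIncome"] / train["LoanAmount"]
train["Income_Per_Term"] = train["ApplicantIncome"] / train["Loan_Amount_Term"]
```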

I then proceeded to modelling, where I identified the important features according to the gradient boosting classifier. I used the following algorithms for modelling; a sketch of fitting and comparing them appears after the list:

· Logistic Regression: Even though it has the word regression in its name, this is a classification algorithm. It is based on the logistic function, which maps a linear combination of the input variables to probability-like predictions between zero and one that determine the class of a data point. Logistic regression is a very simple algorithm and works well when there is a strong linear relationship between the input variables and the outcome. However, collinearity between input variables hurts this model. It is also prone to overfitting, so regularization is advised.

· Linear Discriminant Analysis: This algorithm classifies by calculating statistical properties of the data, usually the mean and variance of each class, and using them to make predictions. It assumes the data follows a Gaussian distribution, so it is very sensitive to outliers. It can also be used as a dimensionality reduction algorithm, and it handles multi-class classification problems well. It is advisable to standardize your data before fitting it.

· Gradient Boosting Classifier: Boosting is an ensemble method that trains several weak learners sequentially, with every successive weak learner focusing on learning from the mistakes of the previous one. The boosting technique then combines all the weak learners into a single strong learner.

· Naive Bayes Classifier: This classification algorithm assumes independence between input variables. Based on Bayes' theorem, it assumes that the presence or value of one feature tells us nothing about the presence or value of another.

· Random Forest Classifier: This algorithm combines several decision tree learners by bagging, a method where the learners are trained in parallel on bootstrap samples of the data and their predictions are combined, smoothing out the flaws of the individual trees. A random forest additionally considers only a random subset of the features at each split, which decorrelates the trees.

· Support Vector Classifier: This algorithm aims to separate the input variable space as distinctly as possible, finding the hyperplane that best separates the classes. The distance between the separating hyperplane and the closest data points is referred to as the margin. The aim is to find the largest margin, which yields the most distinct separation.
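To make the comparison concrete, here is a sketch of how these six classifiers might be fitted and compared with scikit-learn, and how feature importances could be read off the gradient boosting model. X and y stand for the preprocessed features and the loan status target, and the hyperparameters shown are defaults, not the settings tuned in my notebook:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Assumed: X holds the preprocessed features, y the encoded loan status
X = train.drop(columns=["Loan_Status"])
y = train["Loan_Status"]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVC": SVC(),
}

# 5-fold cross-validated accuracy for each model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Feature importances according to the gradient boosting classifier
gbc = models["Gradient Boosting"].fit(X, y)
for feature, score in sorted(zip(X.columns, gbc.feature_importances_),
                             key=lambda p: p[1], reverse=True):
    print(f"{feature}: {score:.3f}")
```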

Conclusion

Upon submitting to the competition platform, the Linear Discriminant Analysis, Logistic Regression and Gradient Boosting classifiers gave the best results on the test data when fitted on the data without the extra preprocessing, with accuracies of over 78%. However, on the pre-processed training data, Naive Bayes and SVC gave the best accuracies, of about 72%. This gives us an idea of which algorithms are affected most by outliers.

I also ended up with a peak accuracy of 78.472%, placing me amongst the top 4% out of about 27k registered participants.

Here is part 2 explaining the regression bit.

PS: the notebook on my GitHub is very well explained. Check it out.
