Understanding Classification and Regression Machine Learning Algorithms with Practical Case Studies (Part 2)

Kelechi
5 min read · Jul 10, 2018


We explored classification algorithms with a practical case study in Part 1. This Part 2 focuses on regression algorithms. So, let us cut right to the chase.

Regression Algorithms

Regression problems are those that involve the prediction of continuous variables: for example, predicting the weight of students from given data, predicting the scores of students, predicting house prices, and so on. This is unlike classification problems, where we are concerned with grouping observations into discrete classes.

Again, there are several algorithms that can be used for regression problems. We will explore them in detail by using a practical case study.

Case Study: Big Mart Sales Prediction (Link)

Problem:

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store. Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

Approach:

It was basically the same as for the classification problem in Part 1. I made some hypotheses before starting out: I posited some factors that could influence the item outlet sales and decided to validate or ditch them during my EDA. I also wanted to understand the importance of feature scaling on various algorithms, so I prepared two datasets: a scaled one and an unscaled one. Thereafter, I compared different regression algorithms, with RMSE as the evaluation metric.
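As a quick refresher, RMSE is just the square root of the average squared difference between the predicted and actual sales. Here is a minimal sketch with made-up numbers (using scikit-learn, which is an assumption about the tooling):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only: y_true would be the actual item outlet sales,
# y_pred the model's predictions.
y_true = np.array([2097.3, 732.4, 2187.2])
y_pred = np.array([1900.0, 800.0, 2300.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
print(round(rmse, 2))
```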

Solution (Github Link):

I explored the data extensively, and this led me to discard some of my hypotheses. For instance, I posited that items with high visibility should have high outlet sales; however, the opposite was the case. I noticed several other patterns in the data, all of which are comprehensively explained in my notebook.

I filled in missing values based on insights from my EDA. I also handled inconsistent entries in the fat content column. I then created some new features: the first was the age of each outlet (2018 minus the establishment year), the second was whether or not the item is a consumable, and the third was whether the item is a food, a drink or a non-consumable.
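Below is a minimal sketch of these engineered features. The column names (Outlet_Establishment_Year, Item_Identifier), the Train.csv file name and the use of the identifier prefix are assumptions based on the public BigMart dataset, not the exact code from my notebook:

```python
import pandas as pd

df = pd.read_csv("Train.csv")  # hypothetical file name

# Feature 1: age of the outlet (2018 minus the establishment year).
df["Outlet_Age"] = 2018 - df["Outlet_Establishment_Year"]

# Feature 2: whether or not the item is a consumable at all. Here the "NC"
# prefix of Item_Identifier is assumed to mark non-consumables.
df["Is_Consumable"] = (df["Item_Identifier"].str[:2] != "NC").astype(int)

# Feature 3: food, drink or non-consumable, from the same identifier prefix.
df["Item_Category"] = df["Item_Identifier"].str[:2].map(
    {"FD": "Food", "DR": "Drink", "NC": "Non-Consumable"}
)
```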

I then converted the categorical variables to numerical ones. Also, the item visibility column had several zeros, which may have represented missing values, so I replaced them with the median item visibility of similar outlets and item identifiers. I then scaled the data and saved it for later comparison with the unscaled version.
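Here is a rough sketch of this preprocessing, continuing the assumptions above; the imputation and encoding choices are simplified stand-ins rather than the notebook's exact steps:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv("Train.csv")  # hypothetical file name, as above

# Treat zeros in Item_Visibility as missing and replace them with the median
# visibility of the same item identifier (a simplified version of the
# grouping described above).
df["Item_Visibility"] = df["Item_Visibility"].replace(0, np.nan)
df["Item_Visibility"] = df.groupby("Item_Identifier")["Item_Visibility"].transform(
    lambda s: s.fillna(s.median())
)

# Encode the categorical (object) columns as integers.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Simple median imputation for any remaining missing values (the notebook's
# EDA-driven imputation is more nuanced).
df = df.fillna(df.median())

# Keep both an unscaled and a scaled copy of the features for later comparison.
X = df.drop(columns=["Item_Outlet_Sales"])
y = df["Item_Outlet_Sales"]
X_scaled = StandardScaler().fit_transform(X)
```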

I then proceeded to modelling and evaluation, using the following algorithms (a minimal comparison sketch follows the list):

· Linear Regression: This is an algorithm that works by fitting a line to your data points. This line is the line of best fit, chosen so that the overall distance between the line and the data points is as small as possible. The greater the distance between the line and the data points, the greater the error. This is a very simple algorithm and works well when variables are linearly related. However, it suffers when there is co-linearity between variables. Linear regression is prone to overfitting, which is why regularization exists. It also struggles with a large number of input features and cannot capture non-linear relationships.

· Lasso Regression: This is an algorithm that is just like linear regression, but comes with regularization built in. LASSO stands for Least Absolute Shrinkage and Selection Operator. This algorithm penalizes the absolute size of the coefficients. This means it can offer automatic feature selection, as some coefficients can be shrunk all the way to zero, effectively dropping those features.

· Ridge Regression: Ridge regression also penalizes coefficients, but it does so by penalizing their squared size. Unlike Lasso, Ridge does not force coefficients to exactly zero.

· Gradient Boosting Regressor: Boosting is an ensemble method that trains several weak learners sequentially. Every successive weak learner focuses on learning from the mistakes of the previous learner. Thereafter, the boosting technique combines all the weak learners into one single strong learner.

· Random Forest Regressor: As described earlier, this algorithm combines several strong decision tree learners by bagging. Bagging is a method where several strong learners are trained in parallel on different samples of the data. These learners are then combined, and their individual flaws are smoothed out.

· Support Vector Regressor: This is the regression counterpart of the support vector machine described earlier. Instead of finding the hyperplane that best separates classes, SVR fits a function and tries to keep as many data points as possible within a margin (a tolerance band) around it; points that fall outside the margin contribute to the error. As with SVMs, only the data points near the margin (the support vectors) determine the final model.

· K Nearest Neighbours Regressor: For a data point to be predicted, this algorithm finds the most similar data points, or neighbours, and uses them to form the outcome variable; in the regression case, this may be the mean of the neighbours' output values. The similarity between data points is usually determined by Euclidean or Hamming distances. As a result, the scales of the features strongly influence these distances.
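To tie the list together, below is a minimal comparison sketch that fits each of these regressors on the unscaled and scaled feature sets (X, X_scaled and y from the preprocessing sketch above) and reports the cross-validated RMSE. The hyperparameters are placeholders, not the tuned values from my notebook:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# Placeholder hyperparameters; the notebook's tuned values will differ.
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(alpha=0.05),
    "Ridge": Ridge(alpha=0.05),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "SVR": SVR(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}

# Cross-validated RMSE on both feature sets (X and X_scaled come from the
# preprocessing sketch above).
for name, model in models.items():
    for label, features in [("unscaled", X), ("scaled", X_scaled)]:
        mse_scores = cross_val_score(model, features, y, cv=5,
                                     scoring="neg_mean_squared_error")
        rmse = np.sqrt(-mse_scores.mean())
        print(f"{name} ({label}): RMSE = {rmse:.0f}")
```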

Conclusion

After submitting the solutions to the Analytics Vidhya platform, I noticed the following:

The worst performing algorithm was KNN on the unscaled data (remember, KNN works best with scaled data because it is distance dependent). However, when used on the scaled data, the KNN algorithm performed well, reducing the RMSE from 2482 to 1246. The Lasso algorithm also performed better on the scaled data than on the unscaled one, with the RMSE dropping from 2401 to 1202. The same applied to Ridge regression, Random Forest, Linear regression and SVR, although the improvement for SVR was not as significant as for the rest: its RMSE reduced from 1814 to 1757.

The best performing algorithm was the Gradient Boosting Regressor. On the unscaled data, it had an RMSE of 1155; on the scaled data, it gave an RMSE of 1153. On the competition leaderboard, I am in position 540 out of over 16,000 registered competitors (top 3.2%).

PS: my notebook explains everything in detail; do check it out.
