
Regression in Machine Learning

Apoorva Dave
Dec 4, 2018 · 6 min read

Regression models are used to predict a continuous value. Predicting the price of a house from its features, such as size and location, is one of the common examples of regression. It is a supervised learning technique. A detailed explanation of the types of machine learning and some important concepts is given in my previous article.

Types of Regression

  1. Simple Linear Regression
  2. Polynomial Regression
  3. Support Vector Regression
  4. Decision Tree Regression
  5. Random Forest Regression

Simple Linear Regression

This is one of the most common and interesting types of regression technique. Here we predict a target variable Y based on an input variable X. A linear relationship should exist between the target variable and the predictor, and hence the name Linear Regression.

Consider predicting the salary of an employee based on his/her age. We can easily identify that there seems to be a correlation between an employee’s age and salary (the higher the age, the higher the salary). The hypothesis of linear regression is

Y = a + bX

Y represents salary, X is the employee’s age, and a and b are the coefficients of the equation. So in order to predict Y (salary) given X (age), we need to know the values of a and b (the model’s coefficients).

[Figure: red data points with the fitted blue regression line]

While training and building a regression model, it is these coefficients that are learned and fitted to the training data. The aim of training is to find the best-fit line, the one for which the cost function is minimized. The cost function measures the error between the actual and predicted values; during training we try to minimize this error and thus minimize the cost function.

In the figure, the red points are the data points and the blue line is the predicted line for the training data. To get a predicted value, a data point is projected onto the line.
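
To make this concrete, here is a minimal sketch of fitting such a line with scikit-learn’s LinearRegression; the age and salary values below are made up purely for illustration.

```python
# A minimal sketch of simple linear regression with scikit-learn.
# The age/salary values are made up purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

ages = np.array([[22], [25], [30], [35], [40], [45]])            # X: feature (age)
salaries = np.array([25000, 32000, 40000, 52000, 61000, 70000])  # Y: target

model = LinearRegression()
model.fit(ages, salaries)

a = model.intercept_   # the "a" in Y = a + bX
b = model.coef_[0]     # the "b" in Y = a + bX
print(f"Y = {a:.2f} + {b:.2f} * X")
print(model.predict([[28]]))   # predicted salary for a 28-year-old
```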

To summarize, our aim is to find the values of the coefficients that minimize the cost function. The most common cost function is Mean Squared Error (MSE), the average squared difference between an observation’s actual and predicted values:

MSE = (1/n) Σ (yᵢ − ŷᵢ)²

The coefficient values can be calculated using the gradient descent approach, which will be discussed in detail in later articles. To give a brief understanding: in gradient descent we start with some random values of the coefficients, compute the gradient of the cost function at these values, update the coefficients, and recompute the cost function. This process is repeated until we reach a minimum of the cost function.
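
To give a flavour of how this works in code, below is a toy gradient-descent loop for the MSE cost. The data, learning rate, and iteration count are arbitrary illustrative choices, and the feature is standardized first so that a simple fixed learning rate converges.

```python
# A toy gradient-descent sketch for the MSE cost of Y = a + b*X.
# Data, learning rate, and iteration count are made up for illustration.
import numpy as np

X = np.array([22, 25, 30, 35, 40, 45], dtype=float)   # ages
Y = np.array([25, 32, 40, 52, 61, 70], dtype=float)   # salaries (in thousands)

Xs = (X - X.mean()) / X.std()    # standardize the feature for stable steps

a, b = 0.0, 0.0                  # start from arbitrary coefficient values
lr = 0.1                         # learning rate
for _ in range(1000):
    error = (a + b * Xs) - Y
    # Gradients of MSE = mean(error^2) with respect to a and b
    a -= lr * 2 * error.mean()
    b -= lr * 2 * (error * Xs).mean()

print(f"fit (on standardized X): Y = {a:.2f} + {b:.2f} * Xs")
print("MSE:", (((a + b * Xs) - Y) ** 2).mean())
```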

Polynomial Regression

In polynomial regression, we transform the original features into polynomial features of a given degree and then apply linear regression to them. Consider the linear model above, Y = a + bX, transformed into something like

Y = a + bX + cX²

It is still a linear model (linear in its coefficients), but the fitted curve is now quadratic rather than a straight line. Scikit-Learn provides the PolynomialFeatures class to transform the features.

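A minimal sketch of this transform-then-fit pipeline, on made-up quadratic data:

```python
# Polynomial regression = polynomial feature expansion + linear regression.
# The quadratic data below is made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + X.ravel() + 2 + np.random.randn(50) * 0.3

# degree=2 expands each x into [1, x, x^2]; LinearRegression then fits
# coefficients for the expanded features, so the model stays linear in them.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))
```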

If we increase the degree to a very high value, the model overfits, as it learns the noise in the data as well.

Support Vector Regression

In SVR, we identify a hyperplane with maximum margin such that the maximum number of data points lie within that margin. SVR is very similar to the SVM classification algorithm, which we will discuss in detail in my next article.

Instead of minimizing the error rate as in simple linear regression, we try to fit the errors within a certain threshold. The objective in SVR is thus to consider only the points that fall within the margin; the best-fit line is the hyperplane that contains the maximum number of points within it.

[Figure: SVR hyperplane with margin boundary lines. Data points within the boundary line]
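
A minimal SVR sketch with scikit-learn on made-up data; the kernel, C, and epsilon values are illustrative defaults, not tuned choices, with epsilon controlling the width of the margin described above.

```python
# A minimal SVR sketch with scikit-learn on made-up data.
import numpy as np
from sklearn.svm import SVR

X = np.sort(np.random.rand(40, 1) * 5, axis=0)
y = np.sin(X).ravel() + np.random.randn(40) * 0.1

# epsilon sets the width of the margin ("tube"): errors smaller than
# epsilon are ignored, which is the threshold described above.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)
print(model.predict([[2.5]]))
```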

Decision Tree Regression

Decision trees can be used for classification as well as regression. In a decision tree, at each level we need to identify the splitting attribute. For regression, an ID3-style algorithm chooses the splitting attribute by maximizing the reduction in standard deviation (in classification, information gain is used instead).

A decision tree is built by partitioning the data into subsets containing instances with similar (homogeneous) values. Standard deviation is used to measure the homogeneity of a numerical sample: if the sample is completely homogeneous, its standard deviation is zero.

The steps for finding the splitting attribute are briefly described below:

1. Calculate the standard deviation of the target variable using the formula below:

S = √( Σ(x − x̄)² / n )

where x̄ is the mean of the target values and n is the number of instances.

2. Split the dataset on each candidate attribute and calculate the standard deviation for each resulting branch. The weighted sum of these branch standard deviations is subtracted from the standard deviation before the split; the result is the standard deviation reduction (a small sketch of this computation is given after these steps).

SDR(T, X) = S(T) − S(T, X), where S(T, X) = Σ P(c) × S(c)

Here the sum runs over the branches c produced by splitting on attribute X, P(c) is the fraction of instances falling into branch c, and S(c) is the standard deviation of the target within that branch.

3. The attribute with the largest standard deviation reduction is chosen as the splitting attribute.

4. The dataset is divided based on the values of the selected attribute. This process is run recursively on the non-leaf branches until all data is processed.
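
Here is the small sketch of the standard deviation reduction computation from step 2, on a made-up toy dataset:

```python
# Standard deviation reduction (SDR) for one candidate split attribute.
# The toy dataset is made up for illustration.
import numpy as np

target = np.array([25, 30, 46, 45, 52, 23, 43, 35, 38, 46], dtype=float)
attribute = np.array(["sunny", "sunny", "overcast", "rain", "rain",
                      "sunny", "overcast", "rain", "sunny", "rain"])

s_before = target.std()   # S(T); numpy's default is the population std

s_after = 0.0
for value in np.unique(attribute):
    subset = target[attribute == value]
    s_after += len(subset) / len(target) * subset.std()   # P(c) * S(c)

sdr = s_before - s_after   # SDR(T, X) = S(T) - S(T, X)
print(f"SDR = {sdr:.3f}")
```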

To avoid overfitting, the coefficient of variation (CV) is used to decide when to stop branching. Finally, the average of the instances in each branch is assigned to the related leaf node (in regression the mean is taken, whereas in classification the mode of the leaf node is taken).
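
And a minimal DecisionTreeRegressor sketch with scikit-learn; the data is made up, and max_depth is an illustrative cap to limit overfitting, not a tuned value.

```python
# A minimal DecisionTreeRegressor sketch with scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(100, 2) * 10
y = X[:, 0] * 3 + X[:, 1] ** 2 + np.random.randn(100)

model = DecisionTreeRegressor(max_depth=4)  # depth cap to limit overfitting
model.fit(X, y)
print(model.predict([[5.0, 2.0]]))  # leaf prediction = mean of its training targets
```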

Random Forest Regression

Random forest is an ensemble approach in which we combine the predictions of several decision regression trees.

  1. Select K random data points from the training set.
  2. Build a decision tree regressor on these data points. Repeat steps 1 and 2 n times, where n is the number of trees to be created.
  3. Within each tree, the average of the training instances in each branch is assigned to its leaf node.
  4. To predict the output for a new instance, the predictions of all n trees are averaged.

Random Forest counters the overfitting that is common in single decision trees by creating random subsets of the features, building smaller trees from these subsets, and averaging their predictions.
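
A minimal RandomForestRegressor sketch on made-up data; n_estimators corresponds to the n in the steps above, and 100 is an illustrative choice rather than a tuned value.

```python
# A minimal RandomForestRegressor sketch with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(200, 3) * 10
y = X[:, 0] * 2 - X[:, 1] + X[:, 2] ** 2 + np.random.randn(200)

# Each tree is trained on a bootstrap sample of the rows, and each split
# considers a random subset of the features; predictions are averaged.
model = RandomForestRegressor(n_estimators=100, max_features="sqrt")
model.fit(X, y)
print(model.predict([[5.0, 3.0, 1.0]]))
```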

The above is a brief overview of each regression type. You might have to dig deeper into each of them to get a clear understanding :)


My next article will give an overview of different classification algorithms. Stay tuned :)

Till then happy learning!!
