ML 101: Linear Regression Tutorial

Amar Budhiraja
5 min read · Jul 19, 2017


Before we dive into the actual technique of linear regression, let’s build some intuition for it.

Let’s say, I give you the following puzzle:

Given the following pairs of X and Y, what is the value of Y when X = 5? (1,1), (2,2), (4,4), (100,100), (20, 20)

The answer is: 5. Not very difficult, right?

Now, let’s take a look at a different example. Say you have the following pairs of X and Y. Can you calculate the value of Y when X = 5?
(1,1), (2,4), (4,16), (100,10000), (20, 400)

The answer is: 25. Was it difficult?

Let’s understand what happened in the above examples. In the first example, after looking at the given pairs, one can establish that the relationship between X and Y is Y = X. Similarly, in the second example, the relationship is Y = X*X.
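These two relationships are easy to verify in code. The snippet below hard-codes the pairs from the puzzles above, checks each guessed rule against every pair, and then uses it to predict Y at X = 5:

```python
# Pairs from the two puzzles above
pairs_1 = [(1, 1), (2, 2), (4, 4), (100, 100), (20, 20)]
pairs_2 = [(1, 1), (2, 4), (4, 16), (100, 10000), (20, 400)]

# Check the guessed relationships against every given pair
assert all(y == x for x, y in pairs_1)      # puzzle 1: Y = X
assert all(y == x * x for x, y in pairs_2)  # puzzle 2: Y = X*X

# Use the relationships to predict Y when X = 5
print(5)      # puzzle 1 -> 5
print(5 * 5)  # puzzle 2 -> 25
```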

In these two examples, we could determine the relationship between the two given variables (X and Y) because the pattern behind the pairs was easy to spot. Machine learning works in much the same way.

Your computer looks at some examples and then tries to identify “the most suitable” relationship between the sets X and Y. Using this identified relationship, it will try to predict Y for new examples where Y is unknown.

Keeping the above idea in mind, let me try to explain what linear regression is.

Regression usually refers to determining the relationship(s) between two or more variables. For example, in the above two examples, X and Y are the variables: X is termed the independent variable and Y the dependent variable. Y has a continuous range (unlike classification, where Y is discrete).

Now, let’s dig a little deeper into the details of regression. The rest of the post will flow as follows:

  1. Simple linear regression.
  2. Explaining how to define “the most suitable relationship”.
  3. Multiple linear regression.
  4. Basic code example.

Simple Linear Regression

Simple Linear Regression (SLR) is termed simple because there is only one independent variable.

For example, suppose you only have dates and the stock prices of a company; you can fit a regression model with date (as X) and stock price (as Y).

The model would look something like this:
Price = m*Date + c

The equation resembles that of a line with slope ‘m’ and y-intercept ‘c’.

This is the essence of SLR: given an independent and a dependent variable, we fit the equation of a line to perform predictions on unseen data.

Note: Date is treated as an integer. It is considered in the following manner: the day you started is considered 0, the next day is 1, and so on.
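As a concrete sketch of this conversion (the start date here is hypothetical, not from the post):

```python
from datetime import date

start = date(2017, 7, 1)  # hypothetical first day in our data
observed = [date(2017, 7, 1), date(2017, 7, 2), date(2017, 7, 5)]

# Each date becomes the number of days since the start: 0, 1, 4, ...
X = [(d - start).days for d in observed]
print(X)  # [0, 1, 4]
```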

Explaining how to define “the most suitable relationship”.

We now know what SLR is, but how do we find the values of ‘m’ and ‘c’? There are infinitely many choices for ‘m’ and ‘c’, so which values are the most suitable?

The answer is quite intuitive. Given some X and Y (for example, dates and stock prices), the most suitable values of ‘m’ and ‘c’ are the ones that produce the least error across all given X and Y.

This quantity is defined as the error. For example, given a relationship Y = m*X + c, predict the value Y′ for every X you have seen. Then take the sum of the absolute differences between Y′ and Y; the values of ‘m’ and ‘c’ with the least sum are the most suitable.
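A minimal sketch of this error computation, using data from the first puzzle (the helper name is mine):

```python
def total_abs_error(m, c, xs, ys):
    """Sum of |Y' - Y|, where Y' = m*x + c is the prediction."""
    return sum(abs((m * x + c) - y) for x, y in zip(xs, ys))

# Data following Y = X exactly, as in the first puzzle
xs = [1, 2, 4, 100, 20]
ys = [1, 2, 4, 100, 20]

print(total_abs_error(1, 0, xs, ys))  # 0   -> perfect fit
print(total_abs_error(2, 0, xs, ys))  # 127 -> a worse choice of m
```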

But the question still remains: how many candidate values can you try? There are infinitely many values of ‘m’ and infinitely many values of ‘c’. The answer to this question is the Gradient Descent algorithm. Gradient Descent is beyond the scope of this post, but I will gently introduce it.

Gradient Descent is an algorithm through which we can get the ‘most suitable values’ of ‘m’ and ‘c’, using the fact that the best ‘m’ and ‘c’ will produce the least error. The basic idea of gradient descent is that you update ‘m’ and ‘c’ as a function of the error:

m(t+1) = f(m(t), error(t))
c(t+1) = f(c(t), error(t))
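The update rules above can be sketched concretely. This sketch uses the mean squared error rather than the absolute error discussed earlier (squared error is differentiable, which makes the update rule simple), and small data so that a fixed learning rate converges without feature scaling; both choices are mine, not from the post:

```python
def fit_slr(xs, ys, lr=0.01, steps=5000):
    """Find m and c by repeatedly nudging them against the error gradient."""
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to m and c
        grad_m = sum(2 * ((m * x + c) - y) * x for x, y in zip(xs, ys)) / n
        grad_c = sum(2 * ((m * x + c) - y) for x, y in zip(xs, ys)) / n
        m -= lr * grad_m  # m(t+1) is a function of m(t) and the error
        c -= lr * grad_c  # likewise for c(t+1)
    return m, c

m, c = fit_slr([1, 2, 4], [1, 2, 4])
print(m, c)  # m close to 1, c close to 0, recovering Y = X
```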

For more details on Gradient Descent please refer to any online tutorial on the same.

Multiple Linear Regression

Multiple Linear Regression (MLR) refers to defining a relationship between independent and dependent variables when there is more than one independent variable to be considered.

For example, let’s say we have to predict stock prices again. We saw in the previous section that we can create a model with date as the independent variable and stock price as the dependent variable.

Now, let’s consider one more aspect on which the stock price will depend : the stock price of the previous day.

So, our regression function will now look like:
Price(t+1) = a1 * Price(t) + a2 * Date(t+1) + c

Makes sense?

In the above equation, we have assumed that Price(t) and Date(t+1) are independent of each other.

Let’s try to understand what this equation is trying to say by taking some examples:

  1. a1 = 0, a2 = 1.5, c = 1: In this case, the equation is saying that Price(t+1) does not depend on the stock price of the previous day but depends only on the date, and positively, i.e. the price increases as the date increases.
  2. a1 = 1.5, a2 = 0, c = 1: In this case, the equation is saying that Price(t+1) depends only on the price of the previous day, and in an increasing fashion.
  3. a1 = -1, a2 = 2, c = 1: In this case, Price(t+1) is dependent on both, but it decreases as Price(t) increases and increases with the Date.
  4. a1 = <some number>, a2 = <some number>, c = 0: In this case, we notice that c = 0 and a1 and a2 could be any numbers. This simply means that when Price(t) = 0 and Date(t+1) = 0, Price(t+1) is also 0.
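The cases above are easy to check numerically. Here is a tiny helper (the function name and the sample numbers are mine) that evaluates the equation:

```python
def price_next(a1, a2, c, price_t, date_next):
    """Price(t+1) = a1 * Price(t) + a2 * Date(t+1) + c"""
    return a1 * price_t + a2 * date_next + c

# Case 1: yesterday's price is ignored; the price grows with the date
print(price_next(0, 1.5, 1, price_t=50, date_next=10))  # 16.0
# Case 3: a higher Price(t) pushes Price(t+1) down, a later date pushes it up
print(price_next(-1, 2, 1, price_t=10, date_next=7))    # 5
```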

The above model can be generalized to take ‘n’ independent variables into consideration. In that case, the equation would look like:

Y = a1*X1 + a2*X2 + … + an*Xn
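Here is a sketch of MLR in scikit-learn with two independent variables, standing in for Price(t) and Date(t+1). The numbers are made up to follow Price(t+1) = 0.5*Price(t) + 2*Date(t+1) + 1 exactly, so the fit should recover those coefficients:

```python
from sklearn import linear_model

# Each row is [Price(t), Date(t+1)]; made-up data following
# Price(t+1) = 0.5 * Price(t) + 2 * Date(t+1) + 1 exactly
X = [[10, 1], [8, 2], [9, 3], [11.5, 4]]
Y = [8, 9, 11.5, 14.75]

reg = linear_model.LinearRegression()
reg.fit(X, Y)
print(reg.coef_)       # close to [0.5, 2.0]
print(reg.intercept_)  # close to 1.0
```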

Now, let’s look at some Python code for linear regression. Let’s consider an example like the first one: (1,1), (2,2), (3,3), (100, 100), (20, 20)

from sklearn import linear_model

reg = linear_model.LinearRegression()
X = [[1], [2], [3], [100], [20]]
Y = [1, 2, 3, 100, 20]
reg.fit(X, Y)
print(reg.coef_)

The output: [1.]

Note that the fitted equation is Y = m*X + c; m is calculated as 1, and the intercept c comes out as 0.
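With the model fitted, answering the opening puzzle (what is Y at X = 5?) is one more call; this is a sketch of the standard `predict` usage:

```python
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit([[1], [2], [3], [100], [20]], [1, 2, 3, 100, 20])

# The learned line is Y = 1*X + 0, so X = 5 predicts 5
print(reg.predict([[5]]))  # close to [5.]
```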

What to do next? If you know Python, pick a dataset and try to do regression; it could be predicting house prices, or stock prices, or even time spent reading a Medium post ;)

Also, try using regression on this Kaggle competition: I bet you will be astounded by what a basic regression technique can achieve.
