Linear Regression : A Detailed View of Machine Learning Algorithm

Neha Kushwaha · Published in Analytics Vidhya · Aug 7, 2020 · 11 min read

If you have started delving into the world of Machine Learning, Linear Regression is one of the first algorithms you will come across, and it is often the starting point for solving any regression problem. The bigger question is: are you using it without knowing how it actually works? If so, this article is for you. My aim is to cover the theory in full, because before we jump into building a linear regression model on our own, it is important to understand its working principle. It is very often assumed to be the simplest model in Machine Learning, but let's see if it really is!!

From Pexels by Darwis

What to seek in this Article:

  • What is simple linear regression, and which terms you must know to understand it.
  • The mathematical machinery behind LR, and what the terms cost function and Gradient Descent mean.
  • How to optimize a line using the gradient descent approach.
  • And finally, how to decide when to stop the search for the best fit line.

Before proceeding to the theory and the math, let me give a walk-through of what we are trying to solve here and where it can be applied.

Fig 1 : Flow chart of LR model

The idea here is to find a relationship between a dependent/target variable (y) and one or more independent/predictor variables (x) on the training data set. We apply a learning algorithm, in our case linear regression, which uses a linear function as its hypothesis. To solve for the parameters of this hypothesis equation, we minimize the cost function (also known as the squared error function for regression problems). These parameters are found with the help of the gradient descent approach, and the final equation obtained from this step is used for future prediction.

Some of it’s applications can be predicting the sales of certain item based on it’s price, seasons or other parameters. It can also be used in the time series analysis for forecasting the result of the data to predict the future.

Knowing the theory

Let's start with some term familiarization. The phrase Linear Regression itself contains three related terms: Line, Linear and Regression.

A line is the shortest distance between two points, whereas linear refers to a set of points lying on a straight line in a plane, mapping continuous values. Regression is a predictive modelling technique which investigates the relationship or correlation between dependent and independent variables.

What is an independent variable?

It's exactly what it sounds like. It is a variable, often referred to as the predictor variable, which stands alone and isn't changed by the other variables you are trying to measure. For example, someone's age might be an independent variable. Other factors (such as what they eat, how much they go to school, how much television they watch) aren't going to change a person's age. In fact, when you are looking for some kind of relationship between variables, you are trying to see if the independent variable causes some kind of change in the other variables, the dependent variables.

What’s a dependent variable?

Just like an independent variable, a dependent variable is exactly what it sounds like. It depends on other factors. For example, a test score could be a dependent variable because it could change depending on several factors, such as how much you studied, how much sleep you got the night before you took the test, or even how hungry you were when you took it. Usually, when you are looking for a relationship between two things, you are trying to find out what makes the dependent variable change the way it does.

If you are having trouble remembering which is the independent variable and which is the dependent variable, an easy way is to insert the names of the two variables into the following sentence in the way that makes the most sense. Then you can figure out which is the independent variable and which is the dependent variable:

(Independent variable) causes a change in (Dependent Variable) and it isn’t possible that (Dependent Variable) could cause a change in (Independent Variable).

(Time Spent Studying) causes a change in (Test Score) and it isn't possible that (Test Score) could cause a change in (Time Spent Studying). We see that "Time Spent Studying" must be the independent variable and "Test Score" must be the dependent variable, because the sentence doesn't make sense the other way around.

Knowing the Equation(s)

Since we can define the target variable and have labels for it, this problem falls in the realm of supervised learning, where the data is already labeled.

Fig 2: The Equation of line

So, the relationship in linear regression is best defined by the equation of a straight line, which is also the hypothesis of linear regression and is known to most of us from high school math. It is given by:

                       y = m*x + c

In machine learning convention you will often see it written as:

h(x) = w0 + w1*x1   or   hθ(x) = θ0 + θ1*x1

Where:
1. y or h(x) = the dependent or target variable
2. m or w1 or θ1 = the gradient or slope
3. x or x1 = the independent or predictor variable
4. c or w0 or θ0 = the intercept on the y-axis

This equation relates one dependent and one independent variable, which is known as simple linear regression. What about multiple independent variables? That case is handled by multiple linear regression, and can be summarized as:

                  hθ(x) = θ0 + θ1*x1 + θ2*x2 + … + θn*xn
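To make the hypothesis concrete, here is a minimal sketch in Python (NumPy assumed; the function name and the numbers are my own, chosen purely for illustration):

import numpy as np

def hypothesis(theta, x):
    """Linear hypothesis h(x) = theta0 + theta1*x1 + ... + thetan*xn.

    theta : array of shape (n+1,) -- intercept first, then one weight per feature
    x     : array of shape (n,)   -- one observation's feature values
    """
    return theta[0] + np.dot(theta[1:], x)

# Simple linear regression: one feature, h(x) = 2 + 0.5*x
print(hypothesis(np.array([2.0, 0.5]), np.array([10.0])))                      # 7.0

# Multiple linear regression: three features
print(hypothesis(np.array([1.0, 0.2, -0.3, 0.7]), np.array([4.0, 2.0, 1.0])))  # 1.9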

What is Linear Regression?

We already saw the definition of regression in the earlier section, which says:

It attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of independent predictor variables.

Simple linear regression is a statistical method that allows us to summarize and study the relationship between two real-valued variables. In simple words, we find a line that describes the linear function which best represents the given data, sometimes referred to as the "best fit line".

Fig 3: An example for the best fit line

Based on the given data points, we try to find a line which describes the data in the best way. The red line in the graph above does that job for us and is referred to as the best fit line.

Let's define the above best fit line mathematically, with the help of an equation:

                  hθ(x)= θ0 + θ1* x 
(Assuming the data set has only one independent variable)

where θ0 and θ1 are real numbers that describe the best fit line. How do we find these numbers, and when can we say the line is the best one? We start with some initial guess for them and then measure the error between the predicted and actual data points. This is done by minimizing the cost function, and the guessed numbers are then updated using the gradient descent technique until we reach the optimum solution. To understand what this really means, let's look at these two important concepts.

Cost function

Here, the basic idea is to reduce the error/loss between the predicted output and actual output. In ML, cost functions are used to estimate how badly models are performing. Put simply, a cost function is a measure of how wrong the model is in terms of its ability to estimate the relationship between X and y.

Let's study this with an example. For LR, we take a training set where, based on the area of a house, we want to predict its cost:

Fig 4: A dummy data to illustrate Cost function

The form of hypothesis we use for prediction is: hθ(x) = θ0 + θ1*x. Here the θi's are called the parameters or weights of the model, and we are going to see how to choose them. With different choices of the θi's, we get a different hypothesis function, e.g.:

Fig 5:The changes in hypothesis

So, the idea here is to choose θ0 and θ1 so that hθ(x) is close to y for our training data set (x, y).

Fig 6 : Cost function intuition

We are going to minimize the cost function over θ0 and θ1 by minimizing the squared difference between hθ(x) and y over the data set of size m:

            J(θ0, θ1) = (1/2m)Σ(hθ(x(i)) - y(i))^2   from i = 1 to m

The factor 1/2m is there to keep the calculation simple, since we are dealing with a mean term. This cost function is also termed the mean squared error (MSE).
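As a minimal sketch (NumPy assumed; the function and variable names here are mine, for illustration), the same cost can be written in a few lines:

import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(y)
    predictions = theta0 + theta1 * x      # h_theta(x) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)

Lower values of this cost mean the line sits closer to the training points.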

Fig 7: Hypothesis vs cost function plot

To simplify the cost function, assume a simpler version of the hypothesis where θ0 is zero, so the equations become:

Hypothesis    : hθ(x) = θ1*x
Cost function : J(θ1) = (1/2m)Σ(hθ(x(i)) - y(i))^2   from i = 1 to m

Assume m = 3 with the training points (1, 1), (2, 2) and (3, 3).

Scenario 1: θ1 = 1 (red line in Fig 7)
For θ1 = 1, hθ(x) = x, so every predicted point lies exactly on the actual point, hθ(x(i)) = y(i).
J(θ1) = (1/(2*3))(0^2 + 0^2 + 0^2) = 0

Scenario 2: θ1 = 0.5 (yellow line in Fig 7)
For θ1 = 0.5, hθ(x) = 0.5x, so the predictions are 0.5, 1 and 1.5.
J(θ1) = (1/(2*3))((0.5-1)^2 + (1-2)^2 + (1.5-3)^2) ≈ 0.58

Similarly you can plot for other values of θ1 and arrive at a graph as shown above.
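As a quick sketch (assuming the same toy data as in the two scenarios; the names are mine, for illustration), you can reproduce both values of J(θ1) and sweep other values in a few lines of Python:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(y)

def J(theta1):
    """Cost for the simplified hypothesis h(x) = theta1 * x."""
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

print(J(1.0))   # 0.0   -> Scenario 1: the line passes through every point
print(J(0.5))   # ~0.58 -> Scenario 2

# Sweeping theta1 traces out the bowl-shaped curve of Fig 7
for t in np.linspace(0.0, 2.0, 9):
    print(round(t, 2), round(J(t), 3))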

The above example makes the intuition behind the cost function clear: we try to determine the θ values and, in turn, figure out the best hypothesis, which will be the line closest to the actual data.

Fig 8 : An illustration to find best fit line by hackernoon

To determine the best approximation of a line, we now know the model needs to minimize the cost function. You may naturally wonder how the cost function is minimized. Enter a technique called gradient descent, which we will see in detail in the next section.

Gradient descent

In theory, gradient descent is an efficient optimization algorithm that attempts to find a local or global minimum of a function. It enables a model to learn the gradient, or direction, that the model should take in order to reduce errors (differences between the actual and predicted target variable).

We now know that, for the simple linear regression example in the section above, the model parameters θ0 and θ1 should be tweaked to minimize the cost function. The property that the derivative of a function is zero at its maxima and minima helps us tune the model parameters (a graph illustration is shown below). The moment the derivative of the cost function approaches zero, we have minimized the model; otherwise we keep updating the model parameters until we reach the optimum point.

Fig 9 : Differential equation property illustration

Just to make our life a bit simpler, we make use of partial derivatives, since the cost function involves multiple parameters. Let's see how we achieve it mathematically.

Fig 10 : Equation of Gradient
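For readers without the figure, the gradient of the squared-error cost works out to the following partial derivatives (standard results, written in the same notation as earlier):

∂J/∂θ0 = (1/m) Σ (hθ(x(i)) - y(i))           from i = 1 to m
∂J/∂θ1 = (1/m) Σ (hθ(x(i)) - y(i)) * x(i)    from i = 1 to m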

This partial differentiation over θ0 and θ1 gives us the gradient, and with it we arrive at the most important and final equation:

                    θj = θj - α * Gradient     (applied to each parameter θj, here θ0 and θ1)
α --> Learning rate or dampening factor

I know it's a lot of math, but this is the last and simplest bit, added to answer a few questions that might arise from the equation above.

Why the negative sign? In figure 9 we saw maxima and minima points, but here we are interested in the minimum point, which can only be reached by descending down the slope, hence the negative sign.

What is the need for a learning rate? Without a learning rate, the movement of the point along the slope would be uncertain, as shown in figure 11 with the red lines jumping around randomly and never reaching convergence. To overcome this, we introduce the learning rate, also known as the dampening factor, which helps us take small steps along the curve from the starting point, slowly leading to convergence.

A tip: keep the learning rate small so you don't miss the convergence, or optimum, point of the model. If the size of the step taken on each gradient update is too big, the model might overshoot the minimum of the function; if it is too small, the model will take a long time to converge.

Fig 11 : The effect of learning rate
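To see that tip in action, here is a small sketch on the same toy data as before, with the single parameter θ1 (the data, step counts and learning rates are my own illustrative choices):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(y)

def run(alpha, steps=10):
    """A few gradient descent steps on h(x) = theta1*x, starting from theta1 = 0."""
    theta1 = 0.0
    for _ in range(steps):
        gradient = np.sum((theta1 * x - y) * x) / m   # dJ/d(theta1)
        theta1 -= alpha * gradient
    return theta1

print(run(alpha=0.1))   # moves steadily toward the optimum theta1 = 1
print(run(alpha=0.5))   # overshoots back and forth and diverges away from 1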

Gradient descent, therefore, enables the learning process to make corrective updates to learn the estimates that move the model toward an optimal combination of parameters.

To summarize, follow the steps below to reach the optimum solution:

  1. Initialize the parameters of the hypothesis with random real numbers. In our case these are θ0 and θ1.
  2. Initialize the learning rate (α); a small value such as 0.01 is preferred.
  3. Define n_iterations, the number of times you want to run the weight-update loop.
  4. Choose a stopping point, let's say 0.00001. If the model reaches this stopping condition, or the end of n_iterations, we stop updating the θ(i) values.
  5. Calculate the cost and gradients and update all model parameter weights using the equation:
                  θ(i) = θ(i) - α * Gradient
     where 'i' runs over the weights of the model.
  6. Keep updating the weight values until you reach the stopping condition (mentioned in step 4), which gives us the optimum solution and thus reduces the gap between the actual and predicted target.
  7. Sometimes it takes too long to reach the optimum solution; in such cases, reinitialize the weights as in step 1 and repeat all the steps until you reach the optimum solution.
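Putting all seven steps together, here is a minimal sketch in Python (NumPy only; the toy data, variable names and tolerance are my own choices for illustration, not a reference implementation):

import numpy as np

def gradient_descent(x, y, alpha=0.01, n_iterations=10000, stopping_point=0.00001):
    """Fit h(x) = theta0 + theta1*x with batch gradient descent (steps 1-7 above)."""
    m = len(y)
    theta0, theta1 = np.random.randn(), np.random.randn()   # step 1: random initialization
    prev_cost = float("inf")
    for _ in range(n_iterations):                           # step 3: iteration budget
        error = theta0 + theta1 * x - y                     # h(x) - y for all examples
        cost = np.sum(error ** 2) / (2 * m)                 # the MSE cost from earlier
        grad0 = np.sum(error) / m                           # dJ/d(theta0)
        grad1 = np.sum(error * x) / m                       # dJ/d(theta1)
        theta0 -= alpha * grad0                             # step 5: theta = theta - alpha * gradient
        theta1 -= alpha * grad1
        if abs(prev_cost - cost) < stopping_point:          # steps 4 and 6: stopping condition
            break
        prev_cost = cost
    return theta0, theta1

# Toy data that roughly follows y = 2 + 3x
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 0.5, size=50)
print(gradient_descent(x, y))   # should land somewhere near (2, 3)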

Once we reach the optimum solution of the hypothesis, hurray! Our machine has learned!! We have achieved the best fit line and our model is ready to predict.

Summary

The whole intuition behind the Linear Regression algorithm is now known to us. It is an algorithm that every Machine Learning enthusiast must know. With this journey, I hope you will be able to implement your first linear regression model knowing how it works, which will also come in very handy when it comes to hyper-parameter tuning. Let's meet again with a Python implementation of Linear Regression in an upcoming article. Stay tuned!! Keep learning!! and Stay safe!! :)
