Linear Regression With Gradient Descent Derivation

Ashwin Prasad
Published in Analytics Vidhya
Sep 16, 2020

Pre-Requisites

The only pre-requisites are differentiation and matrix multiplication.

What is Linear Regression ?

Simple linear regression models the linear relationship between a dependent variable and an independent variable, and the fitted model can later be used to predict dependent variable values for new values of the independent variable.
For this, we use the equation of a line: y = m * x + c,

where y is the dependent variable and x is the independent variable.

For example, we could predict a person's salary from their years of experience.
Here, salary is the dependent variable and experience is the independent variable, since we are predicting salary with the help of experience.

When to use Linear Regression: Linear regression can be performed on data where there is a good linear relationship between dependent and independent variables.
The degree of linear relationship can be found with the help of correlation (for example, the Pearson correlation coefficient).
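
As a quick sketch of that check, the Pearson correlation coefficient can be computed with NumPy (the data below is made up purely for illustration):

import numpy as np

# hypothetical data: years of experience (x) and salary (y)
experience = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([30000.0, 35000.0, 41000.0, 44000.0, 50000.0])

# a value close to +1 or -1 suggests a strong linear relationship
r = np.corrcoef(experience, salary)[0, 1]
print(r)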

Gradient Descent

Gradient Descent is an optimization algorithm that can be used to find the global or local minima of a differentiable function.

Various Assumptions and definitions before we begin.

Let’s say we have an independent variable x and a dependent variable y.
In order to model the relationship between these 2 variables, we have the equation:
y = x * w + b
where w is the weight (or slope),
b is the bias (or intercept),
x is the independent variable column vector (examples),
y is the dependent variable column vector (examples).
Our main goal is to find the w and b that correctly define the relationship between the variables x and y. We accomplish this with the help of something called the Loss Function.

Loss Function: A loss function is a function that signifies how much our predicted values deviate from the actual values of the dependent variable.

Important Note: we are trying to find the values of w and b that minimize our loss function.

Steps Involved in Linear Regression with Gradient Descent Implementation

  1. Initialize the weight and bias randomly or with 0 (both will work).
  2. Make predictions with this initial weight and bias.
  3. Compare these predicted values with the actual values and define the loss function using both these predicted and actual values.
  4. With the help of differentiation, calculate how loss function changes with respect to weight and bias term.
  5. Update the weight and bias term so as to minimize the loss function.

Implementation with Math

Example of independent and dependent variables, respectively

1. Assumption

Let us say we have x and y vectors like the ones shown in the picture above (the above is only an example).

2. Initialize w and b to 0

w = 0, b = 0

3. Make some predictions with the current w and b. Of course, they are going to be wrong.

y_pred = x*w + b, where y_pred stands for predicted y values.
This y_pred will also be a vector like y.

4. Define a loss function

loss = (y_pred - y)²/n
where n is the number of examples in the dataset.
It is obvious that this loss function represents the deviation of the predicted values from the actual ones.
This loss function will also be a vector. But we will sum all the elements of that vector to convert it into a scalar.
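
As a rough sketch in NumPy (the x and y vectors here are made up only for illustration), steps 2 to 4 look like this:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up independent variable
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # made-up dependent variable
n = len(x)

w, b = 0.0, 0.0                        # initial weight and bias
y_pred = x * w + b                     # predictions (all zeros with w = b = 0)
loss = np.sum((y_pred - y) ** 2) / n   # squared deviations, summed into a scalar
print(loss)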

5. Calculate ∂loss/∂w

The derivative of a function of a real variable measures the sensitivity to change of the function value with respect to a change in its argument.

We can use calculus to find how loss changes with respect to w.

loss = (y_pred - y)²/n
loss = (y_pred² + y² - 2*y*y_pred)/n (expanding the square)
=> ((x*w+b)² + y² - 2*y*(x*w+b))/n (substituting y_pred)
=> ((x*w+b)²/n) + (y²/n) + ((-2*y*(x*w+b))/n) (splitting the terms)

Let A = (x*w+b)²/n
Let B = y²/n
Let C = (-2*y*(x*w+b))/n

A = (x²w² + 2xwb + b²)/n (expanding)
∂A/∂w = (2x²w + 2xb)/n (differentiating)
∂B/∂w = 0 (differentiating)
C = (-2yxw - 2yb)/n
∂C/∂w = (-2yx)/n (differentiating)

So, ∂loss/∂w will be the sum of all these terms:

∂loss/∂w = (2x²w + 2xb - 2yx)/n
=> (2x(x*w + b - y))/n

So, the derivative of the loss with respect to w was found to be:
(2/n)*(y_pred - y)*x. Let us call this dw.

If we perform the same differentiation for the loss with respect to b, we'll get:
(2/n)*(y_pred - y). Let us call this db.

These dw and db are what we call "gradients".
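
Continuing the same NumPy sketch (same made-up x and y as before), the gradients are just two vector expressions:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
n = len(x)
w, b = 0.0, 0.0

y_pred = x * w + b
dw = (2 / n) * np.sum((y_pred - y) * x)   # derivative of the loss w.r.t. w
db = (2 / n) * np.sum(y_pred - y)         # derivative of the loss w.r.t. b
print(dw, db)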

6. Update w and b

Figure 6.1

As we can see in figure 6.1, if we initialize the weight randomly, it might not land at the global minimum of the loss function.
It is our duty to update the weight to the point where the loss is minimum. We have calculated dw above.

dw is nothing but the slope of the tangent to the loss function at the current point w,
considering the initial position of w.

Important point to understand:

In the above diagram, the slope of the tangent to the loss will be positive, since the initial value of w is too high and needs to be reduced to attain the global minimum.
If the value of w is low and we want to increase it to attain the global minimum, the slope of the tangent to the loss at point w will be negative.

We want the value of w to be a little lower so as to attain the global minimum of the loss function, as shown in figure 6.1.

We know that dw is positive in the above graph and we need to reduce w.
This can be done by:

w = w - alpha*dw
b = b - alpha*db

where alpha is a small number, roughly between 0.1 and 0.0000001, known as the learning rate.
Doing this, we reduce w if the slope of the tangent to the loss at w is positive and increase w if the slope is negative.
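
To see this sign logic in isolation, here is a tiny self-contained sketch using a toy loss (w - 3)², whose minimum is at w = 3 (the numbers are arbitrary, chosen only to illustrate the update rule):

alpha = 0.1
for start in (10.0, -4.0):          # one start above the minimum, one below
    w = start
    for _ in range(100):
        dw = 2 * (w - 3)            # slope is positive when w > 3, negative when w < 3
        w = w - alpha * dw          # so w moves down or up toward the minimum
    print(start, "->", round(w, 4)) # both runs end up close to 3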

7. Learning Rate

The learning rate alpha is something we have to choose manually, and it is something we do not know beforehand. Choosing it is a matter of trial and error.
The reason we do not directly subtract dw from w is that doing so might change the value of w too much, and we might not end up at the global minimum but even further away from it.

8. Training Loop

The process of calculating the gradients and updating the weight and bias is repeated several times, which results in optimized values for the weight and bias.
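
Putting all the steps together, a minimal NumPy training loop might look like this (the data, learning rate, and iteration count are assumptions for illustration only):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up data that roughly follows y = 2x + 1
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
n = len(x)

w, b = 0.0, 0.0        # step 1: initialize weight and bias
alpha = 0.01           # learning rate, picked by trial and error

for _ in range(10000):                       # training loop
    y_pred = x * w + b                       # predict with the current w and b
    dw = (2 / n) * np.sum((y_pred - y) * x)  # gradient w.r.t. w
    db = (2 / n) * np.sum(y_pred - y)        # gradient w.r.t. b
    w = w - alpha * dw                       # update step
    b = b - alpha * db

print(w, b)

With this toy data (generated from y = 2x + 1), w and b converge to roughly 2 and 1.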

9. Prediction

After the training loop, the values of the weight and bias are optimized and can be used to predict the dependent variable for new x values.

y = x*w + b
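
For instance, if training on the toy data above ended with w close to 2 and b close to 1, prediction for a new input is just the same line equation:

w, b = 2.0, 1.0      # values assumed from the toy training run above
x_new = 6.0          # a new, unseen independent variable value
y_new = x_new * w + b
print(y_new)         # -> 13.0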

Conclusion

That's it for linear regression with gradient descent.
The "learning" in machine learning refers to this part, where the gradients with respect to w and b are computed and w and b are updated accordingly.

Thank You
