Gradient Descent

Nelson Punch
Software-Dev-Explore
8 min read · Mar 22, 2024

What is it?

Gradient Descent is an iterative technique used in machine learning and deep learning to find the best possible set of parameters, or coefficients, for a model given some data points and a loss function. In other words, it is an iterative method for optimizing a model’s parameters.

The Gradient Descent technique is employed in the backpropagation stage during model training. It is the reason a model can learn from data and get better and better.

Parameters?

One might ask: what are the parameters in a model? To illustrate, I will use a simple model, namely Linear Regression. A Linear Regression model looks like the formula below.

y = wx + b (the formula for a straight line)
  • w: This is a parameter. A weight or slope. It tells us how much y increases on average when x increases by one unit
  • b: This is a parameter. A bias or intercept. It tells us the expected value of y when x is zero
  • x: This is a feature or data point that is used to predict y
  • y: This is the label or prediction

As an example, consider salary prediction with this Linear Regression model. We define:

  • w: Wage increase per year of experience. Say 500
  • b: Minimum wage. Say 1000
  • x: Years of experience
  • y: Predicted salary

Then we can see:

Salary = wage increase per year of experience * years of experience + minimum wage

If we have no experience, then our salary would be y = 500 * 0 + 1000, which gives y = 1000. We can say that when you have no experience (x), your salary (y) is 1000, hence y = b.

  • For 1 year of experience: y = 500 * 1 + 1000, where y = 1500
  • For 2 years of experience: y = 500 * 2 + 1000, where y = 2000
  • For 3 years of experience: y = 500 * 3 + 1000, where y = 2500

The pattern here is that y increases by 500 when x increases by 1 unit. We can say that when your experience (x) increases by 1 year, your salary (y) increases by 500.
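This behaviour is easy to verify in a few lines of Python (a minimal sketch using the example numbers above):

```python
# Linear model y = w * x + b with the salary example values from the text.
w = 500    # wage increase per year of experience
b = 1000   # salary with no experience

def predict_salary(years_of_experience):
    return w * years_of_experience + b

for years in range(4):
    print(years, predict_salary(years))   # 0 -> 1000, 1 -> 1500, 2 -> 2000, 3 -> 2500
```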

Remember there are 2 parameters, w and b. Now you can see that Gradient Descent is an iterative method used to find the optimal values for w and b so that predictions can be as accurate as possible.

Error and Loss

In order to optimize the model’s parameters, we need to know how much difference (error) there is between the predicted value and the actual value. The loss is then calculated over a set of data.

An error would be: error_i = ŷ_i - y_i

  • ŷ (y with a hat): Predicted value
  • y: Actual value (label or answer)

Here, i denotes a particular data point, from 0 to N.

Error and loss are different.

  • Error: The difference between the actual value and the predicted value.
  • Loss: The aggregation of errors over a set of data.

For a Linear Regression model, its loss can be calculated with MSE (Mean Squared Error), the average of all squared errors: MSE = (1/N) * Σ ((wx_i + b) - y_i)². A short code sketch follows the list below.

  • wx+b: Prediction value
  • y: Actual value
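In code, the MSE loss could be computed like this (a minimal sketch assuming the data is held in NumPy arrays):

```python
import numpy as np

def mse_loss(w, b, x, y):
    """Mean Squared Error between predictions (w * x + b) and actual values y."""
    predictions = w * x + b
    return np.mean((predictions - y) ** 2)
```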

The model will make accurate predictions when the loss is low. In other words, the lower the loss, the better the prediction. When the loss is at its minimum, the w and b parameters are at their best.

Gradient

How do we know how much the loss changes when the w or b parameter changes? The answer is the gradient.

The gradient tells us how much one value changes when we slightly vary another value. In our case, how much the loss will change when we vary w and b separately.

The gradient is also known as the partial derivative. Our Linear Regression model has 2 parameters, w and b, and partial derivatives can help us find the gradient of w and the gradient of b.

A gradient of a straight line.

https://www.onlinemathlearning.com/gradient-graphs.html

Imagine the X axis is the w or b parameter and the Y axis is the loss; then we have a much clearer picture of what the gradient is.

The partial derivative of the loss with respect to the w parameter: ∂loss/∂w.

The partial derivative of the loss with respect to the b parameter: ∂loss/∂b.

Partial Derivative

We can use the chain rule to calculate the partial derivatives. For a composite function, the derivative of f(g(x)) with respect to x is f'(g(x)) * g'(x): calculate the derivative of the outer function, then the derivative of the inner function, and multiply them together.

Some background knowledge about derivatives:

The derivative of a variable x with respect to x is 1: d/dx x = 1.

For example, by the power rule, d/dx x² = 2x.

The derivative of a constant c is always 0: d/dx c = 0.

With these rules, we can now take the partial derivatives with respect to w and b of the loss for a single data point, loss = ((wx + b) - y)².

Partial derivative with respect to w

Keep the expression inside unchanged; the derivative of ( )² is 2( ), which gives 2((wx + b) - y).

Next, for the expression inside, ignore the 2( ) and treat b and y as constants. Because the derivative of a constant is 0, wx + 0 - 0 leaves wx. The x stays in place since it multiplies w, and the derivative of w is 1, so the derivative of the inner expression is x.

Multiply them together: 2((wx + b) - y) * x.

wx + b is equal to ŷ (y-hat).

Finally we get: ∂loss/∂w = 2(ŷ - y) * x.

Partial derivative with respect to b

Again, keep the expression inside unchanged; the derivative of ( )² is 2( ), which gives 2((wx + b) - y).

Next, for the expression inside, ignore the 2( ). The derivative of b is 1, while wx and y are treated as constants and become 0, so the derivative of the inner expression is 1.

w here becomes a constant value.

Multiply them together: 2((wx + b) - y) * 1.

wx + b is equal to ŷ (y-hat).

Finally we get: ∂loss/∂b = 2(ŷ - y).
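If you would rather not do the algebra by hand, a symbolic math library can confirm these results. A minimal sketch using SymPy (not part of the original article):

```python
import sympy as sp

w, b, x, y = sp.symbols('w b x y')
loss = ((w * x + b) - y) ** 2       # squared error for a single data point

dloss_dw = sp.diff(loss, w)         # expect 2*((w*x + b) - y)*x, up to rearrangement
dloss_db = sp.diff(loss, b)         # expect 2*((w*x + b) - y), up to rearrangement

print(sp.factor(dloss_dw))
print(sp.factor(dloss_db))
```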

The partial derivative (gradient) with respect to each of w and b, shown on a graph.

Update parameters

Now we know the loss and how to find the gradients for w and b. In each iteration of Gradient Descent we need to update both parameters, w and b. However, there is another parameter involved, namely the Learning Rate.

The Learning Rate is a parameter that is adjusted manually; therefore it is also known as a hyper-parameter.

Gradient Descent in a graph looks like this.

https://blog.gopenai.com/understanding-of-gradient-descent-intuition-and-implementation-b1f98b3645ea

Cost here is another term for loss.

In the beginning, the parameters are initialized randomly, and then it looks like a ball rolling down a hill, step by step, until it reaches the bottom of the hill where the loss is at its minimum.

The steps (learning steps) in the graph correspond to our Learning Rate. A high learning rate would overshoot and end up on the other side of the hill, while one that is too small would take forever to reach the bottom of the hill.

In Gradient Descent there is only one learning rate, but a model can have many parameters. Thus, choosing a proper Learning Rate is another important topic.
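A tiny, hypothetical 1-D example makes the effect of the learning rate concrete (minimizing loss(w) = (w - 3)², so the minimum is at w = 3; the numbers are illustrative only):

```python
# Gradient Descent on loss(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
def run(lr, steps=10, w=0.0):
    for _ in range(steps):
        grad = 2 * (w - 3)
        w = w - lr * grad
    return w

print(run(lr=1.1))    # too large: w oscillates further and further from 3 (diverges)
print(run(lr=0.001))  # too small: w barely moves away from its starting point
print(run(lr=0.1))    # reasonable: w ends up close to 3
```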

To update the parameters w and b:

New parameter value = previous parameter value - learning rate * gradient

With η (eta) as the learning rate: w = w - η * ∂loss/∂w and b = b - η * ∂loss/∂b.

Code

The complete code is here.

Coding Gradient Descent is pretty straightforward.

First we need to generate data points, then do Gradient Descent.

We can define the number of iterations for Gradient Descent. In each iteration we will do the following:

  • Forward pass(Make prediction)
  • Compute loss
  • Compute gradient
  • Update parameters

Generate data

The purpose of the noise in the code is to spread the data points out a bit; otherwise the data points would form a perfectly straight diagonal line.
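The complete code is linked above; a minimal sketch of the data-generation step (assuming NumPy, with w = 2 and b = 1 picked arbitrarily as the "true" parameters) could look like this:

```python
import numpy as np

rng = np.random.default_rng(42)

w, b = 2.0, 1.0                          # "true" parameters the model should recover (assumed values)
x = np.linspace(0, 10, 100)              # feature values
noise = rng.normal(0, 1, size=x.shape)   # spreads the points so they do not sit on a perfect line
y = w * x + b + noise                    # labels
```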

Let’s see what the data looks like; here we also plot a regression line representing the w and b parameters.

The red line is what we expect the regression line to look like after Gradient Descent.

Gradient Descent

p_w and p_b are initialized randomly, and we expect them to be updated in each Gradient Descent iteration, moving toward the true w and b parameters.
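A minimal sketch of that loop, continuing from the data-generation snippet above (the learning rate of 0.01 and 1,000 epochs are illustrative choices, not necessarily the article's exact values):

```python
p_w = rng.normal()             # randomly initialized parameters
p_b = rng.normal()
learning_rate = 0.01
epochs = 1000

for epoch in range(epochs):
    y_hat = p_w * x + p_b                     # forward pass (make prediction)
    loss = np.mean((y_hat - y) ** 2)          # compute loss (MSE)
    grad_w = np.mean(2 * (y_hat - y) * x)     # compute gradient with respect to w
    grad_b = np.mean(2 * (y_hat - y))         # compute gradient with respect to b
    p_w = p_w - learning_rate * grad_w        # update parameters
    p_b = p_b - learning_rate * grad_b

print(p_w, p_b)   # should end up close to the true w and b
```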

Print the result to see the p_w and p_b parameters.

We see p_w and p_b are very close to w and b.

We can visualize the regression line changing in motion.

The green dashed line moves closer and closer to the red solid line with each epoch (iteration).

Here we can also visualize the w and b parameters approaching their optimal values.

The color represents the loss value; the darker the color, the higher the loss. The purple line represents the path the w and b parameters take.
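For completeness, here is a sketch of how such a loss-surface picture could be drawn with Matplotlib (reusing x and y from the data-generation snippet; recording and plotting the purple parameter path is noted as a comment):

```python
import matplotlib.pyplot as plt

def mse(w_, b_):
    return np.mean(((w_ * x + b_) - y) ** 2)

# Evaluate the loss on a grid of candidate (w, b) values.
w_grid = np.linspace(0, 4, 100)
b_grid = np.linspace(-2, 4, 100)
W, B = np.meshgrid(w_grid, b_grid)
L = np.vectorize(mse)(W, B)

plt.contourf(W, B, L, levels=30, cmap="viridis_r")   # reversed colormap: darker = higher loss
plt.colorbar(label="loss")
plt.xlabel("w")
plt.ylabel("b")
# To draw the parameter path, record (p_w, p_b) each epoch during training and plot
# the history here, e.g. plt.plot(w_history, b_history, color="purple").
plt.show()
```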

Conclusion

Gradient Descent is a technique for finding the best possible set of parameters for a model. In our case, a Linear Regression model, we have only 2 parameters, w and b, and we use the Gradient Descent technique to optimize them.

A more complex model does literally the same thing, except with more than 2 parameters. A complex model may have thousands or millions of parameters.

Finally, we have a clear picture of what Gradient Descent is.
