Understanding Gradient Descent

Implementation with Numpy

Nilesh Barla
Analytics Vidhya
7 min read · Oct 24, 2019


Introduction

Gradient descent is one of the most important optimisation techniques used in machine learning, and most deep learning algorithms rely on it.

Two key terms stand out in the paragraph above:

  • Gradient Descent
  • Optimisation

Let’s start with optimisation first.

Optimisation by definition is “the action of making the best or most effective use of a situation or resource” — according to Google.

Optimisation according to mathematics is:

A task of either minimising or maximising some function f(x).

If we combine both of the definitions above, we get:

The action of minimising or maximising some function f(x) by making use of the most effective resource[s].

The function we want to optimise is often called the objective function, in our case f(x). When we are minimising the objective function, we may also refer to it as the cost function, loss function, or error function.

We want to find argmin f(x), the value of x at which f(x) attains its minimum.

Calculus

Now, let's look at the calculus behind reducing the objective function.

Let’s assume we have a function y=f(x).

where x and y are both real numbers.

Now, the derivative of this function is denoted f′(x).

What is a derivative?

The derivative of a function of a real variable measures the change in the output value with respect to a small change in the input value. It can also be written as a limit:

f′(x) = lim(h → 0) [f(x + h) − f(x)] / h

The derivative gives us the slope of the function at the point x. In other words, it tells us by how much the output changes when we make a small, elemental change to the input.

Derivatives add meaning to optimisation because they tell us how small changes in the input x alter the value of y.

You can watch the sets of videos on YouTube by patrickJMT. These will give you a fundamental knowledge of derivatives and how they work.

Why do we do that?

Because we need to move in the direction opposite to the derivative, reducing f(x) in small steps, in order to trace back to the original parameters that we started with. This technique is called gradient descent.

Parameters!!!

Yes, Parameters.

Remember our definition for optimisation?

We need an effective resource in order to minimise or maximise our function f(x). That resource is our set of parameters.

Our parameters transform the input x into the output y.

To recall a familiar high school formula:

f(x) = y

Or, w·x + b = y, the equation of a straight line.

where w and b are the parameters.

Think of a function as a journey to a distant land. Setting out on the journey is the input, while coming back with souvenirs and memories is the output. Of course, when you go on the journey you take money, and some expectations as well. Think of these as the parameters of your journey.

Now, imagine how those parameters will affect the outcome of your journey.

Gradient Descent

As mentioned earlier, gradient descent is a technique that helps us trace back to the original parameters. At its core, it is an application of partial derivatives.

But why do we need to do that?

A simple answer is: so that we know our algorithm is working properly and not generating random values.

Think of gradient descent as cross-checking your answers.

Example: 1 + x = 5. We know that x will be 4. How? Because x = 5 − 1, a simple trick we learnt in middle school. The same idea goes for gradient descent: we check whether, starting from the output y, we can work our way back to the parameters that produced it. If we can, our model is correct; otherwise it is not.

The job of gradient descent is to find a point where f(x) is lower than at the neighbouring points. We make the function take small steps downhill, and we keep doing so until f(x) no longer decreases, i.e. the gradient is 0. The lowest such point is known as the global minimum: the point where f(x) is lower than at every other value of x.

The size of those steps is controlled by what is known as the learning rate.

Let's move on and understand the rest with code.

Let’s Code

Coding with Numpy

1. Importing dependencies:
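A minimal sketch of this step, assuming NumPy for the numerical work and Matplotlib for the plots:

```python
import numpy as np                # numerical arrays and vectorised maths
import matplotlib.pyplot as plt   # plotting
```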

2. We will declare X, our input, which will contain 100 rows and one column, making it a column vector.

3. We will declare y, our output, which equals b + w·X + Gaussian noise.

Remember we are trying to learn and code how gradient descent works. So we need to declare our parameters.

Our aim is to find whether or not gradient descent is able to trace back the original parameters.
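Steps 2 and 3 might look like the following sketch. The true parameter values (w = 3 and b = 1, the values checked again in step 11), the noise scale of 0.1, and the random seed are all assumptions chosen for this illustration:

```python
import numpy as np

np.random.seed(42)               # fixed seed so the run is reproducible

# Step 2: X is our input, a (100, 1) column vector
X = np.random.rand(100, 1)

# Step 3: the "true" parameters that we will later try to recover
true_b = 1
true_w = 3

# y = b + w.X plus some Gaussian noise
y = true_b + true_w * X + 0.1 * np.random.randn(100, 1)
```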

4. Visualising our data.
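A sketch of the visualisation step, regenerating the same assumed data as above; the scatter should look like a noisy straight line:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
X = np.random.rand(100, 1)
y = 1 + 3 * X + 0.1 * np.random.randn(100, 1)

# Step 4: plot the raw data
plt.scatter(X, y)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Generated data")
plt.show()
```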

5. Now, let's initialise random values for both w and b. Through an iterative process, gradient descent should move these back towards the original parameters.
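A sketch of this step; the seed is an assumption, used only so the initial guess is reproducible:

```python
import numpy as np

np.random.seed(0)    # a different seed for the initial guess

# Step 5: random starting values for the parameters
w = np.random.randn(1)
b = np.random.randn(1)
```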

6. Defining our learning rates and number of epochs.

Learning rate: a scalar that defines the size of each step taken towards the global minimum.
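The exact values here are assumptions, but they are common choices for a toy problem like this one:

```python
# Step 6: hyperparameters for the training loop
lr = 0.1          # learning rate: size of each step towards the minimum
epochs = 1000     # number of iterations of gradient descent
```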

7. With new random parameters we will calculate our y. This time we will call it yhat — which will be our predicted data.
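A sketch of this step, reusing the assumed data and the random initialisation from the earlier steps:

```python
import numpy as np

np.random.seed(42)
X = np.random.rand(100, 1)

np.random.seed(0)
w = np.random.randn(1)    # random, untrained parameters
b = np.random.randn(1)

# Step 7: predictions made with the random parameters
yhat = b + w * X
```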

8. Lets visualise the original dataset y with predicted data yhat.
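A sketch of the comparison plot, with the same assumed data and seeds as before:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
X = np.random.rand(100, 1)
y = 1 + 3 * X + 0.1 * np.random.randn(100, 1)

np.random.seed(0)
w, b = np.random.randn(1), np.random.randn(1)
yhat = b + w * X

# Step 8: the random-parameter predictions do not line up with the data yet
plt.scatter(X, y, label="y (original)")
plt.scatter(X, yhat, label="yhat (predicted)")
plt.legend()
plt.show()
```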

We observe that yhat is totally out of sync with y. That is what we have to fix with the gradient descent optimisation technique.

9. Calculate the mean squared error. Why? Because this tells us how wrong our algorithm is, i.e. how far yhat is from y.

The key point to notice is that the error should be driven as close to 0 as possible.
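A sketch of the error calculation, with the same assumed data and seeds as before:

```python
import numpy as np

np.random.seed(42)
X = np.random.rand(100, 1)
y = 1 + 3 * X + 0.1 * np.random.randn(100, 1)

np.random.seed(0)
w, b = np.random.randn(1), np.random.randn(1)
yhat = b + w * X

# Step 9: mean squared error between the data and the prediction
error = y - yhat
mse = (error ** 2).mean()
print(mse)    # large at first; training should drive it towards 0
```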

10. In the following block of code we update the weight and bias, w and b, on each iteration.

For each iteration, the values of w and b are updated based upon the learning rate and the derivatives of the loss with respect to b and w.
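A sketch of the training loop, under the same assumptions as the earlier steps (true parameters w = 3, b = 1; lr = 0.1; 1000 epochs):

```python
import numpy as np

np.random.seed(42)
X = np.random.rand(100, 1)
y = 1 + 3 * X + 0.1 * np.random.randn(100, 1)

np.random.seed(0)
w, b = np.random.randn(1), np.random.randn(1)

lr = 0.1
epochs = 1000

# Step 10: update w and b on every iteration
for _ in range(epochs):
    yhat = b + w * X
    error = y - yhat

    # gradients of the MSE loss with respect to b and w
    b_grad = -2 * error.mean()
    w_grad = -2 * (X * error).mean()

    # step against the gradient, scaled by the learning rate
    b = b - lr * b_grad
    w = w - lr * w_grad

print(w, b)    # w should end up close to 3 and b close to 1
```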

If you look closely at the variable b_grad, you will find the operation -2*error.mean(). That's because, with error = y − yhat, the derivative of error**2 with respect to b is -2*error, and we take the mean over all the samples, hence error.mean().

11. Now, let's check the values of w and b. We get values close to 3 and 1, which is what we initialised in step 3. Go on, check.

12. Now let's calculate y with our updated w and b, calling it ynew, and then visualise it.
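A sketch of the final step, with the same assumed data and training loop as above:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
X = np.random.rand(100, 1)
y = 1 + 3 * X + 0.1 * np.random.randn(100, 1)

np.random.seed(0)
w, b = np.random.randn(1), np.random.randn(1)
lr, epochs = 0.1, 1000

for _ in range(epochs):
    error = y - (b + w * X)
    b = b - lr * (-2 * error.mean())        # same updates as in step 10
    w = w - lr * (-2 * (X * error).mean())

# Step 12: predictions with the trained parameters
ynew = b + w * X

plt.scatter(X, y, label="y (original)")
plt.plot(X, ynew, color="red", label="ynew (best fit)")
plt.legend()
plt.show()
```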

Wow! We got a best-fit line, which is impressive and quite a process to ponder.

So gradient descent worked its way backward to find the original parameters that were used to build the model. This is what is known as training in machine learning.

As ML practitioners our aim is to find parameters which relate our input to the output.

Conclusion

  1. Gradient Descent is an optimisation technique.
  2. GD helps an algorithm find the true parameters that, applied to the input x, produced the output y.
  3. GD takes small steps that enable it to reach the global minimum: a point where the function f(x) is lower than at all neighbouring points.
  4. GD powers most of the machine learning algorithms.
