GRADIENT DESCENT

Nikita Malviya
Praemineo
Dec 11, 2018

Gradient descent is a first-order, iterative optimization algorithm used to find the minimum of a function, that is, to minimize the error. Using the gradient descent algorithm, we adjust the parameters step by step so that the error term gets smaller at every iteration.

Gradient Descent: Error Formula

Let's understand where gradient descent is used, what it does, and finally how it works. This blog covers the application of gradient descent, its derivation, and finally its implementation with an example. The only prerequisite is a basic understanding of neural networks.

Let’s consider the projection of 2-D straight line fitting. Suppose y = f(x) the given set of data, suppose the problem is to fit a straight line through the given data, the fitting of straight line is modeled using the linear neuron.

Line in 2-D plane and its neuron model

So we have a single neuron which works with a linear activation unit. Linear activation means: Output = Summation( Wj * xj ), for j = 1 to n.

In this case: Output = W1 * x1 + W0, where the output is the sum of the products of the weights (Wj) and the inputs (xj), plus a bias (W0).

As long as 'y' is a function of a single variable 'x', we know it is a line fitting. We have to tune only 2 parameters, the slope and the intercept. So it is a two-dimensional fitting problem consisting of only two variables, 'x' and 'y'.

Let’s consider the projection of 3-D straight line fitting. Suppose ‘y’ is dependent upon several variables. Let us say that ‘y’ is a function of (x1, x2)

How can this be modeled as a neural network? Simply, we will have a neuron with two inputs, x1 and x2, and a third adjustment coming from the bias. So we have three weights to tune: W0 (the bias), W1 and W2.

Line in 3-D plane and its neuron model

It’s a 3D straight line fitment and we are breaking it up into the fitment of two slopes.

  • W1 is one component of the slope, along the x1-y plane.
  • W2 is the other component of the slope, along the x2-y plane.
  • On top of these, we also have W0 (the bias) added.

In the case y = f(x1, x2), it becomes a 3-dimensional line-fitting problem, so it is a matter of tuning three parameters. If we tune these 3 parameters, we can fit the 3-D straight line.

Output = W1 * x1 + W2 * x2 + W0

Now let's consider 'y' as a function of the variables (x1, x2, x3, …, xn); in this case it becomes an N-dimensional fitting problem.

** Up to 3-D we can bring the fit into our visualization, but anything beyond 3-D we cannot visualize.

The neuron model for ’n’ inputs will look like…

Neuron model for ’n’ inputs in N-dimension

For this case: Output = W0 + W1 * x1 + W2 * x2 + … + Wn * xn
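
As a quick illustration, here is a minimal sketch (not the author's code) of this n-input linear neuron using NumPy; the function name and the example numbers are made up for demonstration:

    import numpy as np

    # Linear activation unit: weighted sum of the inputs plus the bias.
    def linear_neuron(x, weights, bias):
        return np.dot(weights, x) + bias

    x = np.array([1.5, -2.0, 0.3])        # inputs x1..x3
    weights = np.array([0.4, 0.1, -0.7])  # weights W1..W3
    bias = 0.25                           # W0
    print(linear_neuron(x, weights, bias))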

We have fitted an N-dimensional straight line which has many gradients (slopes): a gradient along the x1-y plane, a gradient along the x2-y plane, and so on.

So, if we adjust and combine all of these, it becomes the problem of fitting an N-dimensional straight line, which we cannot visualize. We have a number of observations, and the fitted straight line does not pass through all of them; it makes some errors. Some error values are positive, some are negative. We define the error made in the process of fitment, and to get the combined error we add the individual errors up. So the problem is to fit the line that makes the smallest combined error.

** Best Line Fitment = Minimum Error
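
To make "minimum error" concrete, here is a rough sketch (my own illustration, with made-up numbers) that scores a candidate line y = w * x + b on some data; squaring the individual errors keeps the positive and negative ones from cancelling when we add them up:

    import numpy as np

    # Sum of squared errors of a candidate line over all observations.
    def total_error(w, b, xs, ys):
        predictions = w * xs + b
        return np.sum((ys - predictions) ** 2)

    xs = np.array([1.0, 2.0, 3.0, 4.0])
    ys = np.array([2.1, 3.9, 6.2, 8.1])
    print(total_error(2.0, 0.0, xs, ys))  # the lower the value, the better the fit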

Working of Gradient Descent

Imagine placing an object on the error surface and letting it move downhill; for intuition, picture fitting a curve in 3-D.

We’ve to adjust W0, W1 up to Wn , all these ’n’ parameters simultaneously,
thus the problem is extended to an
N dimensional surface, where the model will produce errors in fitting ’n’ different points simultaneously. The movement should be negative of the direction of the gradient.

Why should the movement be negative, i.e. opposite to the direction of the gradient?

As an example, let's say you are hiking up a mountain. Imagine the top of the mountain is to the north, so the gradient points north with a magnitude of 0.8, meaning that for each meter you move north, you rise 0.8 meters. If you walk in the opposite direction, the rate of change will be -0.8, and that is the minimum of all possible rates of change. If you walk east or west, the rate of change will be 0, which is the minimum possible magnitude for the rate of change. But -0.8 is less than 0, so walking south, opposite to the gradient, takes you downhill the fastest.
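
In code, "moving opposite to the gradient" is just one line. Here is a minimal sketch (not from the article's repository) of the update rule applied to a simple function whose gradient we know:

    # Step in the direction opposite to the gradient, scaled by the learning rate.
    def gradient_descent_step(w, grad, learning_rate=0.1):
        return w - learning_rate * grad(w)

    # Example: descend f(w) = w**2, whose gradient is 2*w.
    w = 5.0
    for _ in range(20):
        w = gradient_descent_step(w, lambda v: 2 * v)
    print(w)  # approaches the minimum at w = 0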

GLOBAL AND LOCAL MINIMA :

Local vs Global Minima

LOCAL MINIMA : A minimum is said to be local if it is the smallest value of the function within a given range.

GLOBAL MINIMA : A minimum is said to be global if it is the smallest value of the function on the entire domain of a function.

NOTE :

There is only one global minimum value, but there can be more than one local minimum (or maximum).

COST FUNCTION :

In ML, cost functions are used to estimate how badly models are performing. Put simply, a cost function is a measure of how wrong the model is in terms of its ability to estimate the relationship between X and Y. This is typically expressed as a difference or distance between the predicted value and the actual value. The cost function (you may also see this referred to as loss or error) can be estimated by iteratively running the model to compare estimated predictions against "ground truth", the known values of y.
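
One common choice of cost function (an assumption on my part, since the article does not fix a particular formula here) is the mean squared error, the average squared distance between predictions and ground truth:

    import numpy as np

    # Mean squared error between the known values of y and the model's predictions.
    def mse_cost(y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)

    y_true = np.array([3.0, -0.5, 2.0, 7.0])
    y_pred = np.array([2.5,  0.0, 2.0, 8.0])
    print(mse_cost(y_true, y_pred))  # 0.375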

EXAMPLE : As children we typically learn what is “right” or “good” behavior by being told not to do things or being punished for having done something we shouldn’t. For example, you can imagine a five year-old child sitting by a fire to keep warm, but not knowing the danger of fire, she puts her finger into it and gets burned. The next time she sits by the fire, she doesn’t get burned, but she sits too close, gets too hot and has to move away. The third time she sits by the fire she finds the distance that keeps her warm without exposing her to any danger. In other words, through experience and feedback (getting burned, then getting too hot) the kid learns the optimal distance to sit from the fire. The heat from the fire in this example acts as a cost function — it helps the learner to correct / change behavior to minimize mistakes.

DERIVATION OF GRADIENT DESCENT

So let's derive the error formula for a better understanding. Consider 'm' training points; using the cross-entropy error, the total error is:

E = -(1/m) * Σ [ y * ln(^y) + (1 - y) * ln(1 - ^y) ]

where, for each point, the prediction ^y is given by:

^y = sigmoid( W1 * x1 + W2 * x2 + … + Wn * xn + W0 )     (1)

Our goal is to calculate the gradient of E at a point X = (x1, x2, …, xn), given by the partial derivatives:

∇E = ( ∂E/∂W1, ∂E/∂W2, …, ∂E/∂Wn, ∂E/∂W0 )

** To simplify the calculations, we'll actually think of the error that each single point produces, and calculate the derivatives of that error.

To calculate the derivative of this error w.r.t. a weight, we first calculate the derivative of ^y w.r.t. that weight, and for that we also need the first derivative of the sigmoid function, since it is used inside ^y.

The derivative of the sigmoid function is:

sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))     (2)

Using (1) and (2), the derivative of the prediction w.r.t. a weight Wj is:

∂^y/∂Wj = ^y * (1 - ^y) * xj

  • Only xj is left because, when differentiating w.r.t. Wj, every other term of the weighted sum is a constant.

To calculate the derivative of the error E at the point X w.r.t. the weight Wj, we apply the chain rule, which gives the differentiated error formula used to find the gradient:

∂E/∂Wj = -(y - ^y) * xj

A similar calculation for the bias gives:

∂E/∂W0 = -(y - ^y)
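
Putting the derived formulas to work, here is a minimal sketch (my own, not the article's repository code) of one gradient descent update for a single sigmoid neuron; since dE/dWj = -(y - y_hat) * xj, stepping opposite to the gradient means adding learning_rate * (y - y_hat) * xj to each weight:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # One gradient descent step for a single sigmoid neuron.
    def update(x, y, weights, bias, learning_rate=0.1):
        y_hat = sigmoid(np.dot(weights, x) + bias)
        error_term = y - y_hat
        weights = weights + learning_rate * error_term * x  # -learning_rate * dE/dWj
        bias = bias + learning_rate * error_term            # -learning_rate * dE/dW0
        return weights, bias

    x = np.array([0.5, -1.2])
    weights, bias = np.array([0.1, 0.4]), 0.0
    weights, bias = update(x, y=1.0, weights=weights, bias=bias)
    print(weights, bias)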

Why do we take the derivative?

When updating the fit, the derivative (slope) tells us in which direction and by how much to change the parameters.

IMPLEMENTING GRADIENT DESCENT

Let’s take an example to implement the gradient descent algorithm. Suppose y = (x+5)² a function, and our current place at the plane is x = 3. The task is to find the minima of the function.

To find the minimum, set the derivative to zero: dy/dx = 2 * (x + 5) = 0, which gives x = -5 (where y = 0).

So our task is to make the current point x = 3 converge to the minimum of the curve, x = -5, for the function above.

At each iteration we move opposite to the gradient, using the update formula:

current_x = previous_x - learning_rate * gradient, where gradient = dy/dx = 2 * (previous_x + 5)
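
Here is a minimal sketch of that loop (my own illustration; see the linked repository for the author's version). The gradient of y = (x + 5)² is dy/dx = 2 * (x + 5):

    current_x = 3.0          # starting point
    learning_rate = 0.01
    precision = 0.000001     # stop when the step becomes this small
    max_iterations = 10000

    for i in range(max_iterations):
        previous_x = current_x
        gradient = 2 * (previous_x + 5)
        current_x = previous_x - learning_rate * gradient
        if abs(current_x - previous_x) < precision:
            break

    print("The local minimum occurs at", current_x)  # converges towards x = -5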

Go to my GitHub repository for the implementation, where you can observe the gradient value at each and every iteration as it approaches the minimum.

https://github.com/praemineo/gradient-descent.git

LEARNING RATE :

We used a learning rate in the gradient descent algorithm above, so let's look at what the learning rate means and why it exists.

The learning rate is how quickly a network abandons old beliefs for new ones.

For a simple example: if a child sees 5 examples of apples and all of them are red, she will think that apples are red and will look for red color when trying to identify an apple. Now she sees a green apple and her parents tell her it's an apple (supervised learning). With a large "learning rate", she will quickly realize that "red" is not the most important feature of apples. With a small learning rate, she will think that this green apple is an outlier and that apples are still red.

The point is that a higher learning rate means the network changes its mind more quickly. If the learning rate is too high, it might start to think that all apples are green even though it has seen more red apples than green ones.

In general, you want to find a learning rate that is low enough that the network converges to something useful, but high enough that you don't have to spend ages training it (too many iterations because the learning rate is too small).
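
As a rough illustration of that trade-off (the learning rates are made up), reusing the y = (x + 5)² example from above:

    # Run a fixed number of gradient descent steps with a given learning rate.
    def descend(learning_rate, start=3.0, steps=50):
        x = start
        for _ in range(steps):
            x = x - learning_rate * 2 * (x + 5)
        return x

    print(descend(0.001))  # too small: after 50 steps, still far from -5
    print(descend(0.1))    # reasonable: very close to -5
    print(descend(1.1))    # too large: the steps overshoot and blow up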

I hope you understood the need for and the working of gradient descent. Am I missing anything? Let me know in the comments and I'll add it in! If you enjoyed this blog post, share it with your friends! :)
