Gradient: the concept for beginners

MLamine Guindo · Published in unpack · Apr 5, 2021

In this article, we will look at the concept behind the gradient and how to apply it to a quadratic function. Finally, the various forms of gradient descent used today will be presented.

WHAT IS A GRADIENT?

“A gradient measures how much the output of a function changes if you change the inputs a little bit.” — Lex Fridman (MIT)

In other words, the gradient is the slope of a function. The steeper the slope, the more quickly a model learns; if the slope is zero, however, the model stops learning. In mathematics, the gradient is the vector of a function’s partial derivatives with respect to its inputs.

Gradient descent is currently the most widely used optimization technique in machine learning and deep learning. It is used to find the values of a function’s (f) parameters that minimize a cost, and it is especially helpful when those parameters cannot be determined analytically (for example, using linear algebra).

This technique is also simple to understand and implement. Everyone who deals with machine learning should be familiar with the term.

Let’s see how to use gradient descent in a machine learning or deep learning problem.

There are seven essential steps required to train any model:

  1. Initialize the weights.
  2. Make a prediction.
  3. Calculate the loss to see how good the model is.
  4. Calculate the gradient, which tells us how to change the weights to minimize the loss.
  5. Step the weights (an easy way to determine whether each weight should be increased or decreased).
  6. Return to step 2 and repeat the procedure.
  7. Iterate until you reach a point where you want to end the training process.

These seven steps are the key to training all deep learning models.
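
To make these steps concrete, here is a minimal sketch of the loop in plain Python for a single weight (the names w, lr, and grad are my own, and the toy quadratic loss is the one defined just below):

    w = 1.8                  # 1. initialize the weight
    lr = 0.1                 # the step size (the learning rate, explained later)

    for epoch in range(20):  # 7. iterate for a fixed number of steps
        loss = w**2          # 2.-3. predict and calculate the loss
        grad = 2*w           # 4. gradient of the loss with respect to w
        w -= lr * grad       # 5. step the weight against the gradient
        # 6. the loop then returns to the prediction step

    print(w)                 # w has moved close to the minimum at 0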

Let’s take a look at how they appear in a simplified scenario. We start by defining a straightforward function, for example the quadratic function f(x) = x**2, and pretend it is our loss function, with x as the function’s weight parameter:
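
In code, this toy loss is a one-liner (a minimal sketch; plotting it over a range of x values produces the parabola that the figures below refer to):

    def f(x): return x**2   # our toy loss; x plays the role of a weight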

How do we minimize the loss?

To minimize the function above, we must find the value of x that results in the smallest value of f(x), which is represented by the red dot. Since this is a 2D graph, finding the minimum is relatively simple, but this is not always the case, especially in higher dimensions. In such cases, we must formulate a strategy for finding the minimum, known as gradient descent.

The procedure we discussed earlier begins by initializing the parameter to a random value and calculating the loss:

We first initialize x to 1.8, and we can see that the loss (red dot) is between 3 and 3.5.

Now let’s investigate what happens if we increase or decrease our parameter.

If we increase the value of x to 2, we can see that our loss function increases.

If x takes the value 1.3, the loss function decreases, as you can see in the figure.
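
You can check these readings against the function itself:

    f(1.8)   # 3.24, the red dot, between 3 and 3.5
    f(2.0)   # 4.00, increasing x increases the loss
    f(1.3)   # 1.69, decreasing x decreases the loss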

Depending on the specific task, you might increase or decrease the parameter, and once you’ve decided which direction to take, you can take large or small steps toward your destination.

Note: For this example, decreasing x leads to minimizing the loss function.

All of this can be traced back to Isaac Newton, who demonstrated that derivatives can be used to optimize a function. The process is sped up by finding better step sizes through the learning rate.

With the assistance of derivatives, the Gradient Descent Algorithm enables us to make these decisions efficiently and effectively.

A derivative is a calculus concept that refers to the slope of a graph at a particular point. The slope is represented by the tangent line to the graph at that point. Thus, if we can compute this tangent line, we can determine the direction to take toward the minimum.

Forgot how to take derivatives? Never mind :p Our savior, PyTorch, is here.

How all of this is done by PyTorch

PyTorch is capable of computing the derivative of nearly any function automatically! Additionally, it accomplishes this task quickly.

To begin, let us choose a tensor value for which we want gradients:

import torch

xt = torch.tensor(3.).requires_grad_()

requires_grad_() tells PyTorch that we want to calculate gradients with respect to that variable at that value.

Now, let’s calculate the function with that given value.

PyTorch will calculate the gradients for us, and we can view them by checking the grad attribute.

That’s correct, because the derivative of x**2 is 2*x; therefore, 2*3 = 6.
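
Pieced together, the whole computation takes only a few lines (yt is my own name for the function’s output, not from the original figures):

    import torch

    xt = torch.tensor(3.).requires_grad_()   # track gradients with respect to xt
    yt = xt**2                                # calculate the function at that value
    yt.backward()                             # PyTorch computes the gradient d(yt)/d(xt)
    print(xt.grad)                            # tensor(6.), indeed 2 * 3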

Learning rate

The learning rate (lr) controls the size of the steps taken toward the minimum. A big step covers a large area but may overshoot the minimum; on the other hand, small steps make the computation longer. The actual step is the gradient multiplied by this small value. The learning rate is usually set between 0.001 and 0.1, but it can be any value.

w -= gradient(w) * lr
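
Combining this update rule with PyTorch’s autograd, a minimal sketch of full gradient descent on our toy loss might look like this (the names x, lr, and the step count are my own choices):

    import torch

    x = torch.tensor(1.8, requires_grad=True)   # same starting point as before
    lr = 0.1                                    # the learning rate

    for step in range(25):
        loss = x**2                 # forward pass on the toy loss
        loss.backward()             # compute the gradient d(loss)/dx
        with torch.no_grad():
            x -= x.grad * lr        # the update rule: w -= gradient(w) * lr
        x.grad.zero_()              # reset the gradient for the next step

    print(x.item())                 # close to 0, the minimum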

GOOD TO KNOW: THE DIFFERENT TYPES OF GRADIENT DESCENT

There are three widely used forms of gradient descent, which differ primarily in how much data they use for each update:

BATCH GRADIENT DESCENT

Batch gradient descent, also known as vanilla gradient descent, calculates the error for each example in the training dataset but does not update the model until all training examples have been evaluated. This whole cycle through the dataset is referred to as a training epoch.

STOCHASTIC GRADIENT DESCENT

In comparison, stochastic gradient descent (SGD) does this for each training example in the dataset, updating the parameters one by one. This can make SGD faster than batch gradient descent, depending on the problem. One benefit is that the regular updates allow us to keep a reasonably detailed track of our progress.

MINI-BATCH GRADIENT DESCENT

Mini-batch gradient descent is the preferred approach since it combines the principles of SGD and batch gradient descent. It simply divides the training dataset into small batches and performs an update after each batch. This strikes a balance between stochastic gradient descent’s robustness and batch gradient descent’s efficiency.
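
To see the difference concretely: in PyTorch, the three variants amount to nothing more than the batch size handed to a DataLoader. Here is a hedged sketch with a made-up dataset:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # a made-up dataset of 320 examples, purely for illustration
    train_ds = TensorDataset(torch.randn(320, 10), torch.randn(320, 1))

    batch_gd   = DataLoader(train_ds, batch_size=len(train_ds))  # batch GD: one update per epoch
    sgd        = DataLoader(train_ds, batch_size=1)              # SGD: one update per example
    mini_batch = DataLoader(train_ds, batch_size=32)             # mini-batch: a balance of the two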

References

https://towardsdatascience.com/understanding-the-mathematics-behind-gradient-descent-dde5dc9be06e

https://machinelearningmastery.com/gradient-descent-for-machine-learning/

https://towardsdatascience.com/machine-learning-101-an-intuitive-introduction-to-gradient-descent-366b77b52645

https://builtin.com/data-science/gradient-descent
