Saket Chaturvedi
Published in Analytics Vidhya
5 min read · May 24, 2020


Artificial Neural Networks, Part 2 — Understanding Gradient Descent (without the math)

Photo by Nicole Y-C on Unsplash

In post 1 of this series, we went over the basics of artificial neural networks and their components: nodes, weights, biases, activation functions, and so on. This post covers another major topic: I will explain the concept of Gradient Descent, without using a lot of math.

The main motive behind gradient descent is to update the weights and biases until the cost/loss function reaches its minimum value. This results in a model that is capable of nearly correct predictions. In other words, we are trying to optimize the loss function.

First, let's go over some of the key terms —

  • Cost Function — Also called the Loss Function. This is the error that we want to minimize. It depends on the inputs, weights, biases, and the target values.
  • Derivatives — The slope of a function at a point x. A derivative shows the rate of change of one variable with respect to another, written as dy/dx. It tells us how to change the value of x to make the required change in y.
  • Global Minimum — The point where the cost function attains its lowest value. There is always exactly one global minimum value (though it may be attained at more than one point).
Minima and Maxima — Wiki
  • Local Minimum — The cost function is often assumed to be a convex function, but in practice it rarely is. There can be many points where the cost function reaches a minimum relative to its surroundings, so there can be more than one local minimum.
  • Forward Propagation — The input gives the initial information and then moves through each layer of the network towards the final output layer.
Forward Propagation
  • Backpropagation — The final layer computes the error and the error flows back through the layers to compute the gradient with respect to other variables. Gradient descent uses this gradient value to perform the learning.
  • Partial Derivatives — Used when multiple inputs are involved and the function f depends on several variables, not just x. A partial derivative measures the change in f when one of those variables changes while everything else is held constant.
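The derivative term above can be sketched in a few lines of plain Python. This is a minimal illustration using a hypothetical convex function f (not from the post), approximating the slope numerically with a central difference:

```python
def f(x):
    # Example convex function with its minimum at x = 2.
    return (x - 2) ** 2

def derivative(f, x, h=1e-6):
    # Central-difference approximation of df/dx at the point x.
    return (f(x + h) - f(x - h)) / (2 * h)

print(round(derivative(f, 5.0), 3))   # positive slope to the right of the minimum
print(round(derivative(f, -1.0), 3))  # negative slope to the left of the minimum
```

The sign of the slope already hints at which direction to move x in order to decrease f, which is exactly what gradient descent exploits.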

To understand the process, consider the figure above. Let's assume we have a function f(x) that we want to minimize. Our process starts at a random point (A) and the goal is to reach the lowest point (B). To reach point B, we take small steps, calculating the derivative at each point to understand how changing the input will change the output.

But how is the direction of the movement decided? The direction of movement is decided by the sign of the derivative at the current point.

  • If the derivative of f(x) is less than 0, the point sits on the downward-sloping (left) side of the curve in the image above, and we move to the right.
  • If the derivative of f(x) is greater than 0, the point sits on the upward-sloping (right) side of the curve, and we move to the left.
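The update rule behind these two cases can be sketched directly: move opposite to the sign of the derivative. This uses the same hypothetical convex function (minimum at x = 2) as an illustration:

```python
def df(x):
    # Analytic derivative of f(x) = (x - 2)**2.
    return 2 * (x - 2)

def step(x, lr=0.1):
    # If df(x) > 0 we move left; if df(x) < 0 we move right.
    return x - lr * df(x)

print(round(step(5.0), 2))   # 4.4  (derivative positive, so x decreases)
print(round(step(-1.0), 2))  # -0.4 (derivative negative, so x increases)
```

Either way, subtracting the scaled derivative moves x towards the minimum.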

This is a step-by-step process, and the size of each step taken towards the minimum is controlled by the Learning Rate. You can certainly use a higher learning rate and converge faster, but with a high chance of overshooting the minimum and skipping it altogether.

Large step size

With a lower learning rate, it can take a long time to converge.

Small step size

The value of the learning rate can be selected using various methods. One approach is to set it to a small constant value; another is to try a set of candidate values and select the one that produces the best result.
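The second approach can be sketched as a small search over candidate values. The toy loss and the specific candidates here are illustrative choices, not recommendations; note how the largest value overshoots and ends up far from the minimum:

```python
def f(x):
    return (x - 2) ** 2       # toy loss, minimum at x = 2

def df(x):
    return 2 * (x - 2)

def run(lr, x=5.0, steps=50):
    # Run gradient descent for a fixed number of steps, return final loss.
    for _ in range(steps):
        x = x - lr * df(x)
    return f(x)

candidates = [0.001, 0.01, 0.1, 1.1]   # 1.1 overshoots and diverges
best = min(candidates, key=run)        # keep the rate with the lowest final loss
print(best)  # 0.1
```

In practice this is the idea behind a learning-rate sweep: train briefly with each candidate and keep the one with the best validation loss.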

So, gradient descent takes small steps in an iterative fashion, gradually arriving at a configuration of weights and biases that approaches an optimal solution.

In cases where we have multiple inputs, which is the norm in real problems, we work with partial derivatives. Here the function depends not just on the value of x, but on the other variables involved as well. A partial derivative captures the change in the value of a multivariable function f with respect to a change in just one of those variables, keeping everything else constant.

In the case of the loss function being a function of the weights (w), biases (b), and inputs, a partial derivative captures the change in the loss when one of these changes while everything else is held constant; for example, the effect of changing the weights (w) on the output of the loss function. This helps in finding the values of the weights that produce the optimal cost function output.
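As an illustration, here is a minimal sketch of partial derivatives for a hypothetical single-neuron squared-error loss depending on a weight w and a bias b. Each partial derivative perturbs only one variable while holding the other fixed:

```python
def loss(w, b):
    x, y = 3.0, 10.0          # one training example: input and target
    pred = w * x + b          # linear prediction
    return (pred - y) ** 2    # squared error

def d_loss_dw(w, b, h=1e-6):
    # Change only w; b is held constant.
    return (loss(w + h, b) - loss(w - h, b)) / (2 * h)

def d_loss_db(w, b, h=1e-6):
    # Change only b; w is held constant.
    return (loss(w, b + h) - loss(w, b - h)) / (2 * h)

print(round(d_loss_dw(1.0, 0.0), 3))  # analytic value: 2*(3 - 10)*3 = -42
print(round(d_loss_db(1.0, 0.0), 3))  # analytic value: 2*(3 - 10)*1 = -14
```

Both partial derivatives are negative, telling us to increase w and b to reduce the loss, which is exactly the update gradient descent would make.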

However, performing gradient descent over the whole dataset at every iteration becomes computationally expensive when the dataset is very large, as each single step takes a long time. To mitigate this, Stochastic Gradient Descent (SGD) is used. It treats the gradient as an expectation whose value can be estimated from a small subset of the actual data. At every step, a minibatch of data is drawn uniformly from the training set, and the gradients computed on it are used to update the model parameters. This provides a benefit over full-batch gradient descent in terms of faster progress towards the minimum, finding a value of the cost function that is useful but not necessarily the best.
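The minibatch idea can be sketched on a toy linear model y = w*x + b with synthetic data; the batch size and learning rate here are illustrative choices, not recommendations:

```python
import random

random.seed(0)
# Synthetic dataset: 1000 points exactly on the line y = 2x + 1.
data = [(i / 100, 2.0 * (i / 100) + 1.0) for i in range(1000)]

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    batch = random.sample(data, 32)   # minibatch drawn uniformly from the training set
    # Gradients of the mean squared error, estimated on the minibatch only.
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
    gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
    w -= lr * gw                      # parameter update from the minibatch gradient
    b -= lr * gb

print(round(w, 2), round(b, 2))  # converges close to the true values w = 2, b = 1
```

Each step only touches 32 points instead of all 1000, which is where the speed advantage comes from on genuinely large datasets.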

Let us go over the steps to train a Neural Network with a flowchart —

When implementing a neural network with a library like Keras or TensorFlow, the implementation of backpropagation and optimization is taken care of by the library, and you will not need to worry about it. What you would need to decide is the number of layers, the right activation functions, the loss function, and the data preprocessing.
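As a taste of what that looks like in practice, here is a hedged sketch of a small Keras model: you declare the layers, activations, loss, and optimizer, and backpropagation happens inside model.fit(). The layer sizes and the random dataset are arbitrary placeholders, not from this post:

```python
import numpy as np
from tensorflow import keras

# Two-layer network: you choose the architecture, the library does the rest.
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")

# Tiny random dataset just to show the training call.
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=(100, 1))
model.fit(X, y, batch_size=32, epochs=5, verbose=0)  # SGD and backprop run here
```

Note that batch_size=32 means each gradient update is computed on a minibatch, i.e. the library is running the stochastic gradient descent described above.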

In the next part, we will get into the implementation of neural networks using Keras and TensorFlow.
