Briefly Gradient Descent

Shakil Ahmed Sumon
3 min read · Mar 8, 2019


Previously, we talked about optimizers. You can recap that post here if you want to.

As you are here, you probably already know that gradient descent is a minimization algorithm. It minimizes a given function, which in machine learning is our very own cost function. Gradient descent minimizes the cost function by updating the learnable parameters of the machine learning model. Let’s say θ is our learnable parameter and J(θ) is our cost function. Gradient descent will update the parameter with this rule:

θ := θ − α∇J(θ)

where α is the learning rate and ∇J(θ) is the gradient of the cost function with respect to θ. The gradient is the vector of partial derivatives calculated with respect to multiple variables; with a single variable it is just the ordinary derivative. I have kept the gradient sign here to avoid confusion, since in machine learning we deal with many parameters.

A gradient, in other words, is the slope of a function. The magnitude of the gradient reflects the steepness of the slope, and its direction points toward where the function is increasing. We move the parameters in the opposite direction, by an amount proportional to the magnitude of the gradient, since our goal is to move downwards.
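To make the update rule concrete, here is a minimal sketch in plain Python that minimizes a hypothetical one-variable cost J(θ) = (θ − 3)², whose gradient is 2(θ − 3). The cost function, the starting point, and the learning rate are all illustrative choices, not anything prescribed by gradient descent itself.

```python
# Minimal gradient descent on a hypothetical cost J(theta) = (theta - 3)^2.
# Its gradient is dJ/dtheta = 2 * (theta - 3), so the minimum sits at theta = 3.

def grad(theta):
    return 2 * (theta - 3)

theta = 0.0   # arbitrary starting point
alpha = 0.1   # learning rate (illustrative value)

for step in range(50):
    theta = theta - alpha * grad(theta)   # theta := theta - alpha * grad J(theta)

print(theta)  # ends up very close to 3, the minimum of J
```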

The learning rate α is a constant which plays a very important role in gradient descent. It determines how big or small the jump will be from the current position towards the desired direction. The learning rate is a hyper-parameter which needs to be set manually and, I confess, tuning this particular hyper-parameter is a very tricky task. If the learning rate is set too large, gradient descent tends to overshoot and carries a very high risk of missing the minima. If it is set too small, the model will take a long time to converge, which is not computationally efficient.
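As a rough illustration of that trade-off, the sketch below reruns the same hypothetical cost with a learning rate that is too large and one that is too small. The specific values are just assumptions chosen to make the overshooting and slow-convergence behaviours visible.

```python
# Effect of the learning rate on the hypothetical cost J(theta) = (theta - 3)^2.

def grad(theta):
    return 2 * (theta - 3)

def run(alpha, steps=20, theta=0.0):
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

print(run(alpha=1.1))    # too large: every step overshoots and the iterates blow up
print(run(alpha=0.001))  # too small: barely moves towards 3 in 20 steps
print(run(alpha=0.1))    # reasonable: lands close to the minimum at 3
```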

There are several approaches we might take while implementing gradient descent. One of them is called batch gradient descent, which updates the parameters after evaluating all the samples in the dataset. This approach is computationally efficient, but for a non-convex cost function it tends to get stuck in sub-optimal local minima or at saddle points. To learn what a saddle point is, check this:

On the other hand, stochastic gradient descent updates the parameters after evaluating each individual training sample. These frequent updates are computationally costly and produce noisy gradients, which sometimes prevent the loss function from decreasing. There is another approach in between these two, called mini-batch gradient descent. This algorithm splits the dataset into small chunks of data, trains the model on one chunk at a time, and updates the parameters after each chunk. These chunks are popularly called mini batches in machine learning. Because every mini batch gives the optimization a slightly different gradient estimate, mini-batch gradient descent tackles the problem of sub-optimal local minima to some extent, but there is still scope for improvement.
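Here is a minimal sketch of all three variants, assuming a simple linear-regression cost and NumPy: with batch_size equal to the dataset size it behaves as batch gradient descent, with batch_size=1 it behaves as stochastic gradient descent, and anything in between is mini-batch gradient descent. The data, function names, and parameter values are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, epochs=100, batch_size=32, seed=0):
    """Mini-batch gradient descent for linear regression with a mean-squared-error cost.
    batch_size = len(X) -> batch gradient descent, batch_size = 1 -> stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                   # shuffle so each epoch sees different mini batches
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            error = Xb @ theta - yb
            grad = 2 * Xb.T @ error / len(Xb)      # gradient of the mean squared error on this batch
            theta -= alpha * grad                  # theta := theta - alpha * grad J(theta)
    return theta

# Illustrative data: y = 2*x + 1 plus noise, with a bias column appended to the features.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
X = np.hstack([x, np.ones((200, 1))])
y = 2 * x[:, 0] + 1 + 0.1 * rng.normal(size=200)

print(gradient_descent(X, y, batch_size=len(X)))  # batch gradient descent
print(gradient_descent(X, y, batch_size=1))       # stochastic gradient descent
print(gradient_descent(X, y, batch_size=32))      # mini-batch gradient descent
```

All three calls recover parameters close to [2, 1]; the difference lies in how many updates happen per epoch and how noisy each update is.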

We will discuss optimizing gradient descent in our next post.

