The journey of Gradient Descent — From Local to Global

Feb 25 · 5 min read

In my previous article, I covered the intuition behind gradient descent and implemented the mathematics it uses to reach the minimum of the cost function. It seemed as easy as walking down a hill (yes, that is what gradient descent is).

That article ended with this line: sometimes, instead of reaching the global minimum, the cost function gets stuck at a local minimum or a saddle point.

I would also like to show you something that may dampen your enthusiasm.

In deep learning, cost functions come in many shapes, and most of them are not the simple bowl-shaped curve we worked with earlier.

Instead, the surface is mostly non-convex, with many hills and valleys, and looks somewhat like this:

As shown, the cost function actually has many local minima, and we need to reach the global minimum.
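To make the problem concrete, here is a minimal sketch of plain gradient descent on a one-dimensional non-convex cost. The function `cost`, its derivative `cost_grad`, the learning rate and the starting points are all hypothetical choices made for illustration; none of them come from the previous article.

```python
def cost(w):
    # Hypothetical non-convex cost with two valleys (illustration only).
    return w**4 - 3 * w**2 + w


def cost_grad(w):
    # Derivative of the cost above.
    return 4 * w**3 - 6 * w + 1


def gradient_descent(w, lr=0.01, steps=2000):
    # Plain gradient descent: always step against the gradient.
    for _ in range(steps):
        w -= lr * cost_grad(w)
    return w


w_right = gradient_descent(2.0)   # starts on the right-hand slope
w_left = gradient_descent(-2.0)   # starts on the left-hand slope

print("from  2.0:", w_right, cost(w_right))
print("from -2.0:", w_left, cost(w_left))
```

With these settings the run that starts at 2.0 settles in the shallower valley near w ≈ 1.13, while the run that starts at -2.0 reaches the deeper valley near w ≈ -1.30. Same algorithm, same cost, different answer: where you end up depends on where you start.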

So let’s begin.

We cannot completely avoid getting stuck at a local minimum or a saddle point, but several techniques can help mitigate the problem.

Stochastic Gradient Descent (SGD) and Mini-Batch SGD

In batch gradient descent, the cost decreases smoothly, so once the parameters settle into a local minimum there is nothing to push them back out. In SGD and mini-batch SGD, each update uses only one example or a small batch, so the gradient is a noisy estimate and the cost fluctuates from step to step. This noise lowers the chance of getting stuck at a local minimum, and even if the parameters do land in one, the jolting movement gives them a good chance of escaping.
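The only mechanical difference between the three variants is how many examples feed each update. The sketch below shows this on a tiny synthetic linear-regression problem. The helper names (`make_data`, `grad_mse`, `minibatch_sgd`), the data and the hyperparameters are assumptions made for illustration. Note that this toy cost is convex, so it shows where the update noise comes from, not an actual escape from a local minimum.

```python
import random


def make_data(n=200, true_w=3.0, seed=0):
    # Synthetic data: y = true_w * x + small Gaussian noise.
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [true_w * x + rng.gauss(0, 0.1) for x in xs]
    return xs, ys


def grad_mse(w, xs, ys):
    # Gradient of mean((w * x - y) ** 2) with respect to w.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def minibatch_sgd(xs, ys, batch_size, lr=0.1, epochs=200, seed=1):
    # batch_size == 1        -> stochastic gradient descent
    # batch_size == len(xs)  -> batch gradient descent
    # anything in between    -> mini-batch SGD
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(idx)  # visit the examples in a new order every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            bx = [xs[i] for i in batch]
            by = [ys[i] for i in batch]
            w -= lr * grad_mse(w, bx, by)  # noisy estimate of the gradient
    return w


xs, ys = make_data()
for bs in (1, 32, len(xs)):
    print("batch size", bs, "-> w =", minibatch_sgd(xs, ys, batch_size=bs))
```

All three runs should land close to the true weight of 3.0. The smaller the batch, the noisier each individual step, and that noise is the "jolting movement" described above.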

Regularization

L1 Regularization (Lasso)

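With L1 regularization, a penalty proportional to the sum of the absolute values of the weights is added to the cost J(Y’, Y). At a local minimum the derivative of J(Y’, Y) is zero, but the regularizer term still contributes to the gradient, so the parameters keep getting updated instead of staying put. Updates stop only where the gradient of the full, regularized cost vanishes, and the hope is that this happens at or near the global minimum. In this way, regularization can help us avoid getting stuck at a local minimum.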
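The mechanical part of this argument is easy to check numerically. The sketch below uses a hypothetical toy cost J(w) = (w² − 1)², whose gradient is exactly zero at w = 1, and adds an L1 penalty λ·|w|. The names `cost_grad`, `l1_grad`, `update` and the value of λ are assumptions for illustration. It only shows that the update no longer vanishes where the gradient of J does; whether that push actually leads to a better minimum depends on the problem.

```python
def sign(w):
    # Sign of w: -1, 0 or 1 (0 is used as the subgradient at w == 0).
    return (w > 0) - (w < 0)


def cost_grad(w):
    # Gradient of the hypothetical toy cost J(w) = (w**2 - 1)**2.
    return 4 * w * (w**2 - 1)


def l1_grad(w, lam):
    # Gradient of the regularized cost J(w) + lam * |w|.
    return cost_grad(w) + lam * sign(w)


def update(w, lam, lr=0.1):
    # One gradient-descent step on the regularized cost.
    return w - lr * l1_grad(w, lam)


w = 1.0  # a stationary point of J: its gradient is zero here
print("gradient of J alone:     ", cost_grad(w))
print("gradient with L1 penalty:", l1_grad(w, lam=0.1))
print("w after one update:      ", update(w, lam=0.1))
```

Without the penalty (`lam=0.0`) the weight stays at 1.0. With it, the weight moves slightly toward zero, because the penalty always pulls in that direction.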

Momentum

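Momentum carries part of the previous update into the current one, so the parameters keep moving even where the gradient is close to zero. A common way to write the weight update is v_t = β·v_{t-1} + α·∇J(w_{t-1}), followed by w_t = w_{t-1} - v_t, where α is the learning rate.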

Here β is the momentum factor, which lies strictly between 0 and 1. It is another hyperparameter that can be tuned.

I found a good visualization of momentum on YouTube by AIQCAR. Do watch it.
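Here is a minimal sketch of a momentum update in code. It assumes one common convention, v ← β·v + α·∇J(w) followed by w ← w − v; other write-ups scale the gradient term differently, so treat the exact form, the function names `momentum_step` and `minimize`, and the default values as assumptions.

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    # v <- beta * v + lr * grad : blend the old velocity with the new gradient
    # w <- w - v                : move by the velocity, not the raw gradient
    v = beta * v + lr * grad
    return w - v, v


def minimize(grad_fn, w, lr=0.1, beta=0.9, steps=500):
    # Repeated momentum updates, starting from zero velocity.
    v = 0.0
    for _ in range(steps):
        w, v = momentum_step(w, v, grad_fn(w), lr, beta)
    return w


# On a flat spot the gradient is 0, so plain gradient descent would stop.
# With momentum, the velocity left over from earlier steps still moves w.
w_next, v_next = momentum_step(w=1.0, v=0.5, grad=0.0)
print("after a zero-gradient step:", w_next, v_next)

# Sanity check on the convex cost J(w) = w**2, whose gradient is 2 * w.
print("minimum found:", minimize(lambda w: 2 * w, w=5.0))
```

Setting `beta=0.0` switches the velocity memory off and recovers plain gradient descent. The closer β is to 1, the longer past gradients keep influencing the step.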

Altering the Learning Rate — α

The rule for altering the learning rate can be written in a simple form. One common version that fits the description below is α = α_0 / (1 + t/T).

Here, T and α_0 are hyperparameters that can be tuned. The step counter t runs from 0 to T, and α decreases as t grows. While t is still below T, the model is said to be in the searching phase. Because the learning rate shrinks over time, movement along the error surface becomes smoother.
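As a rough sketch, a decaying learning rate can be coded as below. The exact schedule is an assumption: α_t = α_0 / (1 + t/T) is one common form that matches the description (it starts at α_0 and falls as t grows), but it may not be the precise equation intended here. The function name `decayed_lr`, the toy cost and the default values are likewise hypothetical.

```python
def decayed_lr(t, alpha0=0.1, T=100):
    # Assumed schedule: alpha_t = alpha0 / (1 + t / T).
    # It equals alpha0 at t = 0 and alpha0 / 2 at t = T.
    return alpha0 / (1 + t / T)


def cost_grad(w):
    # Gradient of a hypothetical toy cost J(w) = w**2.
    return 2 * w


T = 100
w = 5.0
for t in range(T + 1):
    # Large steps early on, smaller and smoother steps later.
    w -= decayed_lr(t, T=T) * cost_grad(w)

print("learning rate at t = 0, 50, 100:",
      decayed_lr(0), decayed_lr(50), decayed_lr(100))
print("final w:", w)
```

A larger T keeps the learning rate close to α_0 for longer, while a smaller T makes the steps shrink sooner.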

Winding-up

Here is an illustration of the convergence speed of the different algorithms.

Here we come to an end.

See you in the next story.

Thanks.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data…


Written by

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
