Optimization methods in machine learning

Matthew Wasserman
6 min read · Aug 21, 2019


The “learning” in machine learning is just another way of saying optimization. All of the models that we build learn to generalize to other data points by optimizing a function of weights and biases (and maybe some other parameters in more complicated networks).

When you are training a complex deep neural network with multitudes of nodes, each with its own weights and biases, speed is king. Do you really want to wait days, or even weeks, to train your model?

Gradient descent is king

Complex function

When looking at the complex equations of backpropagation, it is clear that we cannot analytically solve for the point where the derivative of the loss function is zero. We need a numerical method, ideally one that only uses first-order derivatives, since we do not want the extra computational cost of calculating second-order derivatives (as in methods like BFGS).

(Relying only on first-order derivatives leads to the problem of saddle points, but that topic is for another blog post.)

I will discuss a few optimizers and show why their convergence speeds differ. Here is a handy dandy visualization.

Vanilla Gradient Descent

I’ll start with the numerical method that you probably learned first on your machine learning journey.

Credit: Andrew Ng

At each step, the gradient is calculated at every data point, and each parameter is updated by subtracting that gradient, multiplied by a learning rate, from the parameter. This is repeated until we (hopefully) reach the global minimum.
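To make the update concrete, here is a minimal sketch of vanilla (batch) gradient descent for a simple linear model with a mean-squared-error loss; the model, loss, and variable names are my own illustration, not notation from the course.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    """Vanilla gradient descent on mean squared error for a linear model y ~ X @ w + b."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        # The gradient is computed over EVERY data point at each step.
        error = X @ w + b - y             # shape (n,)
        grad_w = (2 / n) * (X.T @ error)  # dLoss/dw
        grad_b = (2 / n) * error.sum()    # dLoss/db
        # Update: subtract the gradient scaled by the learning rate.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```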

What’s wrong with gradient descent?

The phrase “at every data point” above is not there by accident. With the massive number of data points and nodes in most deep learning models, calculating the gradient on the entire dataset at each step of the algorithm is very computationally expensive.

This means it can be quite a slow process to train our model.

Batch-itize it!

A way to speed up gradient descent is to create mini-batches of your data set. This allows us to go through multiple iterations of gradient descent within each epoch. An epoch is one complete run through all the training examples, so if you have n batches you will have n iterations in each epoch.

Mini-batch gradient descent is done by splitting your data into equally sized partitions. The general rule of thumb is to make your batch size a power of 2, as this makes the computer’s job a bit easier.

The algorithm is essentially the same as vanilla gradient descent, but the gradient is calculated on a mini-batch rather than on the entire data set at each iteration.
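A minimal sketch of the mini-batch variant, reusing the hypothetical linear-model setup from the previous sketch:

```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.01, epochs=10, batch_size=32):
    """Same update rule as above, but each step only sees one mini-batch."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        idx = np.random.permutation(n)          # shuffle once per epoch
        for start in range(0, n, batch_size):   # ~n/batch_size updates per epoch
            batch = idx[start:start + batch_size]
            error = X[batch] @ w + b - y[batch]
            w -= lr * (2 / len(batch)) * (X[batch].T @ error)
            b -= lr * (2 / len(batch)) * error.sum()
    return w, b
```

With n training examples and a batch size of 32, you get roughly n/32 parameter updates per epoch instead of just one.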

Credit: Andrew Ng (GOAT)

Mini-batch is faster than vanilla gradient descent, but the path towards the minimum will be much noisier. This happens because, depending on the batch, you may get outliers or a gradient that points away from the minimum, so an individual step can send you further from it.

Stochastic Gradient Descent

SGD takes gradient descent to the next level. Instead of marching through the batches in a fixed order, you choose a random batch (in the strictest definition, a single random example) to use at every single iteration.
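In code, the only real change from the mini-batch sketch is how the batch is chosen: each iteration samples one at random rather than walking through them in order. A rough sketch, with the variable names again my own:

```python
import numpy as np

def sgd_step(X, y, w, b, lr=0.01, batch_size=32):
    """One stochastic update: sample a random batch, then apply the usual rule."""
    batch = np.random.choice(len(y), size=batch_size, replace=False)
    error = X[batch] @ w + b - y[batch]
    w = w - lr * (2 / batch_size) * (X[batch].T @ error)
    b = b - lr * (2 / batch_size) * error.sum()
    return w, b
```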

Speed up SGD with MOMENTUM

Did you notice how noisy the curve is as it approaches the minimum? Momentum can fix this, while at the same time speeding up convergence.

First you need to understand exponentially weighted moving averages (EWMAs).

Applying an EWMA smooths out the noise in a curve. If you apply this idea to stochastic gradient descent, the path will still be noisy in the initial iterations of the algorithm, but it smooths out as the algorithm approaches convergence.
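As a tiny illustration of what an EWMA does to a noisy sequence (the sequence here is made up, not training data):

```python
import numpy as np

noisy = np.random.randn(200).cumsum()   # a made-up noisy curve
beta = 0.9
smoothed, v = [], 0.0
for x in noisy:
    v = beta * v + (1 - beta) * x       # keep mostly the old average, add a little of the new point
    smoothed.append(v)
# `smoothed` follows `noisy`, but with the jitter averaged out.
```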

As you can see in the implementation above, at each iteration the moving averages for both the weight and bias gradients mix the current gradient (weighted by 1 − β) with the average from the previous step (weighted by β), so earlier gradients are carried forward with geometrically decaying influence.
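Since the figure does not reproduce here, this is a sketch of that update rule in the common formulation v = βv + (1 − β)·gradient, with β around 0.9; the function signature is my own:

```python
def momentum_step(grad_w, grad_b, v_w, v_b, w, b, lr=0.01, beta=0.9):
    """Gradient descent with momentum: keep an EWMA of the gradients and step along it."""
    v_w = beta * v_w + (1 - beta) * grad_w   # EWMA of the weight gradients
    v_b = beta * v_b + (1 - beta) * grad_b   # EWMA of the bias gradients
    w = w - lr * v_w
    b = b - lr * v_b
    return w, b, v_w, v_b
```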

From the starting point we still have a jagged noisy curve, which becomes less noisy over time.

RMSprop

RMSprop was created by Geoffrey Hinton and disseminated not through an academic paper but in a lecture of his Coursera MOOC.

Credit: Andrew Ng

The weight and bias gradients are still tracked with an EWMA as in momentum, but the average is taken over the element-wise square of the gradient, and the update divides the gradient by the square root of that EWMA. In the illustrative example from the lecture, the gradient with respect to the bias (the vertical, oscillating direction) is large while the gradient with respect to the weights (the horizontal direction) is small, so the division shrinks the vertical component, smoothing out the curve, and leaves the horizontal component relatively large, speeding up convergence.
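A sketch of that update; the small ε added to the denominator for numerical stability is a standard addition, and the variable names are mine:

```python
import numpy as np

def rmsprop_step(grad_w, grad_b, s_w, s_b, w, b, lr=0.01, beta=0.9, eps=1e-8):
    """RMSprop: EWMA of the element-wise SQUARED gradients, step divided by its square root."""
    s_w = beta * s_w + (1 - beta) * grad_w ** 2
    s_b = beta * s_b + (1 - beta) * grad_b ** 2
    w = w - lr * grad_w / (np.sqrt(s_w) + eps)   # directions with large gradients get damped
    b = b - lr * grad_b / (np.sqrt(s_b) + eps)   # directions with small gradients keep relatively bigger steps
    return w, b, s_w, s_b
```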

ADAM

ADAM takes the best of both RMSprop and Momentum and puts them together.

ADAM stands for Adaptive Moment Estimation, since it estimates the 1st and 2nd moments of the gradient using parameters β1 and β2 (below).

Credit: Andrew Ng

In the first stage we compute a momentum-style EWMA of the gradient with parameter β1 and an RMSprop-style EWMA of the squared gradient with parameter β2. Second, we apply a bias correction to both EWMAs, which compensates for the averages starting at zero and smooths out the initial iterations. Finally, when we update W and b, we apply the RMSprop-style update, using the bias-corrected momentum term in place of the raw gradient.
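Putting the three stages together in one sketch (β1 ≈ 0.9, β2 ≈ 0.999, and ε are the commonly used defaults; t is the iteration counter starting at 1, and the function signature is my own):

```python
import numpy as np

def adam_step(grad, param, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update for a single parameter (apply to W and b separately); t starts at 1."""
    v = beta1 * v + (1 - beta1) * grad         # momentum-style EWMA (1st moment)
    s = beta2 * s + (1 - beta2) * grad ** 2    # RMSprop-style EWMA (2nd moment)
    v_hat = v / (1 - beta1 ** t)               # bias correction for the early iterations
    s_hat = s / (1 - beta2 ** t)
    param = param - lr * v_hat / (np.sqrt(s_hat) + eps)   # RMSprop-style update with corrected moments
    return param, v, s
```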

Conclusion

There are many different optimizers you can use with your deep learning model, and you may find one works better than another. New optimizers continue to come out rapidly.

Just last week Rectified ADAM (RAdam) was proposed, which I will link to below.

Special thanks to Andrew Ng’s deep learning Coursera MOOC, where you can find more detailed explanations of all of the optimizers I discussed.
