RMSprop

RMSprop, which stands for root mean square propagation, is another algorithm that can accelerate gradient descent. Like gradient descent with momentum, it relies on an exponentially weighted average of the gradients, but it differs in how the parameters are updated.

How does it work?

Consider an example where we are optimizing a cost function whose contours look like the ones below, with the red dot marking the location of the local optimum (minimum).

We start gradient descent from point ‘A’ and, after one iteration, end up at point ‘B’ on the other side of the ellipse. The next step can then land us at point ‘C’. With each iteration of gradient descent we oscillate up and down while slowly stepping towards the local optimum. If we used a higher learning rate, these vertical oscillations would be even larger. The vertical oscillation therefore slows down our gradient descent and prevents us from using a much higher learning rate.

In this example the bias ‘b’ is responsible for the vertical oscillations, whereas the weight ‘W’ drives the movement in the horizontal direction. If we slow down the updates to ‘b’, the vertical oscillations are dampened, and if we update ‘W’ with larger values, we can still move quickly towards the local optimum.

Implementation

During backward propagation we compute the gradients dW and db, and plain gradient descent uses them to update the parameters W and b as follows:

W = W - learning rate * dW

b = b - learning rate * db
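
As a point of reference, here is a minimal sketch of this plain update in Python. The function name gd_update and the default learning rate of 0.01 are illustrative choices, not part of the original material.

    def gd_update(W, b, dW, db, learning_rate=0.01):
        """One step of plain gradient descent on the parameters W and b."""
        W = W - learning_rate * dW    # move W against its gradient
        b = b - learning_rate * db    # move b against its gradient
        return W, b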

In RMSprop, instead of using dW and db directly, on each iteration we take exponentially weighted averages of their element-wise squares:

SdW = β * SdW + (1 - β) * dW²

Sdb = β * Sdb + (1 - β) * db²

Here beta ‘β’ is another hyperparameter, often called the momentum or decay rate, ranging from 0 to 1. When the new weighted average is calculated, β sets the weighting between the average of the previous values and the current value.
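
A tiny numerical sketch of how β splits the weight between the old average and the current value; the gradient values in the list are made up purely for illustration. With β = 0.9, each update keeps 90% of the previous average and adds 10% of the current squared gradient, so the average roughly covers the last 1 / (1 - β) = 10 values.

    beta = 0.9
    S = 0.0                                     # running average of squared gradients
    for grad in [0.5, -0.3, 0.8, -0.6, 0.4]:    # made-up gradient values
        S = beta * S + (1 - beta) * grad ** 2
        print(f"grad^2 = {grad ** 2:.2f}, running average S = {S:.4f}")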

After calculating the exponentially weighted averages, we update the parameters:

W = W - learning rate * dW / sqrt(SdW)

b = b - learning rate * db / sqrt(Sdb)

In our example the horizontal gradients dW are small, so SdW is relatively small and we divide dW by a small number, which speeds up movement in the horizontal direction. The vertical gradients db are large, so Sdb is relatively large and we divide db by a comparatively large number, which slows down the changes in the vertical direction and damps the oscillations.
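
Putting both steps together, here is a sketch of one full RMSprop update in Python with NumPy. The function name and default values are assumptions for illustration, and the small epsilon added to the denominator is a common practical safeguard against dividing by a near-zero average; it does not appear in the formulas above.

    import numpy as np

    def rmsprop_update(W, b, dW, db, SdW, Sdb,
                       learning_rate=0.001, beta=0.9, epsilon=1e-8):
        """One RMSprop step: refresh the squared-gradient averages, then update W and b."""
        SdW = beta * SdW + (1 - beta) * np.square(dW)   # average of dW²
        Sdb = beta * Sdb + (1 - beta) * np.square(db)   # average of db²
        # epsilon keeps the division stable when an average is close to zero
        W = W - learning_rate * dW / (np.sqrt(SdW) + epsilon)
        b = b - learning_rate * db / (np.sqrt(Sdb) + epsilon)
        return W, b, SdW, Sdb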

How to choose Beta?

  • Beta should be high in order to smooth out the updates, because a higher value gives more weight to past gradients.
  • The default value β = 0.9 is suggested, but it can be tuned between 0.8 and 0.999 if needed.
  • Because past gradients are taken into account, the gradient estimates are smoothed out. RMSprop can be used with batch gradient descent, mini-batch gradient descent, or stochastic gradient descent, as in the sketch after this list.
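
To make the last point concrete, below is a hypothetical mini-batch training loop that applies the RMSprop update on every batch. The helper compute_gradients is a stand-in for whatever routine returns dW and db for one mini-batch; it is not defined in the original text, and all default values are illustrative.

    import numpy as np

    def train_with_rmsprop(W, b, mini_batches, compute_gradients,
                           learning_rate=0.001, beta=0.9, epsilon=1e-8, epochs=10):
        """Sketch of RMSprop over mini-batches (compute_gradients is an assumed helper)."""
        SdW = np.zeros_like(W)    # running averages start at zero,
        Sdb = np.zeros_like(b)    # with the same shapes as W and b
        for _ in range(epochs):
            for X_batch, y_batch in mini_batches:
                dW, db = compute_gradients(W, b, X_batch, y_batch)
                SdW = beta * SdW + (1 - beta) * np.square(dW)
                Sdb = beta * Sdb + (1 - beta) * np.square(db)
                W = W - learning_rate * dW / (np.sqrt(SdW) + epsilon)
                b = b - learning_rate * db / (np.sqrt(Sdb) + epsilon)
        return W, b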

References:

Deep Learning Specialization by Andrew Ng
