Day 69: rmsprop

Tomáš Bouda
100 days of algorithms
3 min read · Jun 1, 2017

What about a machine learning related topic today? Rmsprop is a gradient-based optimization technique proposed by Geoffrey Hinton in his Neural Networks Coursera course.

The concept of neural networks has been known for decades, but for a long time researchers failed to train anything beyond the simplest networks. There were several reasons for that, and one that was particularly hard to address was the magnitude of the gradient.

Gradients of very complex functions like neural networks have a tendency to either vanish or explode as the error is propagated back through the function. The effect is cumulative: the more complex the function is, the worse the problem becomes.

Rmsprop is a very clever way to deal with the problem. It keeps a moving average of the squared gradients and uses it to normalize the gradient itself. That has the effect of balancing the step size: the step is decreased for large gradients to avoid exploding, and increased for small gradients to avoid vanishing.
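Written as an update rule, with gradient g, decay rate γ, learning rate η and a small ε for numerical stability, this is what the rmsprop code further below implements:

$$
v_t = \gamma\, v_{t-1} + (1 - \gamma)\, g_t^2,
\qquad
x_{t+1} = x_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t
$$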

I have implemented three gradient-based techniques.

  • [simple] gradient descent
  • rmsprop
  • rmsprop with momentum

I will use them to find the inverse of a matrix A. The loss function is the squared difference between AX and the identity matrix I, together with its corresponding derivative.
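Written out explicitly (this matches the F and dF functions defined in the function section below), the loss and its gradient are:

$$
F(X) = \lVert AX - I \rVert_F^2,
\qquad
\nabla F(X) = 2\,A^\top (AX - I)
$$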

Here is the plot of the loss function for each method with respect to the number of steps.

I need to say that this example is not representative. Rmsprop was developed as a stochastic technique for mini-batch learning, and on this toy function it has little chance to show how good it really is.

It is no surprise that momentum is the real deal; still, rmsprop on its own is doing quite well.

https://github.com/coells/100days

https://notebooks.azure.com/coells/libraries/100days

algorithm

import numpy as np


def gradient_descent(F, dF, x, steps=100, lr=0.001):
    loss = []

    for _ in range(steps):
        # plain gradient step, scaled only by the learning rate
        dx = dF(x)
        x -= lr * dx
        loss.append(F(x))

    return x, loss


def rmsprop(F, dF, x, steps=100, lr=0.001, decay=.9, eps=1e-8):
    loss = []
    dx_mean_sqr = np.zeros(x.shape, dtype=float)

    for _ in range(steps):
        dx = dF(x)
        # moving average of squared gradients
        dx_mean_sqr = decay * dx_mean_sqr + (1 - decay) * dx ** 2
        # normalize the step by the root of the average; eps avoids division by zero
        x -= lr * dx / (np.sqrt(dx_mean_sqr) + eps)
        loss.append(F(x))

    return x, loss


def rmsprop_momentum(F, dF, x, steps=100, lr=0.001,
                     decay=.9, eps=1e-8, mu=.9):
    loss = []
    dx_mean_sqr = np.zeros(x.shape, dtype=float)
    momentum = np.zeros(x.shape, dtype=float)

    for _ in range(steps):
        dx = dF(x)
        dx_mean_sqr = decay * dx_mean_sqr + (1 - decay) * dx ** 2
        # accumulate the normalized gradient into a momentum term
        momentum = mu * momentum + \
            lr * dx / (np.sqrt(dx_mean_sqr) + eps)
        x -= momentum
        loss.append(F(x))

    return x, loss

function

def F(x):
    # squared (Frobenius) distance between A @ x and the identity matrix
    residual = A @ x - np.eye(len(A), dtype=float)
    return np.sum(residual ** 2)


def dF(x):
    # gradient of F with respect to x: 2 * A.T @ (A @ x - I)
    return 2 * A.T @ (A @ x - np.eye(len(A), dtype=float))


A = np.array([
    [2, 5, 1, 4, 6],
    [3, 5, 0, 0, 0],
    [1, 1, 0, 3, 8],
    [6, 6, 2, 2, 1],
    [8, 3, 5, 1, 4],
], dtype=float)

optimization

Check the notebook for the full code, results and plots.

X1, loss1 = gradient_descent(F, dF, A * 0, steps=300)
X2, loss2 = rmsprop(F, dF, A * 0, steps=300)
X3, loss3 = rmsprop_momentum(F, dF, A * 0, steps=300)
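
As a quick sanity check outside the notebook, you can measure how far each solution is from the exact inverse computed by numpy (a small sketch of my own, not part of the original code):

# compare each optimized X with numpy's exact inverse of A
X_exact = np.linalg.inv(A)
for name, X in [('gradient descent', X1), ('rmsprop', X2), ('rmsprop + momentum', X3)]:
    print(name, np.abs(X - X_exact).max())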
