Want your model to converge faster? Use RMSProp!
This is used to speed up Gradient Descent.
This article is a continuation of a previous series of articles. Here are links to the earlier stories, if you'd like to follow along.
Now, who doesn’t like Deep Learning? I suppose you do, or why else would you be reading this article? But Deep Learning is still young, and there is a lot left to be discovered in the field.
Despite the rapid advances and studies being done, there is still a ton of stuff to unveil. GPT-3 is an example of what the future could look like!
Now, we generally use Deep Learning when we deal with images because, in my opinion, that’s primarily where Neural Nets, and specifically Convolutional Neural Nets, shine. And there is data everywhere today. Think of a project right now.
‘Credit Card Fraud Detection’? ~ search on Google and you’ll find several datasets and/or models already up and running.
‘Car detection’? ~ already up and running.
‘Face Recognition’? ~ same.
You see? People are doing whatever it takes to come up with new and improved algorithms to solve problems. Research in the field is constant, and still, demand for Machine Learning will keep growing rapidly in the near future.
RMSProp (Root Mean Square Prop)
RMSProp is another well-known optimizer.
Root Mean Square Prop, or RMSProp, uses the same idea of an exponentially weighted average of the gradients as gradient descent with momentum, but it differs in how the parameters are updated.
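Before getting to the update rule, here is a minimal sketch of what an exponentially weighted average does to an oscillating sequence. The gradient values below are made up purely for illustration; β controls how much history is kept.

```python
beta = 0.9
gradients = [4.0, -3.5, 4.2, -3.8, 4.1]  # made-up oscillating gradient values

s = 0.0
smoothed = []
for g in gradients:
    # exponentially weighted average: mostly the old value, a little of the new
    s = beta * s + (1 - beta) * g
    smoothed.append(s)

print(smoothed)  # the smoothed values stay close to 0 while the raw ones swing
```

Because the positive and negative swings cancel out in the running average, the smoothed sequence stays small, which is exactly the damping effect we want on oscillations.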
Now, the center red dot is the global minimum and is basically what the Algorithm/Machine Learning model is trying to reach. It takes these steps which get it closer to the minimum region.
The steps are represented by the blue lines here. We can see that the steps are oscillating. With each step, the model reaches closer to the minimum region.
Now, larger steps are acceptable at the start, but as we progress, we need smaller and smaller steps in order to reach the center region and stay there!
Hence, we need to decrease the magnitude of the steps we take over time, or else the model will overshoot the needed region and perform poorly.
“What causes these oscillations?” ~ you might ask. Recall the bias term we add in our “Wx + b” equation. In this picture, the vertical oscillations come from the updates to the bias, while the movement toward the minimum is driven by the weights.
If we slow down the updates for the bias, we can damp out the vertical oscillations, and if we keep updating the weights with larger values, we can still move fast toward the minimum point.
We know that the normal back-propagation process that we take is:
W = W - learning_rate * ∂W
b = b - learning_rate * ∂b
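As a quick sketch, the plain update above looks like this in code. The parameter and gradient values here are toy numbers; in a real network, ∂W and ∂b come from back-propagation.

```python
import numpy as np

learning_rate = 0.01
W = np.array([0.5, -0.3])   # toy weights
b = 0.1                     # toy bias
dW = np.array([0.2, -0.1])  # ∂W: gradient of the loss w.r.t. W (made up)
db = 0.05                   # ∂b: gradient of the loss w.r.t. b (made up)

# plain gradient-descent step: move against the raw gradient
W = W - learning_rate * dW
b = b - learning_rate * db
```

Notice that the step size here depends directly on the raw gradient magnitude, which is exactly what RMSProp changes.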
But in RMSProp, instead of taking ∂W and ∂b, we use the idea of exponentially weighted averages and find S∂w and S∂b first:
S∂W = β * S∂W + (1 - β) * (∂W)^2
S∂b = β * S∂b + (1 - β) * (∂b)^2
Here, ‘β’ is another hyper-parameter that takes values in [0, 1]; it controls how much weight the running average gives to past gradients. And now, to update the original W and b:
W = W - learning_rate * (∂W / (sqrt(S∂W) + ε))
b = b - learning_rate * (∂b / (sqrt(S∂b) + ε))
Here the squaring in S∂W and S∂b is element-wise, and ‘ε’ is epsilon, a small constant:
ε = 10^-8.
“Why are we adding ε?” ~ you might be asking. Well, suppose the square root of S∂W or S∂b comes out to be 0. Then, if we divide ∂W or ∂b by 0, we’ll get infinity, which is not what we want. Hence, to avoid such mistakes, we have ε, which is just there to make sure the division is never carried out by 0.
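Putting the two steps together, here is a minimal sketch of one full RMSProp update. The gradients are toy values, and the names S_dW and S_db mirror the S∂W / S∂b terms above; this is an illustration, not a production implementation.

```python
import numpy as np

learning_rate = 0.01
beta = 0.9       # controls the exponentially weighted average
epsilon = 1e-8   # keeps the division away from zero

W = np.array([0.5, -0.3])  # toy weights
b = 0.1                    # toy bias
S_dW = np.zeros_like(W)    # running average of squared weight gradients
S_db = 0.0                 # running average of squared bias gradient

def rmsprop_step(W, b, dW, db, S_dW, S_db):
    # accumulate the exponentially weighted average of squared gradients
    S_dW = beta * S_dW + (1 - beta) * dW ** 2
    S_db = beta * S_db + (1 - beta) * db ** 2
    # scale each parameter's step by the root of its accumulated square
    W = W - learning_rate * dW / (np.sqrt(S_dW) + epsilon)
    b = b - learning_rate * db / (np.sqrt(S_db) + epsilon)
    return W, b, S_dW, S_db

dW = np.array([0.2, -0.1])  # toy gradients from back-propagation
db = 0.05
W, b, S_dW, S_db = rmsprop_step(W, b, dW, db, S_dW, S_db)
```

A parameter whose gradients have been consistently large gets a large S value and therefore a smaller effective step, which is how the vertical oscillations get damped while progress along the flat direction stays fast.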
The blue line is normal gradient descent, and the green line is with RMSProp. We can see that it’s easier to reach the minimum region with RMSProp.
What we can observe from this graph is that RMSProp, the black line, goes almost straight down: no matter how small the gradients get, RMSProp scales the step size, so the algorithm reaches the minimum region faster than most.
RMSProp is a very powerful and popular optimizer. Only the Adam optimizer has surpassed it, making RMSProp one of the most used optimization algorithms in the era of Deep Learning.
Alright, I hope this article helps you! Let’s connect on Linkedin.
Training Taking Too Long? Use Mini Batch Gradient Descent
Use this optimization to speed up your training!
If you want to keep updated with my latest articles and projects, follow me on Medium. Here are some of my contact details:
Happy Learning. :)