GRADIENT DESCENT FOR DUMMIES

Lankinen
Sep 2, 2018 · 4 min read
Photo by eberhard grossgasteiger on Unsplash

Gradient Descent (GD) is the most important thing in machine learning because it modifies the parameters in the right direction (this is what makes an algorithm learn). In this article, I will explain GD as simply as possible.


Cost Function

Simply put, the idea of GD is to minimize a cost function (some people call it a loss function). The cost function measures the difference between the predictions made by an algorithm and the right values. Because GD minimizes the cost function, it pushes the predictions closer to the real values. This difference can be calculated in many ways, but probably the most popular method is mean squared error (MSE).

Mean squared error: MSE = (1/n) Σ (y − ŷ)²

Notations:
y = real value
ŷ = prediction of the function
n = number of predictions (the sum runs over all of them)

The idea in this equation is to first calculate the difference between the predicted y (= ŷ) and the real y. Then we just square that difference (the number multiplied by itself). The purpose of the first part of the equation, the 1/n and the sum, is to calculate the mean of these squared errors over all predictions.
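To make this concrete, here is a minimal sketch of how MSE could be computed with NumPy (the function name and the example numbers are just mine for illustration):

import numpy as np

def mean_squared_error(y, y_hat):
    # difference between real values and predictions, squared, then averaged
    return np.mean((y - y_hat) ** 2)

# example: three real values and three predictions
y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.0, 8.0])
print(mean_squared_error(y, y_hat))  # about 0.417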


Minimizing Cost Function

We can plot the cost function against the parameters of the model. The number of parameters can be in the millions, but for this example let's assume we have only two.

gradient descent

The idea of gradient descent is to find the lowest point of this function, where the mean squared error is closest to zero. And we want the mean squared error to be as low as possible, because then the algorithm predicts values that are almost the real y-values. Initially, random x and z values are chosen, which put you somewhere on the surface. In our model it is about (-50, 50), and the loss function at that point is approximately 200,000. Now GD looks in every direction and finds the direction where the downhill is steepest (mathematically, it is calculating the gradient). It takes a step in that direction, and from the new point it again calculates the gradient (the steepest downhill) and modifies the parameters using that knowledge. GD continues this until it reaches a point where there is no more downhill. In the picture, the blue line represents the path of GD. And yes, an algorithm could randomly choose the right parameters on the first try, but I have never seen that happen in real life.
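As a toy sketch of that loop (the data, starting point and step size below are made up by me, not from the picture), here is gradient descent fitting a line y = w·x + b by repeatedly stepping against the gradient of the MSE:

import numpy as np

# toy data that follows y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

w, b = -50.0, 50.0   # a "random" starting point, far from the right values
step_size = 0.01     # learning rate

for i in range(2000):
    error = (w * x + b) - y          # predictions minus real values
    grad_w = 2 * np.mean(error * x)  # gradient of MSE with respect to w
    grad_b = 2 * np.mean(error)      # gradient of MSE with respect to b
    w -= step_size * grad_w          # move against the gradient, i.e. downhill
    b -= step_size * grad_b

print(w, b)  # ends up close to 2 and 1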


For Pros (I recommend reading this even if you are a dummy)

The step size (some people call it the learning rate) is a parameter which tells you how far GD moves downhill at every step. After every step, GD looks for a new direction (where the steepest downhill is). For years the problem has been how to determine the step size, because if you choose too large a number GD might step over the lowest point, and if the learning rate (remember, this is the same thing as the step size, just a different name) is too low it moves slowly and takes too much time to find the lowest point (you can also say "it takes more time to converge"). However, this problem has been largely solved. Leslie N. Smith, in Cyclical Learning Rates for Training Neural Networks, explained how he found a great way to determine the learning rate. He started with a really small learning rate (e.g. 0.000001) and then increased it by multiplying it by 2 every iteration until it reached some big number (e.g. 5). The idea is that GD initially moves really slowly and eventually the steps grow too big. Then put the learning rate on the x-axis and the mean squared error (or some other cost function) on the y-axis and plot the graph.

Above is an example graph. Start by finding where the loss is lowest. In our case that is at a learning rate of about 10⁻¹. Then move a little bit to the left and take a point where the gradient (how fast the line is going down) is largest, and use that as your learning rate. Based on this graph, 0.003 would be a good step size for me. This method is really effective even though it isn't popular.
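As a rough sketch of the same idea (the toy data and the loop below are my own, not from Smith's paper): run gradient descent while doubling the learning rate every iteration, record the loss at each step, and then plot loss against learning rate.

import numpy as np
import matplotlib.pyplot as plt

# same toy data as before: y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

w, b = 0.0, 0.0
lr = 0.000001
lrs, losses = [], []

while lr < 5:
    error = (w * x + b) - y
    w -= lr * 2 * np.mean(error * x)   # one gradient step with the current learning rate
    b -= lr * 2 * np.mean(error)
    lrs.append(lr)
    losses.append(np.mean(((w * x + b) - y) ** 2))
    lr *= 2                            # double the learning rate every iteration

plt.plot(lrs, losses)
plt.xscale("log")
plt.xlabel("learning rate")
plt.ylabel("loss (MSE)")
plt.show()

The curve stays flat while the learning rate is tiny, drops as the steps start making real progress, and finally shoots up once the steps become too big, which is exactly the shape you read the step size from.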

This was a brief explanation of gradient descent. However, gradient descent has been developed a lot further. The basic version can help beginners practice and understand the core concepts, but if you are interested in using it for bigger problems you should read about Adam. Adam is a method which speeds up gradient descent drastically. Thanks for reading, and if someone knows a great article about Adam I can link it here. Just email me or write a comment.

~ Lankinen
