Optimizers for machine learning
Introduction
In this blog we are going to learn about optimizers, one of the most important parts of machine learning. I will try to explain every concept of optimizers in simple terms, with visualizations, so that even a beginner can understand.
This blog is limited to machine learning. After reading it, you will know how optimizers work.
What we are going to learn
- What is optimization
- Convex and non-Convex functions
- Optimizers
- Basic Calculus
- Gradient descent (Batch gradient descent)
- Drawback of Gradient descent
- Stochastic Gradient descent (SGD)
- Mini-batch Gradient descent
- Challenges while performing optimization
Optimization
First of all we have to know what optimization is. In mathematical terms, finding the global minimum or maximum of a (possibly non-linear) function is known as optimization. Whether we look for the minimum or the maximum is problem specific.
Convex and non-Convex functions
In machine learning and deep learning we mostly encounter these types of functions. Take any two random points on the graph of a function and draw the line segment (the shortest path) between them. If, for every such pair of points, the line segment lies on or above the graph, the function is called convex; otherwise it is non-convex.
A function can have more than one minimum and maximum. The absolute highest and lowest points of the function (including the end points) are known as the global maximum and global minimum; points that are the highest or lowest only within some neighborhood are known as local maxima and local minima.
A convex function has only one minimum (or maximum), and it is global; there are no separate local minima or maxima. In non-convex functions we can see both local and global minima and maxima. In deep learning we encounter more complex, non-convex functions.
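The chord (line-segment) definition above can be checked numerically. Here is a minimal sketch in Python; the example functions and test intervals are my own illustrative choices, not taken from the figures:

```python
# Numerically check the chord (line-segment) definition of convexity.
def chord_above_graph(f, a, b, steps=100):
    """True if the chord from (a, f(a)) to (b, f(b)) stays on or above
    the graph of f at every sampled point between a and b."""
    for i in range(steps + 1):
        t = i / steps
        x = a + t * (b - a)                 # point between a and b
        chord = (1 - t) * f(a) + t * f(b)   # height of the line segment
        if chord < f(x) - 1e-9:             # chord dips below the graph
            return False
    return True

convex = lambda x: x ** 2                   # parabola: convex everywhere
non_convex = lambda x: x ** 4 - 3 * x ** 2  # bumpy: not convex near 0

print(chord_above_graph(convex, -2.0, 2.0))       # True
print(chord_above_graph(non_convex, -0.5, 0.5))   # False
```

Sampling points like this only approximates the definition, but it is enough to see the difference between the two cases.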
Optimizers
There are algorithms which can help us find the global minimum of a function.
But why do we look for the global minimum and not the global maximum?
In machine learning and deep learning, most cost functions are defined so that we have to find a minimum; this is just a convention that keeps the math simple. For example, in linear regression we have to minimize the cost function ∑ (yᵢ − (w·xᵢ + b))². If you want, you can define your cost function so that you look for a global maximum instead, but then the algorithms below have to be modified to work for your problem.
Let's see the algorithms which help in the optimization (finding the global minimum) of a function.
- Gradient descent (Batch gradient descent)
- Stochastic Gradient descent (SGD)
- Mini-batch Gradient descent
Basic calculus
Before going into Optimizers we have to know some basics of calculus.
In mathematics every line and function has a slope, but what is a slope?
- The rate of change of y with respect to x is known as the slope.
- Assume x changes from x1 → x2 and y changes from y1 → y2.
- Then slope (m) = (y2 − y1)/(x2 − x1).
- The slope tells you how much y changes for a given change in x.
Let's recall the basics of slope which we learned in school.
→ How do we find the slope of a line?
- From the angle θ that the line makes with the x-axis:
m = tan θ
→ Another way, if you know two points (x1, y1) and (x2, y2) on the line:
then slope (m) = (y2 − y1)/(x2 − x1)
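The two-point formula can be sketched in a couple of lines of Python (the helper name `slope` and the example points are illustrative):

```python
def slope(p1, p2):
    """Slope of the line through points p1 = (x1, y1) and p2 = (x2, y2)."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)

print(slope((1, 2), (3, 8)))  # 3.0: y rises by 6 while x rises by 2
```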
But when it comes to a curved continuous function, the slope changes from point to point, so we can't use the formulas above directly; we need another way to find the slope of a continuous function. If you don't know what a continuous function is, I suggest you look it up on Wikipedia or Khan Academy.
Let's come back to our discussion: how do we find the slope of a continuous function at a particular point? For a continuous function we draw a tangent at the point where we want the slope, or derivative (don't worry, derivative is just the term used instead of slope in calculus). The angle that tangent makes with the x-axis gives the slope, or derivative, at that point. We can also define the derivative using limits. Assume we have a function f; when c changes to c + h, the function value f(c) changes to f(c + h):
c → c + h , f(c) → f(c + h)
then slope = (f(c + h) − f(c))/((c + h) − c) (our basic slope formula (y2 − y1)/(x2 − x1))
so slope = (f(c + h) − f(c))/h
What if h tends to zero without actually being zero? That is, h becomes very small, say 0.0001 or some other tiny value.
For a clearer picture, compare with our original slope formula:
(y2 − y1)/(x2 − x1) — (1)
Assume h = 0.0001, then slope = (f(c + h) − f(c))/((c + h) − c)
slope = (f(c + h) − f(c))/h = (f(c + 0.0001) − f(c))/0.0001 — (2)
Comparing (1) and (2):
x2 − x1 = 0.0001 , y2 − y1 = f(c + 0.0001) − f(c)
A tiny change in x produces a tiny change in y:
(for the small change c → c + 0.0001 the function changes very little, f(c) → f(c + 0.0001), so the slope at c + 0.0001 is almost equal to the slope at c)
In the figure above you can see that when h is very small, c + h is almost the same as the point c.
Therefore the slope (m) at point c = (f(c + h) − f(c))/h, in the limit h → 0 (numerically, you can take h as small as 0.0001).
In calculus, we call the slope the derivative (dy/dx) or gradient.
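The limit definition with a small h translates directly into code. A minimal sketch (the helper name `derivative` and the example f(x) = x² are illustrative choices):

```python
def derivative(f, c, h=1e-4):
    """Approximate the slope of f at c using a small step h,
    following (f(c + h) - f(c)) / h from the limit definition."""
    return (f(c + h) - f(c)) / h

f = lambda x: x ** 2               # exact derivative is 2x
print(derivative(f, 3.0))          # close to 6.0 (exact value: 6)
```

The smaller h is, the closer the approximation gets to the true derivative, up to floating-point limits.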
Every time we don’t use limits to calculate derivative of a function. For speed calculation of derivatives or slope of a functions we have to know some basic formulas in differentiation. I strongly suggest that if you don’t know differentiation please recall basic formulas of differentiation which we learned during our schooling because by using those formulas we can directly calculate derivative of a functions at any point.
Gradient descent
This is the father of all optimizers; most optimizers use the same logic as gradient descent. First of all, what is a gradient? A gradient is nothing but a slope, and descent means moving downwards. The idea behind gradient descent is:
Move the value downhill using the gradient at that value until the gradient becomes zero.
Let me explain with an example. In figure (1) we can see a convex function. Our goal is to find the 'w' where the loss function is minimum. From figure (1) we can see the loss is minimum at w₁₀, but in practice it is very hard to visualize at which value of 'w' the function is minimized, because every function behaves differently. Using gradient descent we can easily find the minimum of a function without any visualization. Once you understand the math behind it, you will see why it works.
Let's initialize 'w' to any random value; assume that random value is w₁ in the diagram above. At w₁ the loss L is not minimized, so we have to move w₁ downhill to where L is minimum. But the question is: how do we move?
step : w_new = w_old - α * (dL/dw)
The formula above is used to move the weight 'w'. Here w_new is the new value we move to, w_old is the current weight (that is, w₁), dL/dw is the derivative of the loss with respect to the weight at w_old, and α is the learning rate.
From the graph above you can see how the derivative (slope) behaves: when the tangent line is parallel to the x-axis its slope is zero; when the line is rising the slope is greater than zero, and when it is falling the slope is less than zero. The derivative of a function at a particular point follows the same behavior.
We calculate the derivative of the function from the tangent at that point, as we already saw in the basic calculus section above.
In figure (1) we initialized the weight w₁ on the left side of the graph, so the tangent there has a negative slope (less than zero), and the derivative of the loss at w₁ is negative. If we initialize the weight on the right side of the function, the derivative is positive.
step : w_new = w_old - α * (dL/dw)
Observe carefully: the game changer here is the derivative. When we initialize on the left side of the graph in figure (1), the derivative is a negative value. Substituting that negative value into the formula above, the minus sign in front of it makes the update positive:
w_new = w_old - α * (- some value)
w_new = w_old + α * (some value)
What is happening here? w_old is increasing, which means moving forward: in figure (1), w₁ moves to w₃ (w_new = w₃).
If we repeat this step, we move to another value of 'w', and as we move forward the derivative slowly tends to zero.
As we repeat this step many times, 'w' keeps getting updated, and at every step the shrinking derivative makes the step in 'w' smaller (long steps become small steps), because the derivative keeps getting closer to zero. After enough steps the derivative at w_old becomes zero. We know the derivative is zero when the tangent is parallel to the x-axis, so w_old is the minimum of the function. If you repeat the step again, nothing changes: w_new equals w_old.
w_new = w_old - α * (0) ; dL/dw = 0 → the slope at w_old is 0
w_new = w_old ; 'w' cannot update further (we have reached the minimum), because after reaching the minimum the derivative at w_old is always zero.
I suggest you try initializing the weight on the right side of the graph, so you can see how the step equation changes and how it converges.
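The update rule above can be sketched end to end on a toy loss. Assuming the illustrative loss L(w) = (w − 4)², whose minimum is at w = 4 (the starting point and α below are also my own choices, not from the figures):

```python
def dL_dw(w):
    return 2 * (w - 4)            # derivative of L(w) = (w - 4)**2

w = -3.0                          # "random" start, left of the minimum
alpha = 0.1                       # learning rate

for step in range(100):
    grad = dL_dw(w)               # negative while w < 4
    w = w - alpha * grad          # so w - alpha*grad moves w to the right

print(round(w, 4))                # ~4.0: the gradient has shrunk to ~0
```

Starting from the right side instead (say w = 10.0), the gradient is positive and the same rule moves w leftward to the same minimum.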
Now we have to know what α is. It is known as the learning rate: in the gradient descent step we multiply the derivative (gradient) by some value α to move 'w' at some speed. Let's see what happens when α is large.
'α' is a hyper-parameter; we have to find a good value of α by experimenting with different values (small values such as 0.1 or 0.01 are common starting points). To reach the global minimum in less time, we can decrease α at each update step: at first it takes big steps, and as we move forward the step size slowly decreases until we reach the minimum. This is known as a learning-rate schedule.
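A simple decaying schedule can be sketched as follows; the decay rule αₜ = α₀ / (1 + decay·t) is one common illustrative choice, not the only one:

```python
alpha0, decay = 1.0, 0.5          # initial rate and decay strength

for t in range(5):
    alpha_t = alpha0 / (1 + decay * t)   # alpha shrinks every step
    print(t, round(alpha_t, 3))          # 1.0, 0.667, 0.5, 0.4, 0.333
```

Early steps use a large α (big moves), and later steps use a small α (fine adjustment near the minimum).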
Drawback of Gradient descent
Assume we have 1 million data points in a dataset. Gradient descent uses all the data points to compute the gradient (dL/dw) for each update of 'w', and evaluating 1 million points for every single step is computationally very expensive. This is the main disadvantage of gradient descent. In practice the data we get can be very large, on the order of billions or trillions of points (big data), and we can't run plain gradient descent on big data. If the dataset is small, gradient descent works very well, but on big data it is computationally too expensive.
Stochastic Gradient descent
We know that if our dataset is large, gradient descent becomes computationally complex and expensive to converge to the minimum of a function. This is where Stochastic Gradient Descent comes into the picture. SGD is an approximation of gradient descent: it uses the same idea, except for how it calculates the gradient. In gradient descent, each update of the weight 'w' computes the gradient (derivative) using all the data points; in stochastic gradient descent, each update picks one random data point, computes the derivative on it, and updates 'w'.
w_new = w_old - α * (dL/dw) ; dL/dw → calculated from one random data point
Because each update of 'w' uses a different random data point for dL/dw, the path of 'w' follows a zig-zag motion.
Why does it zig-zag? While updating (moving) 'w', we take a different data point each time to compute the derivative, so the derivative is different at each update. This makes 'w' sometimes overshoot and sometimes take only a little step, like zig-zag steps. Because of this we may reach the global minimum, or only get close to it, and it takes more steps (epochs) compared to gradient descent.
You may have a doubt: how can taking one random point per weight update converge to the minimum? Observe carefully: our aim is to move towards the global minimum. The sign of the derivative tells us which side to move, and the magnitude of the derivative (together with α) sets the step size. In gradient descent we use all the data points to compute the gradient, so it takes well-directed steps and reaches the minimum in fewer epochs. In stochastic gradient descent each step is based on one random data point, so individual steps are noisy, and the zig-zag motion comes from the randomness of the chosen point.
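SGD's one-random-point update can be sketched on a toy 1-D problem. Assuming illustrative data generated from y = 3x, so the true weight is 3 (α, the dataset, and the step count are also arbitrary choices):

```python
import random

random.seed(0)
data = [(x, 3.0 * x) for x in range(1, 11)]   # (x, y) pairs with y = 3x

w, alpha = 0.0, 0.001
for step in range(500):
    x, y = random.choice(data)                # ONE random point per update
    grad = -2 * x * (y - w * x)               # dL/dw for that single point
    w = w - alpha * grad

print(round(w, 2))                            # close to 3.0, via a noisy path
```

Each step only looks at one (x, y) pair, so updates are cheap, but the path of w wanders before settling near the true weight.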
Mini-Batch Gradient descent
Mini-batch gradient descent is the same as stochastic gradient descent, except that instead of taking one random data point from the dataset to compute the derivative (gradient), it takes a random subset (mini-batch) of data points at each update of the weight 'w'. As a result it has less zig-zag motion compared to SGD.
w_new = w_old - α * (dL/dw) ; dL/dw → calculated on a subset of data points
In the figure above you can see how mini-batch gradient descent converges towards the minimum. It takes fewer steps to reach the global minimum than SGD because the derivative (dL/dw), averaged over a subset of points, is a better estimate of the true gradient. There is still some zig-zag, because each update uses a different random subset, but the path is smoother than SGD's, where a single random point produces much noisier steps.
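Mini-batch updates differ from the single-point SGD idea only in averaging the gradient over a small random subset. A sketch under the same illustrative y = 3x assumption (batch size, α, and step count are my own choices):

```python
import random

random.seed(0)
data = [(x, 3.0 * x) for x in range(1, 11)]   # same toy data, y = 3x

w, alpha, batch_size = 0.0, 0.001, 4
for step in range(500):
    batch = random.sample(data, batch_size)   # random subset of the data
    grad = sum(-2 * x * (y - w * x) for x, y in batch) / batch_size
    w = w - alpha * grad                      # averaged, less noisy gradient

print(round(w, 2))                            # close to 3.0, smoother path
```

Averaging over the batch reduces the variance of each step, which is exactly the "less zig-zag" behavior described above.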
What are the challenges we face while performing optimization?
While performing gradient descent, SGD, or mini-batch SGD, there is a chance that our weight will get stuck at a saddle point.
A saddle point is a point where the function is neither a maximum nor a minimum, yet the gradient there is zero. You can see saddle points in non-convex functions. If we run gradient descent, SGD, or mini-batch SGD on a non-convex function, the weight 'w' will converge, but it is hard to say whether we are at the global minimum, a local minimum, or a saddle point; it depends heavily on the initialization of the weights. If we initialize the weights near the global minimum we reach the global minimum; otherwise we may end up at a local minimum or a saddle point.
There are techniques that help to overcome saddle points, but it is harder if we get stuck in a local minimum. Researchers have debated how to escape local minima in very high-dimensional spaces, but it is argued that the probability of a true local minimum in very high dimensions is very small, so in practice we mostly encounter saddle points. In machine learning we usually optimize convex functions, so we don't need to worry about saddle points or local minima; but in deep learning we mostly encounter non-convex functions, and there we will see more optimizers that help to escape saddle points and speed up convergence to the minimum.
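The saddle-point behavior can be seen on the classic example f(w) = w³, whose derivative 3w² is zero at w = 0 even though w = 0 is neither a minimum nor a maximum (the starting point and α below are illustrative):

```python
def grad(w):
    return 3 * w ** 2             # derivative of f(w) = w**3

w, alpha = 1.0, 0.1
for step in range(1000):
    w = w - alpha * grad(w)       # gradient is >= 0 on both sides of 0

print(round(w, 4))                # crawls toward 0 and stalls at the saddle
```

Starting from w = 1, the updates shrink w towards 0 and then effectively stop, because the gradient vanishes at the saddle even though the function keeps decreasing for negative w.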