Optimizers for machine learning

Mohammad Rafi
Published in Analytics Vidhya
12 min read · Jun 30, 2021


Introduction

In this blog we are going to learn about optimizers, one of the most important parts of machine learning. I will try to explain every concept of optimizers in simple terms and with visualizations, so that even a beginner can understand.

This blog is limited to machine learning; after reading it, you will know how optimizers work.

What we are going to learn

  1. What is optimization
  2. Convex and non-Convex functions
  3. Optimizers
  4. Basic Calculus
  5. Gradient descent (Batch gradient descent)
  6. Drawback of Gradient descent
  7. Stochastic Gradient descent (SGD)
  8. Mini-batch Gradient descent
  9. Challenges while performing optimization

Optimization

First of all, we have to know what optimization is. In mathematical terms, finding the global minimum or maximum of a function is known as optimization. Whether we look for the minimum or the maximum is problem specific.

You can see the function above is at a maximum at x = -1 and a minimum at x = 1. Optimization is nothing but finding the value of x where the function f(x) is minimum or maximum.

Convex and non-Convex functions

In machine learning and deep learning we mostly encounter these types of functions. Take any two random points on the graph of a function and draw the line segment (the shortest path) between them. If, for every such pair of points, the line segment lies above (or on) the graph, the function is called a convex function; otherwise it is a non-convex function.

On the left, you can see the line segment between the two points lies above the graph of f(x); on the right, the line segment is not completely above the graph. Therefore the function on the left is a convex function and the function on the right is a non-convex function.

A function can have more than one minimum and maximum. The absolute highest and lowest points of the function (including the end points) are known as the global maximum and global minimum; points that are only the highest or lowest within their own neighborhood are known as local maxima and local minima.

For convex functions there is only one global minimum (or maximum) and no separate local minima or maxima, but in non-convex functions we can see both local and global minima and maxima. In deep learning we encounter more complex non-convex functions.
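To make the line-segment (chord) definition concrete, here is a minimal sketch, assuming NumPy is available; the function `looks_convex` and the test functions are my own illustrative names, and the random midpoint check is only a heuristic, not a proof of convexity:

```python
import numpy as np

def looks_convex(f, lo, hi, n_pairs=1000, seed=0):
    """Heuristic check of the chord condition f((a+b)/2) <= (f(a)+f(b))/2
    on random pairs of points. A single failure proves f is non-convex
    on [lo, hi]; passing the check only suggests convexity."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(lo, hi, n_pairs)
    b = rng.uniform(lo, hi, n_pairs)
    return bool(np.all(f((a + b) / 2) <= (f(a) + f(b)) / 2 + 1e-12))

print(looks_convex(lambda x: x**2, -5, 5))                  # True  (convex)
print(looks_convex(lambda x: np.sin(x) + 0.1 * x, -5, 5))   # False (non-convex)
```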

Optimizers

There are algorithms that can help us find the global minimum of a function.

But why should we find the global minimum and not the global maximum?

In machine learning and deep learning, most of our cost functions are defined so that we have to find a minimum; this is just a convention that keeps the math simple. For example, in linear regression we have to minimize the cost function ∑ (yᵢ - (w·xᵢ + b))². If you define your cost function so that it has to be maximized instead, the algorithms below have to be modified to work for your problem.
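As a concrete illustration of a cost we minimize, here is a small sketch of the linear-regression squared-error cost and its gradient; the function names and the toy data are my own, not from the post:

```python
import numpy as np

def mse_cost(w, b, x, y):
    """Cost to minimize: sum of squared residuals for y ~ w*x + b."""
    return np.sum((y - (w * x + b)) ** 2)

def mse_grad(w, b, x, y):
    """Analytic gradient of the cost with respect to w and b."""
    r = y - (w * x + b)                       # residuals
    return -2 * np.sum(r * x), -2 * np.sum(r)

# Toy data generated from y = 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2 * x + 1 + 0.05 * rng.normal(size=50)
print(mse_cost(0.0, 0.0, x, y), mse_grad(0.0, 0.0, x, y))
```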

Now, let's see the algorithms that help in the optimization (finding the global minimum) of a function.

  • Gradient descent (Batch gradient descent)
  • Stochastic Gradient descent (SGD)
  • Mini-batch Gradient descent

Basic calculus

Before going into optimizers, we have to know some basics of calculus.

In mathematics, every line and function has a slope, but what is a slope?

  • The rate of change of y with respect to x is known as the slope.
  • Let's assume x changes from x1 → x2 and y changes from y1 → y2.
  • Then slope (m) = (y2 - y1) / (x2 - x1).
  • The slope tells you how much y changes for a given change in x.

Let's recall the basics of slope, which we learned in school.

→ How do we find the slope of a line?

  • the slope (m) comes from the angle θ that the line makes with the x-axis,

where m = tanθ ; θ = the angle made with the x-axis

→ Another way to find the slope of a line: if you know two points on the line, (x1, y1) and (x2, y2),

then slope (m) = (y2 - y1) / (x2 - x1)

But when it comes to a continuous function (a curve), the slope changes from point to point, so we can't use the formulas above; we need another way to find the slope of a continuous function. If you don't know what a continuous function is, I suggest you look it up on Wikipedia or Khan Academy.

Let's come back to our discussion: how do we find the slope of a continuous function at a particular point? For a continuous function we draw a tangent line at the point where we want to find the slope, or derivative (don't worry, derivative is just the term used for slope in calculus). The slope of that tangent line with respect to the x-axis is the slope, or derivative, at that point. We can also define the derivative using limits: assume we have a function f and a point c; when c changes to c + h, the function value f(c) changes to f(c + h),

c → c + h , f(c) → f(c + h)

then slope = (f(c + h) - f(c)) / ((c + h) - c) (our basic slope formula (y2 - y1) / (x2 - x1)),

which gives slope = (f(c + h) - f(c)) / h

What if h tends towards zero but never exactly reaches it? That means h is a very small change; we can take it to be 0.0001 or any other tiny value.

For a clearer understanding, let's compare with our original formula for the slope:

slope = (y2 - y1) / (x2 - x1) — (1)

Assume h = 0.0001. Then

slope = (f(c + h) - f(c)) / ((c + h) - c) = (f(c + h) - f(c)) / h = (f(c + 0.0001) - f(c)) / 0.0001 — (2)

Comparing (1) and (2):

x2 - x1 = 0.0001 , y2 - y1 = f(c + 0.0001) - f(c)

A tiny change in x produces a correspondingly tiny change in y: when c changes slightly to c + 0.0001, the function changes very little, from f(c) to f(c + 0.0001), so the slope at c + 0.0001 is almost equal to the slope at c.

In the figure above you can see that when h is very small, c + h is almost equal to the point c.

Therefore the slope (m) at point c = (f(c + h) - f(c)) / h, in the limit as h → 0 (in practice you can take a tiny value such as 0.0001).

In calculus, we call the slope the derivative (dy/dx) or gradient.
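As a quick sanity check of the definition above, here is a minimal sketch (my own illustrative code, not from the post) that approximates the derivative with a small step h:

```python
def numerical_derivative(f, c, h=1e-4):
    """Approximate the slope of f at c using a small step h,
    exactly as in the formula above: (f(c + h) - f(c)) / h."""
    return (f(c + h) - f(c)) / h

# Example: f(x) = x**2 has derivative 2x, so the slope at x = 3 should be ~6.
print(numerical_derivative(lambda x: x**2, 3.0))  # ~6.0001
```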

We don't use limits every time we calculate the derivative of a function. To calculate derivatives (slopes) quickly, we need some basic differentiation formulas. I strongly suggest that if you don't remember differentiation, you revisit the basic formulas we learned in school, because using those formulas we can directly calculate the derivative of a function at any point.

Gradient descent

This is the father of all optimizers; most optimizers use the same logic that sits inside gradient descent. First of all, what is a gradient? A gradient is nothing but a slope, and descent means moving downwards. The idea behind gradient descent is:

Move the value downhill, using the gradient at that value, until the gradient becomes zero.

figure(1)

Let me explain with an example. In figure(1) we can see a convex function, and our goal is to find the 'w' where the loss function is minimum. From figure(1) we know the loss function is minimum at w₁₀, but in practice it is very hard to visualize at which value of 'w' the function is minimized, because every function behaves differently. Using gradient descent we can easily find the minimum of a function without any visualization. Once you understand the math behind it, you will see why it works. Here is how it works.

Let's initialize 'w' with a random value; assume that random value is w₁ in the diagram above. At w₁ the loss function L is not minimized, so we have to move w₁ downhill to where L is minimum. But the question is: how do we move?

step: w_new = w_old - α * (dL/dw)

The above formula is used to move the weight 'w'. Here 'w_new' is the new value we move to, 'w_old' is the current weight (that is, w₁), dL/dw is the derivative of the loss function with respect to the weight, evaluated at w_old, and α is the learning rate.

From the graph above, you can see how the derivative, or slope, behaves: when the tangent line is parallel to the x-axis its slope is zero; when the line goes upwards its slope is greater than zero, and when it goes downwards its slope is less than zero. The derivative of a function at a particular point follows the same behaviour.

We calculate the derivative of a function from the tangent at that point, as we saw in the basic calculus section above.

In figure(1) we initialized the weight w₁ on the left side of the graph, so the tangent at that point has a negative slope (less than zero); the derivative of the function with respect to w at w₁ is therefore negative. If we had initialized the weight on the right side of the function, its derivative would be greater than zero.

step: w_new = w_old - α * (dL/dw)

Observe carefully: the game changer here is the derivative. When we initialize on the left side of the graph in figure(1), the derivative is a negative value. If we plug that negative value into the formula above, the minus sign in front of it makes the term positive, and the update becomes

w_new = w_old - α * (- some value)

w_new = w_old + α * (some value)

What is happening here? You can see that 'w_old' is increasing, i.e. moving forward; in figure(1), w_new = w₃, so w₁ moves to w₃.

If we repeat this step once again, we move to some other value of 'w'; as we move forward, the derivative slowly tends to zero.

As we repeat this step multiple times, 'w' keeps getting updated (keeps moving), and as it moves forward the derivative slowly tends to zero. At every step the shrinking derivative makes the step in 'w' smaller (long steps turn into small steps), because the derivative gets closer and closer to zero. After enough steps the derivative at w_old becomes zero, and we know the derivative is zero when the tangent is parallel to the x-axis; that means w_old is the minimum of the function. If you repeat the step again, nothing changes and w_new simply equals w_old:

w_new = w_old + α * (0) ; dL/dw = 0 → the slope at w_old is '0'

w_new = w_old ; w_new equals w_old, so 'w' cannot be updated any further (we have reached the minimum). There are no further updates because the derivative at w_old stays zero once we reach the minimum.

I suggest you try initializing the weight on the right side of the graph, so you can see how the update equation changes and how it still converges.
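Putting the update rule into a loop gives the whole algorithm. Here is a minimal sketch on a toy one-dimensional loss; the function names, the toy loss, and the hyper-parameter values are my own, not from the post:

```python
def gradient_descent(dL_dw, w_init, lr=0.1, n_steps=100):
    """Plain (batch) gradient descent: w_new = w_old - lr * dL/dw."""
    w = w_init
    for _ in range(n_steps):
        w = w - lr * dL_dw(w)
    return w

# Toy loss L(w) = (w - 3)**2 with derivative 2*(w - 3); its minimum is at w = 3.
print(gradient_descent(lambda w: 2 * (w - 3), w_init=-10.0))  # ~3.0
```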

(In the figure, J(w) denotes our loss function, the same loss we saw in figure(1).)

Next we have to understand what α is. It is known as the learning rate: in the gradient descent step we multiply the derivative (gradient) by this value α so that 'w' moves with some speed. Let's see what happens when α is large.

Here you can see that when α is large, there is a chance the update overshoots and moves away from the minimum, and it starts to oscillate around the minimum because of the gradient. If α is small, it takes more time to reach the minimum.

'α' is a hyper-parameter; we have to find a good value of 'α' by experimenting with different values (a common choice when using plain gradient descent is to start around 1). To reach the global minimum in less time, make α decrease at every update step: at first it takes big steps, and as it moves forward the step size slowly decreases until it reaches the minimum. This is known as a learning rate schedule.
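The post doesn't name a specific schedule, so as one common example here is a step-decay sketch (my own assumption): the learning rate is halved every fixed number of update steps.

```python
def step_decay_lr(initial_lr, step, decay=0.5, every=20):
    """Step-decay schedule: multiply the learning rate by `decay`
    once every `every` update steps."""
    return initial_lr * (decay ** (step // every))

for step in (0, 19, 20, 40, 60):
    print(step, step_decay_lr(1.0, step))
# prints 1.0, 1.0, 0.5, 0.25, 0.125
```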

Drawback of Gradient descent

Assume we have 1 million data points in a dataset. To compute the gradient (dL/dw) for each update of 'w', we have to use all the data points, and going over 1 million points for every single step is computationally very expensive; this is the main disadvantage of gradient descent. In real applications the data can be huge, on the order of billions or trillions of records (big data), and we can't use gradient descent on big data. If our dataset is small, gradient descent works very well, but on big data it is computationally too expensive.

Stochastic Gradient descent

We know that if our dataset is large, gradient descent is computationally complex and expensive to converge to the minimum of a function; this is where stochastic gradient descent comes into the picture. SGD is an approximation of gradient descent: it uses the same idea as gradient descent except for how the gradient is calculated. In gradient descent, while updating the weight 'w' we calculate the gradient (derivative) at each step using all the data points, but in stochastic gradient descent, instead of using all data points, we take one random point, calculate the derivative, and update 'w'.

w_new = w_old - α * (dL/dw) ; dL/dw → calculated from one random data point

At each update step, dL/dw is computed from one random data point in the dataset; because of this, the updates of 'w' follow a zig-zag path.

figure(2)

Why does it follow a zig-zag path? While we are updating (moving) 'w', we take a different data point each time to calculate the derivative (gradient), so the derivative at each update of 'w' is different. This makes 'w' sometimes overshoot and sometimes take only a tiny step, like zig-zag steps. Because of this we may reach the global minimum, or only get close to it, but it takes more steps (epochs) compared to gradient descent.

The left plot is a 3-dimensional graph of a convex function; seen from the top, it looks like concentric circles. In this view you can see how gradient descent and stochastic gradient descent converge towards the minimum.

You may wonder how taking one random point at each weight update can converge to the minimum. Observe carefully: our aim is to move towards the global minimum. The sign of the derivative tells us which side to move, and the magnitude of the derivative (scaled by α) sets the step size. In gradient descent we take all data points to calculate the gradient, so it takes long, well-aimed steps and reaches the minimum in fewer epochs. In stochastic gradient descent we take one data point at random, so each update's derivative is noisier and smaller, and the zig-zag motion comes from the randomness of the data point.
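Here is a minimal SGD sketch for the linear-regression cost from earlier; the function name, hyper-parameters, and toy data are my own illustrative choices:

```python
import numpy as np

def sgd_linear_regression(x, y, lr=0.05, epochs=50, seed=0):
    """Stochastic gradient descent on the squared-error loss of y ~ w*x + b.
    The gradient for each update comes from a single (random) data point."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):       # visit points in random order
            err = (w * x[i] + b) - y[i]
            w -= lr * 2 * err * x[i]            # d/dw of (err ** 2)
            b -= lr * 2 * err                   # d/db of (err ** 2)
    return w, b

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = 2 * x + 1 + 0.05 * rng.normal(size=200)
print(sgd_linear_regression(x, y))  # roughly (2.0, 1.0)
```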

Mini-Batch Gradient descent

Mini-batch gradient descent is the same as stochastic gradient descent, except that instead of taking one random data point from the dataset to calculate the derivative (gradient), it takes a small subset of data points at each update of the weight 'w'. Because of this it has less zig-zag motion compared to SGD.

w_new = w_old - α * (dL/dw) ; dL/dw → calculated on a subset (mini-batch) of data points

You can see in the figure above how mini-batch gradient descent converges towards the minimum. It takes fewer steps to reach the global minimum than SGD because the gradient (dL/dw) computed over a mini-batch is a more reliable estimate than the single-point gradient used in SGD; the remaining zig-zag comes from the randomness of the subsets used to calculate the gradient, whereas SGD, using a single random point, takes noisier zig-zag steps.
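A minimal mini-batch variant of the sketch above, with the gradient averaged over a small random subset at each update (again, names and hyper-parameters are my own):

```python
import numpy as np

def minibatch_gd(x, y, lr=0.05, epochs=50, batch_size=16, seed=0):
    """Mini-batch gradient descent for y ~ w*x + b: each update uses the
    average gradient over a small random subset (mini-batch) of the data."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            err = (w * x[batch] + b) - y[batch]
            w -= lr * 2 * np.mean(err * x[batch])   # averaged d/dw
            b -= lr * 2 * np.mean(err)              # averaged d/db
    return w, b

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = 2 * x + 1 + 0.05 * rng.normal(size=200)
print(minibatch_gd(x, y))  # roughly (2.0, 1.0)
```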

Challenges while performing optimization

While performing gradient descent, SGD, or mini-batch gradient descent, there is a chance that our weight will get stuck at a saddle point.

In 2-dimensional space a saddle point looks like this;
in 3-dimensional space it looks like this:

A saddle point is a point where the function is neither at a maximum nor a minimum, yet the gradient at that point is zero. You see saddle points in non-convex functions. If we perform gradient descent, SGD, or mini-batch SGD on a non-convex function, the weight 'w' will converge, but it is hard to say whether we end up at the global minimum, a local minimum, or a saddle point; it depends largely on the initialization of the weights. If we initialize the weights near the global minimum we reach the global minimum; otherwise we reach either a local minimum or a saddle point.
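To make the "gradient is zero but it's not a minimum" idea concrete, here is a tiny sketch (my own example, not from the post) using the classic saddle f(x, y) = x² - y²:

```python
import numpy as np

# Classic saddle: f(x, y) = x**2 - y**2. Its gradient (2x, -2y) is zero at
# the origin, yet the origin is neither a minimum nor a maximum.
def grad(x, y):
    return np.array([2 * x, -2 * y])

print(grad(0.0, 0.0))     # [0. 0.] -> a gradient-based update would stop here
print(0.5**2 - 0.0**2)    # 0.25: f increases along x (origin looks like a minimum)
print(0.0**2 - 0.5**2)    # -0.25: f decreases along y (origin looks like a maximum)
```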

There are some techniques that help escape saddle points, but it is harder if we get stuck at a local minimum. Researchers have debated how to overcome local minima in very high-dimensional spaces, but they argue that the probability of encountering a true local minimum in very high dimensions is very small, so we mostly encounter saddle points. In machine learning we usually optimize convex functions, so we don't need to worry about saddle points or local minima, but in deep learning we mostly encounter non-convex functions; there we will see more optimizers that help overcome saddle points and speed up convergence to the minimum.
