Introduction To Optimizers

Shakil Ahmed Sumon
2 min read · Feb 27, 2019


Image collected from machinelearningmastery.com

The optimizer is a term deep learning engineers run into all the time, and people are often confused about when to use which one. In this series of blog posts, we will study what an optimizer is and go through the different types of optimizers. By the end of the series, I hope we will have an overview of the optimizers deep learning engineers use today and an understanding of their key differences.

In any deep learning model, we introduce a loss function, which measures how far the predictions are from the targets. Guided by the error measured by the loss function, we make small changes to the model parameters (i.e. weights and biases) with the goal of making the predictions less erroneous. But how do we know which parameters to change, and by how much? This is where optimizers come in.

In very simple terms, optimizers minimize the loss function. The job of the optimizer is to change the trainable parameters in a way that reduces the loss, and the loss function tells the optimizer whether it is moving in the right direction.
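To make this concrete, here is a minimal sketch of a single training loop in plain NumPy. The tiny linear model, the toy data, and the hand-coded update rule are assumptions made purely for illustration, not part of any particular framework:

```python
import numpy as np

# Toy data, assumed for illustration: inputs x and targets y
x = np.array([[1.0], [2.0], [3.0]])
y = np.array([[2.0], [4.0], [6.0]])

# Trainable parameters of a tiny linear model: predictions = x @ w + b
w = np.array([[0.5]])   # weight
b = np.array([0.0])     # bias
learning_rate = 0.01

for step in range(200):
    # Forward pass: compute predictions and the loss (mean squared error)
    predictions = x @ w + b
    error = predictions - y
    loss = np.mean(error ** 2)

    # Gradients of the loss with respect to each parameter
    grad_w = 2.0 * x.T @ error / len(x)
    grad_b = 2.0 * np.mean(error)

    # The optimizer step: nudge each parameter a little in the
    # direction that reduces the loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

The last two lines are the optimizer: everything before them only measures how wrong the model is, while the optimizer decides how to change w and b in response.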

The optimizers people use can be categorized into two major groups:
1. First order optimization algorithms.
2. Second order optimization algorithms.

First order optimization algorithms minimize the loss function using first-order partial derivatives. The first-order derivative tells us whether the function is increasing or decreasing at a particular point.

The vector of first-order partial derivatives is called the gradient. The gradient of a scalar-valued function defines a vector field over the parameter space; for a vector-valued function, the first-order derivatives are collected in a Jacobian matrix.
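Written out, for a loss L that depends on parameters θ₁, …, θₙ (notation chosen here just for illustration), the gradient stacks all the first-order partial derivatives into a single vector:

```latex
\nabla L(\theta) =
\left[
  \frac{\partial L}{\partial \theta_1},\;
  \frac{\partial L}{\partial \theta_2},\;
  \dots,\;
  \frac{\partial L}{\partial \theta_n}
\right]^{\top}
```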

Almost all the optimizers in use today fall into this category, because gradients are comparatively fast and easy to compute.
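The simplest first-order method, and the one we will study next, is gradient descent. As a sketch, its update rule looks like this, where θ denotes the trainable parameters, η the learning rate, and ∇L(θ) the gradient of the loss (symbols chosen here for illustration):

```latex
\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t)
```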

Second order optimization algorithms use second-order partial derivatives, collected in the Hessian matrix, to minimize the loss function. Since computing (and inverting) the Hessian is expensive, people tend to avoid these methods unless the second derivatives are cheap to obtain.
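For a concrete picture of why this is costly, here is the update used by Newton's method, a classic second-order algorithm, written in the same notation as above:

```latex
\theta_{t+1} = \theta_t - H_t^{-1} \, \nabla L(\theta_t),
\qquad H_t = \nabla^2 L(\theta_t)
```

For a model with n parameters, storing the n × n Hessian takes O(n²) memory and inverting it takes roughly O(n³) time, which quickly becomes impractical for networks with millions of parameters.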

That is it for today. We will study gradient descent in the next post of this series.
