From the Edge: Choosing the Right Optimizer

Jake Tauscher
3 min read · Aug 24, 2020


This blog is part of a series on recent academic papers in the AI/ML community. By understanding what experts are spending their time researching, we will get a sense of the current limits and the future of the AI/ML world!

Researchers from the University of Tübingen benchmarked the performance of a set of “optimizer” functions across a wide array of modeling tasks.

Why is this interesting?

First, what is an optimizer? Well, in training a neural network, you have two key decisions: your loss function (also called a cost function) and your optimizer. The loss function determines how you grade the performance of your model; the model updates its parameters to make the loss as small as possible. But what does ‘update its parameters’ actually mean? This is where the optimizer comes in: it determines how those parameters change at every training step.
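To make that concrete, here is a minimal sketch of those two decisions in PyTorch. The framework choice, the tiny model, and the fake batch of data are all assumptions for illustration, not anything from the paper:

```python
import torch
import torch.nn as nn

# A toy classifier with made-up sizes, just to illustrate the two decisions.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))

loss_fn = nn.CrossEntropyLoss()                   # the loss function: how the model is graded
optimizer = torch.optim.Adam(model.parameters())  # the optimizer: how the parameters get updated

# One training step on a fake batch of data.
x, y = torch.randn(32, 20), torch.randint(0, 3, (32,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)  # grade the current predictions
loss.backward()              # compute gradients of the loss with respect to each parameter
optimizer.step()             # the optimizer decides how far to move each parameter
```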

Choosing the right optimizer is often seen as more of an art than a science. There are several well-regarded optimizers in the space, but as these researchers note, the choice of which optimizer to use is often not scientific. Instead, data scientists tend to have personal preferences that they apply pretty much universally. It is easy to understand why: there are hundreds of options to choose from, and there is debate within the machine learning community about the relative merits of each. It would be very difficult to stay up to date on every optimizer that is available.

So, these researchers tested optimizers across a broad spectrum of modeling tasks, to see if any of them proved superior!

Tell me the details!

These researchers evaluated fourteen popular optimizers across eight deep learning problems. These problems were primarily image classification, but they also included two image generation models, a natural language processing problem, and fitting a dataset that roughly follows a quadratic function. They believe this to be the “most comprehensive” test of optimizers ever conducted.

So, what did they (and we) learn?

Basically, there was no one algorithm that was best across all situations — that would have been the easy answer. Instead, there were some algorithms that varied wildly across different problems, performing very well in some instances, and very poorly in others. However, there were also optimizers that generally performed well across all these problems (for example, the popular Adam optimizer).

Therefore, if you are going for state-of-the-art results on a problem, you may have to try many optimizers. But for just basic modeling, you should be fine using any of a few popular optimizers (Adam, RAdam, or AMSGrad, to name a few).
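For example, all three of those optimizers are available in recent versions of PyTorch; this is just a sketch of how you might set them up, with a stand-in model and default learning rates rather than tuned values:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for whatever network you are actually training

# The three optimizers named above; note that AMSGrad lives in PyTorch as a flag on Adam.
candidates = {
    "Adam":    torch.optim.Adam(model.parameters(), lr=1e-3),
    "AMSGrad": torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True),
    "RAdam":   torch.optim.RAdam(model.parameters(), lr=1e-3),
}
```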

Parts of Data Science are “Guess and Check”

More to the point, this experiment reflects a broader truth about AI modeling. Many of the choices you make in building a neural network (number of layers, number of neurons, hyperparameters, cost function, optimizer) do not have universally optimal choices. Every dataset is unique, and to achieve state-of-the-art modeling results, you often need to manually test many different configurations of your model and your training setup to see what provides the best results.
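As a rough illustration of that guess-and-check loop, here is a sketch that tries a few configurations and keeps whichever scores best. The architecture, the learning rates, and the train_and_evaluate stand-in are all hypothetical; a real version would train on your data and score a held-out validation set:

```python
import itertools
import torch
import torch.nn as nn

def build_model(hidden_size: int) -> nn.Module:
    # Toy architecture; the layer sizes are illustrative, not recommendations.
    return nn.Sequential(nn.Linear(20, hidden_size), nn.ReLU(), nn.Linear(hidden_size, 3))

def train_and_evaluate(model: nn.Module, lr: float) -> float:
    # Hypothetical stand-in: a real version would train on your data and
    # return a validation loss. Here it just fits one random batch.
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    x, y = torch.randn(64, 20), torch.randint(0, 3, (64,))
    for _ in range(10):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()

# Guess and check: try each combination and keep whichever scores best.
best = min(
    ((h, lr, train_and_evaluate(build_model(h), lr))
     for h, lr in itertools.product([32, 64, 128], [1e-2, 1e-3])),
    key=lambda cfg: cfg[2],
)
print(f"best config: hidden={best[0]}, lr={best[1]}, loss={best[2]:.3f}")
```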

This can be really tedious and time-consuming, which is why a whole class of start-ups is trying to automate this process for data scientists!

And you can read the paper yourself: arXiv:2007.01547v2
