#08 Gradient Descent Algorithm Optimization Recipes #ShortcutML

Optimization Algorithms for Gradient Descent

Akira Takezawa
Coldstart.ml
5 min read · Mar 18, 2019


Hola! Welcome to the #ShortcutML series! An ML cheat note for everyone!

Menu

  1. The goal of all ML Implementation
  2. Bias vs. Variance in Error Function
  3. Underfitting vs. Overfitting
  4. Gradient Descent Methods List
  5. Optimization Algorithms of GD List

1. The goal of all ML Implementation

https://www.youtube.com/watch?v=b4Vyma9wPHo

The process of ML implementation can be summed up as:

The process of minimizing the error between predictions and an unknown future.

The purpose of ML is almost always prediction of an unknown future with high accuracy, high speed, and high reliability. Models return a class or a value, and sometimes words or text.

Therefore, as a machine learning engineer, the main concern should be: “How can we reduce errors and make accurate predictions?”

The answer to this fundamental question is the topic of this post: Gradient Descent.

2. Bias vs. Variance in Error Function

Error = Bias² + Variance + Noise

  • Bias: the gap between the model’s predictions and the training data
  • Variance: the model’s sensitivity to the particular training set (its complexity)
  • Noise: irreducible error inherent in the data itself

A trade-off between Bias and Variance

To decrease Bias, you need to increase the Variance (complexity) of the model.

To decrease Variance, you have to accept some growth in Bias.

  • Model complexity ranges from linear models up to tree-based ensembles
  • Aim for the sweet spot where total error is as low as possible
  • Noise is always present, which means 100% accuracy is impossible
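The trade-off can be seen numerically. Below is a minimal sketch (the data set, noise level, and the two models are hypothetical, chosen only for illustration): a constant predictor underfits (high bias), while a 1-nearest-neighbour predictor memorizes the training noise (high variance) and drives its training error to zero.

```python
import random

random.seed(0)

def make_data(n):
    # Hypothetical data from y = 2x + Gaussian noise
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 0.3) for x in xs]
    return xs, ys

train_x, train_y = make_data(50)
test_x, test_y = make_data(50)

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# High-bias model: always predict the training mean (underfits)
mean_y = sum(train_y) / len(train_y)
bias_train = mse([mean_y] * len(train_y), train_y)
bias_test = mse([mean_y] * len(test_y), test_y)

# High-variance model: 1-nearest-neighbour (memorizes training noise)
def knn1(x):
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

var_train = mse([knn1(x) for x in train_x], train_y)  # exactly 0: each point is its own neighbour
var_test = mse([knn1(x) for x in test_x], test_y)     # > 0: the memorized noise does not generalize
```

The 1-NN model looks perfect on the training set, but its test error exposes the variance it picked up from the noise.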

3. Underfitting vs. Overfitting

https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

Why does this happen?

The causes vary from case to case, but are mainly these two:

  • The training data set is too small
  • The model is too complex (or too simple) for the task

Thus the solutions for Underfitting and Overfitting are usually:

  • Increase the amount of data
  • Apply Regularization
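As a small illustration of regularization (a hypothetical 1-D ridge regression with made-up numbers, not code from the post): the L2 penalty λ shrinks the learned weight, trading a little bias for lower variance.

```python
def ridge_weight(xs, ys, lam):
    # 1-D ridge regression (no intercept): minimize Σ(wx − y)² + λw²
    # Closed form: w = Σ x·y / (Σ x² + λ)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Toy data roughly following y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

w_ols = ridge_weight(xs, ys, 0.0)     # λ = 0: ordinary least squares
w_ridge = ridge_weight(xs, ys, 10.0)  # larger λ shrinks the weight toward 0
```

The larger λ is, the more the weight is pulled toward zero, which is exactly how regularization tames an over-complex fit.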

4. A list of Gradient Descent Methods

Gradient Descent: All You Need to Know

What is Gradient Descent?

Gradient descent is a method for minimizing an objective function parameterized by a model’s parameters. It works by repeatedly updating the parameters in the direction opposite to the gradient of the objective function.

The learning rate determines how large steps are taken to reach the (local) minimum.

In other words, imagine the surface defined by the objective function: we keep moving down the slope of that surface until we reach a valley.
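That update rule, x ← x − η∇f(x), fits in a few lines. A toy 1-D sketch (the function and learning rate here are illustrative, not from the post):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Repeatedly step opposite to the gradient: x ← x − η ∇f(x)
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Toy objective f(x) = (x − 3)², so ∇f(x) = 2(x − 3); the minimum is at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

The learning rate `lr` is the step size mentioned above: too small and convergence crawls, too large and the iterate overshoots the valley.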

Here are the three best-known methods:

  1. Stochastic Gradient Descent
  2. Mini-batch Gradient Descent
  3. Batch Gradient Descent

4–2. Comparison of gradient descent methods

Linear Regression- Deep View(Part 1)

1. Stochastic Gradient Descent

Stochastic gradient descent shuffles the training data, randomly extracts one sample at a time, computes the error on that sample, and updates the parameters. Each individual update is noisier than a full batch gradient descent step. However, when new training data arrives, you can re-learn starting from the previous weight vector, so the computational cost of re-learning is overwhelmingly low.
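A minimal sketch of that procedure (hypothetical data drawn from y = 4x with no intercept; the learning rate and epoch count are illustrative):

```python
import random

random.seed(1)
# Hypothetical noiseless data from y = 4x
data = [(x / 10, 4 * x / 10) for x in range(1, 11)]

w = 0.0   # single weight to learn; the true value is 4
lr = 0.1
for epoch in range(200):
    random.shuffle(data)            # shuffle, then update on one sample at a time
    for x, y in data:
        grad = 2 * (w * x - y) * x  # gradient of (wx − y)² for this single sample
        w -= lr * grad
```

Contrast with batch gradient descent, which would average the gradient over all ten samples before making a single update per epoch.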

5. A list of Optimization Algorithms of GD

Keyword:

  • Global vs. Local Minimum
  • Learning Rate

Which optimizer to choose?

SGD > Adam?? Which One Is The Best Optimizer: Dogs-VS-Cats Toy Experiment

Here I list several Optimization Algorithms for Gradient Descent:

  1. Momentum
  2. Adagrad
  3. Adadelta
  4. RMSprop
  5. Adam

Let’s take a look at all of them!

1. Momentum

The momentum term grows for dimensions whose gradients keep pointing in the same direction, and shrinks the update for dimensions whose gradients keep changing direction. As a result, convergence speeds up and oscillation is suppressed.
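A sketch of the update rule v ← βv + η∇f(x), x ← x − v, on the same kind of toy 1-D problem (function and hyperparameters are illustrative):

```python
def momentum_gd(grad, x0, lr=0.1, beta=0.9, steps=200):
    # The velocity v accumulates gradients that keep pointing the same way,
    # and partially cancels out when the gradient flips sign.
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + lr * grad(x)
        x = x - v
    return x

# Toy problem: minimize f(x) = (x − 3)²
x_min = momentum_gd(lambda x: 2 * (x - 3), x0=0.0)
```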

2. Adagrad

Adagrad adapts the learning rate to each parameter individually: it performs larger updates for infrequently updated parameters and smaller updates for frequently updated ones. For this reason, it is well suited to sparse data.
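A minimal 1-D sketch of the Adagrad update (toy problem; in practice the squared-gradient accumulator is kept per parameter, here it is a single scalar):

```python
def adagrad(grad, x0, lr=0.5, eps=1e-8, steps=500):
    # Accumulate squared gradients; parameters that keep receiving large
    # gradients get an ever-smaller effective step size.
    x, g2 = x0, 0.0
    for _ in range(steps):
        g = grad(x)
        g2 += g * g
        x = x - lr * g / ((g2 ** 0.5) + eps)
    return x

# Toy problem: minimize f(x) = (x − 3)²
x_min = adagrad(lambda x: 2 * (x - 3), x0=0.0)
```

The monotonically growing accumulator `g2` is also Adagrad’s weakness: the effective learning rate only ever shrinks, which is what the next two methods address.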

3. Adadelta

Adadelta is an evolution of Adagrad that seeks to prevent the rapid, monotonic decay of the learning rate. Instead of accumulating the squares of all past gradients, Adadelta restricts the accumulation to a window of recent gradients, implemented as an exponentially decaying average.
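A sketch of the Adadelta update on the same toy problem (hyperparameters are illustrative; the epsilon here is deliberately larger than the usual default so the toy example moves quickly):

```python
def adadelta(grad, x0, rho=0.95, eps=1e-4, steps=3000):
    # No global learning rate: the step is the ratio of two decaying
    # averages — of past squared updates and of past squared gradients.
    x, eg2, edx2 = x0, 0.0, 0.0
    for _ in range(steps):
        g = grad(x)
        eg2 = rho * eg2 + (1 - rho) * g * g
        dx = -(((edx2 + eps) ** 0.5) / ((eg2 + eps) ** 0.5)) * g
        edx2 = rho * edx2 + (1 - rho) * dx * dx
        x = x + dx
    return x

# Toy problem: minimize f(x) = (x − 3)²
x_min = adadelta(lambda x: 2 * (x - 3), x0=0.0)
```

Note there is no `lr` argument at all: the numerator’s running average of past updates plays that role.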

4. RMSprop

RMSprop divides the learning rate by a moving average of recent squared gradients. It tries to resolve the problem that gradients may vary widely in magnitude: some gradients may be tiny and others may be huge, which makes it very difficult to find a single global learning rate for the algorithm.
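A 1-D sketch of that normalization (toy problem, illustrative hyperparameters):

```python
def rmsprop(grad, x0, lr=0.1, rho=0.9, eps=1e-8, steps=500):
    # Unlike Adagrad's ever-growing sum, eg2 is a decaying average,
    # so the effective step size can recover after large gradients.
    x, eg2 = x0, 0.0
    for _ in range(steps):
        g = grad(x)
        eg2 = rho * eg2 + (1 - rho) * g * g
        x = x - lr * g / ((eg2 ** 0.5) + eps)
    return x

# Toy problem: minimize f(x) = (x − 3)²
x_min = rmsprop(lambda x: 2 * (x - 3), x0=0.0)
```

Dividing by the root of `eg2` makes every step roughly the same scale regardless of whether the raw gradient is tiny or huge.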

5. Adam

Adam (Adaptive Moment Estimation) is another method that computes an adaptive learning rate for each parameter. Like Adadelta and RMSprop, it keeps an exponentially decaying average of past squared gradients.

In addition, Adam also keeps an exponentially decaying average of the past gradients themselves, similar to Momentum.
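Putting the two averages together gives the Adam update. A toy 1-D sketch (illustrative hyperparameters, close to the commonly cited defaults):

```python
def adam(grad, x0, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    # m: decaying average of gradients (the Momentum-like part)
    # v: decaying average of squared gradients (the RMSprop-like part)
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)   # bias correction: the averages start at 0
        v_hat = v / (1 - b2 ** t)
        x = x - lr * m_hat / ((v_hat ** 0.5) + eps)
    return x

# Toy problem: minimize f(x) = (x − 3)²
x_min = adam(lambda x: 2 * (x - 3), x0=0.0)
```

The bias-correction terms matter early on: without them, `m` and `v` are biased toward zero in the first iterations because both start at 0.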


Akira Takezawa
Coldstart.ml

Data Scientist, Rakuten / a discipline of statistical causal inference and time-series modeling / using Python and Stan, R / MLOps is my current concern