#08 Gradient Descent Algorithm Optimization Recipes #ShortcutML

Optimization Algorithms for Gradient Descent

Akira Takezawa
Coldstart.ml
5 min read · Mar 18, 2019


Hola! Welcome to the #ShortcutML series! An ML cheat note for everyone!

Menu

  1. The goal of all ML Implementation
  2. Bias vs. Variance in Error Function
  3. Underfitting vs. Overfitting
  4. Gradient Descent Methods List
  5. Optimization Algorithms of GD List

1. The goal of all ML Implementation

https://www.youtube.com/watch?v=b4Vyma9wPHo

The process of ML implementation can be summed up as:

The process of minimizing the error between predictions and an unknown future.

The purpose of ML is almost always prediction of an unknown future with high accuracy, high speed, and high reliability. Models return a class or a value, and sometimes words or text.

Therefore, as a machine learning engineer, the main concern should be: “How can we reduce errors and make accurate predictions?”

The answer to this fundamental question is the topic of this post: Gradient Descent.

2. Bias vs. Variance in Error Function

Error = Bias² + Variance + Noise

  • Bias: the gap between the model’s predictions and the training data
  • Variance: the model’s sensitivity to the particular training set (its complexity)
  • Noise: irreducible error inherent in the data itself

A trade-off between Bias and Variance

To decrease Bias, you need to increase the Variance (complexity) of the model.

To decrease Variance, you have to accept some growth in Bias.

  • Model complexity ranges from linear models up to tree-based ensembles
  • Aim for the sweet spot where total error is as low as possible
  • Noise is always present, which means 100% accuracy is impossible
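The trade-off can be seen numerically. Below is a minimal sketch (the data set, noise level, and the two models are hypothetical, chosen only for illustration): a constant predictor underfits (high bias), while a 1-nearest-neighbour predictor memorizes the training noise (high variance) and drives its training error to zero.

```python
import random

random.seed(0)

def make_data(n):
    # Hypothetical data from y = 2x + Gaussian noise
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 0.3) for x in xs]
    return xs, ys

train_x, train_y = make_data(50)
test_x, test_y = make_data(50)

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# High-bias model: always predict the training mean (underfits)
mean_y = sum(train_y) / len(train_y)
bias_train = mse([mean_y] * len(train_y), train_y)
bias_test = mse([mean_y] * len(test_y), test_y)

# High-variance model: 1-nearest-neighbour (memorizes training noise)
def knn1(x):
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

var_train = mse([knn1(x) for x in train_x], train_y)  # exactly 0: each point is its own neighbour
var_test = mse([knn1(x) for x in test_x], test_y)     # > 0: the memorized noise does not generalize
```

The 1-NN model looks perfect on the training set, but its test error exposes the variance it picked up from the noise.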

3. Underfitting vs. Overfitting

https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

Why does this happen?

The causes vary from case to case, but are mainly these two:

  • The training data set is too small
  • The model is too complex (or too simple) for the task

Thus the solutions for Underfitting and Overfitting are usually:

  • Increase the amount of data
  • Apply Regularization
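As a small illustration of regularization (a hypothetical 1-D ridge regression with made-up numbers, not code from the post): the L2 penalty λ shrinks the learned weight, trading a little bias for lower variance.

```python
def ridge_weight(xs, ys, lam):
    # 1-D ridge regression (no intercept): minimize Σ(wx − y)² + λw²
    # Closed form: w = Σ x·y / (Σ x² + λ)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Toy data roughly following y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

w_ols = ridge_weight(xs, ys, 0.0)     # λ = 0: ordinary least squares
w_ridge = ridge_weight(xs, ys, 10.0)  # larger λ shrinks the weight toward 0
```

The larger λ is, the more the weight is pulled toward zero, which is exactly how regularization tames an over-complex fit.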

4. A list of Gradient Descent Methods

Gradient Descent: All You Need to Know

What is Gradient Descent?

Gradient descent is a method for minimizing an objective function parameterized by a model’s parameters. It works by repeatedly updating the parameters in the direction opposite to the gradient of the objective function.

The learning rate determines how large steps are taken to reach the (local) minimum.

In other words, imagine the surface defined by the objective function: we keep moving down the slope of that surface until we reach a valley.
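That update rule, x ← x − η∇f(x), fits in a few lines. A toy 1-D sketch (the function and learning rate here are illustrative, not from the post):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Repeatedly step opposite to the gradient: x ← x − η ∇f(x)
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Toy objective f(x) = (x − 3)², so ∇f(x) = 2(x − 3); the minimum is at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

The learning rate `lr` is the step size mentioned above: too small and convergence crawls, too large and the iterate overshoots the valley.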

Here are the three best-known methods:

  1. Stochastic Gradient Descent
  2. Mini-batch Gradient Descent
  3. Batch Gradient Descent

4–2. Comparison of gradient descent methods

Linear Regression- Deep View(Part 1)

1. Stochastic Gradient Descent

Stochastic gradient descent shuffles the training data, randomly extracts one sample at a time, computes the error on that sample, and updates the parameters. Each individual update is noisier than a full batch gradient descent step. However, when new training data arrives, you can re-learn starting from the previous weight vector, so the computational cost of re-learning is overwhelmingly low.
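A minimal sketch of that procedure (hypothetical data drawn from y = 4x with no intercept; the learning rate and epoch count are illustrative):

```python
import random

random.seed(1)
# Hypothetical noiseless data from y = 4x
data = [(x / 10, 4 * x / 10) for x in range(1, 11)]

w = 0.0   # single weight to learn; the true value is 4
lr = 0.1
for epoch in range(200):
    random.shuffle(data)            # shuffle, then update on one sample at a time
    for x, y in data:
        grad = 2 * (w * x - y) * x  # gradient of (wx − y)² for this single sample
        w -= lr * grad
```

Contrast with batch gradient descent, which would average the gradient over all ten samples before making a single update per epoch.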

5. A list of Optimization Algorithms of GD

Keyword:

  • Global vs. Local Minimum
  • Learning Rate

Which optimizer to choose?

SGD > Adam?? Which One Is The Best Optimizer: Dogs-VS-Cats Toy Experiment

Here I list several Optimization Algorithms for Gradient Descent:

  1. Momentum
  2. Adagrad
  3. Adadelta
  4. RMSprop
  5. Adam

Let’s take a look at all of them!

1. Momentum

The momentum term grows for dimensions whose gradients keep pointing in the same direction, and shrinks the update for dimensions whose gradients keep changing direction. As a result, convergence speeds up and oscillation is suppressed.
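A sketch of the update rule v ← βv + η∇f(x), x ← x − v, on the same kind of toy 1-D problem (function and hyperparameters are illustrative):

```python
def momentum_gd(grad, x0, lr=0.1, beta=0.9, steps=200):
    # The velocity v accumulates gradients that keep pointing the same way,
    # and partially cancels out when the gradient flips sign.
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + lr * grad(x)
        x = x - v
    return x

# Toy problem: minimize f(x) = (x − 3)²
x_min = momentum_gd(lambda x: 2 * (x - 3), x0=0.0)
```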

2. Adagrad

Adagrad adapts the learning rate to each parameter individually: it performs larger updates for infrequently updated parameters and smaller updates for frequently updated ones. For this reason, it is well suited to sparse data.
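A minimal 1-D sketch of the Adagrad update (toy problem; in practice the squared-gradient accumulator is kept per parameter, here it is a single scalar):

```python
def adagrad(grad, x0, lr=0.5, eps=1e-8, steps=500):
    # Accumulate squared gradients; parameters that keep receiving large
    # gradients get an ever-smaller effective step size.
    x, g2 = x0, 0.0
    for _ in range(steps):
        g = grad(x)
        g2 += g * g
        x = x - lr * g / ((g2 ** 0.5) + eps)
    return x

# Toy problem: minimize f(x) = (x − 3)²
x_min = adagrad(lambda x: 2 * (x - 3), x0=0.0)
```

The monotonically growing accumulator `g2` is also Adagrad’s weakness: the effective learning rate only ever shrinks, which is what the next two methods address.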

3. Adadelta

Adadelta is an evolution of Adagrad that seeks to prevent the rapid, monotonic decay of the learning rate. Instead of accumulating the squares of all past gradients, Adadelta restricts the accumulation to a window of recent gradients, implemented as an exponentially decaying average.
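A sketch of the Adadelta update on the same toy problem (hyperparameters are illustrative; the epsilon here is deliberately larger than the usual default so the toy example moves quickly):

```python
def adadelta(grad, x0, rho=0.95, eps=1e-4, steps=3000):
    # No global learning rate: the step is the ratio of two decaying
    # averages — of past squared updates and of past squared gradients.
    x, eg2, edx2 = x0, 0.0, 0.0
    for _ in range(steps):
        g = grad(x)
        eg2 = rho * eg2 + (1 - rho) * g * g
        dx = -(((edx2 + eps) ** 0.5) / ((eg2 + eps) ** 0.5)) * g
        edx2 = rho * edx2 + (1 - rho) * dx * dx
        x = x + dx
    return x

# Toy problem: minimize f(x) = (x − 3)²
x_min = adadelta(lambda x: 2 * (x - 3), x0=0.0)
```

Note there is no `lr` argument at all: the numerator’s running average of past updates plays that role.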

4. RMSprop

RMSprop divides the learning rate by a moving average of recent squared gradients. It tries to resolve the problem that gradients may vary widely in magnitude: some gradients may be tiny and others may be huge, which makes it very difficult to find a single global learning rate for the algorithm.
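A 1-D sketch of that normalization (toy problem, illustrative hyperparameters):

```python
def rmsprop(grad, x0, lr=0.1, rho=0.9, eps=1e-8, steps=500):
    # Unlike Adagrad's ever-growing sum, eg2 is a decaying average,
    # so the effective step size can recover after large gradients.
    x, eg2 = x0, 0.0
    for _ in range(steps):
        g = grad(x)
        eg2 = rho * eg2 + (1 - rho) * g * g
        x = x - lr * g / ((eg2 ** 0.5) + eps)
    return x

# Toy problem: minimize f(x) = (x − 3)²
x_min = rmsprop(lambda x: 2 * (x - 3), x0=0.0)
```

Dividing by the root of `eg2` makes every step roughly the same scale regardless of whether the raw gradient is tiny or huge.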

5. Adam

Adam (Adaptive Moment Estimation) is another method that computes an adaptive learning rate for each parameter. Like Adadelta and RMSprop, it keeps an exponentially decaying average of past squared gradients.

In addition, Adam also keeps an exponentially decaying average of the past gradients themselves, similar to Momentum.
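Putting the two averages together gives the Adam update. A toy 1-D sketch (illustrative hyperparameters, close to the commonly cited defaults):

```python
def adam(grad, x0, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    # m: decaying average of gradients (the Momentum-like part)
    # v: decaying average of squared gradients (the RMSprop-like part)
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)   # bias correction: the averages start at 0
        v_hat = v / (1 - b2 ** t)
        x = x - lr * m_hat / ((v_hat ** 0.5) + eps)
    return x

# Toy problem: minimize f(x) = (x − 3)²
x_min = adam(lambda x: 2 * (x - 3), x0=0.0)
```

The bias-correction terms matter early on: without them, `m` and `v` are biased toward zero in the first iterations because both start at 0.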


Akira Takezawa
Coldstart.ml

Data Scientist, Rakuten / a discipline of statistical causal inference and time-series modeling / using Python and Stan, R / MLOps is my current concern