Exponential Moving Average
Why do we need a moving average?
Generally, we calculate the moving average of a stock price, or of the dependent variable of a time series, so that we get a smoothed curve instead of the initial noisy data points. The smoothed graph can then be used to analyse long-term trends, since the noisy short-term fluctuations have been averaged out.
Weighted Moving Average
Averaging over the dependent variable means considering a window ending at each timestamp and averaging the values within that window to get the moving-average (MA) value at that point.
A weighted moving average simply assigns different weights to the data points depending on when they occurred: data points in the immediate past are weighted more heavily than data points from long ago.
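As a small illustration in Python (the window size and weights below are arbitrary choices, not from the original text):

```python
def moving_average(values, window=3):
    """Simple moving average: each point is the unweighted mean of the last `window` values."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result

def weighted_moving_average(values, weights=(0.2, 0.3, 0.5)):
    """Weighted moving average: more recent points get larger weights (last weight = most recent)."""
    window = len(weights)
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        w = weights[-len(chunk):]
        result.append(sum(c * wi for c, wi in zip(chunk, w)) / sum(w))
    return result

prices = [10, 12, 11, 13, 15, 14]
print(moving_average(prices))
print(weighted_moving_average(prices))
```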
Example of an Exponentially Weighted Average:
Consider a time series of temperature readings over time.
The raw data points are noisy, and we calculate the exponential moving average using the formula below:
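A standard form of this exponentially weighted average, assuming β denotes the smoothing factor (0 < β < 1), θ_t the raw temperature at time t, and v_t the smoothed value:

v_t = β · v_(t−1) + (1 − β) · θ_t,   with v_0 = 0

A β close to 1 averages over a longer effective window (roughly 1 / (1 − β) points) and gives a smoother curve.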
Why “Exponential Moving Average”?
If we expand the recursion, the coefficients of the final equation form an exponentially decaying curve, and the temperatures at earlier time points are multiplied by these decaying weights. Since the decay is exponential, we get the name Exponential Moving Average.
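Unrolling the same recursion for a few steps makes the decay explicit:

v_t = (1 − β)·θ_t + (1 − β)·β·θ_(t−1) + (1 − β)·β²·θ_(t−2) + (1 − β)·β³·θ_(t−3) + …

so the weight applied to a temperature that is k steps old is (1 − β)·β^k, which falls off exponentially with k.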
Implementing the Exponential Moving Average algorithm
Benefits of the algorithm in Machine Learning: it is computationally efficient and has a small memory footprint, since only the previous average value needs to be stored at each step.
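A minimal sketch of the computation in Python (the function name, the example temperatures, and β = 0.9 are illustrative choices, not from the original text):

```python
def exponential_moving_average(values, beta=0.9):
    """Compute the EMA of a sequence using v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    ema, v = [], 0.0
    for theta in values:
        v = beta * v + (1 - beta) * theta  # only the previous value v is kept in memory
        ema.append(v)
    return ema

# Smoothing a small noisy temperature series (illustrative numbers)
temperatures = [13.2, 14.1, 12.8, 15.0, 16.3, 15.7, 17.1]
print(exponential_moving_average(temperatures))
```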
Comparison to another optimisation algorithm: Gradient Descent
The problem with gradient descent is that the weight update at a moment (t) is governed only by the learning rate and the gradient at that moment; it does not take into account the past steps taken while traversing the cost surface. This means it can easily get stuck at saddle points. By adding a momentum term to gradient descent, gradients accumulated from past iterations keep pushing the parameters past a saddle point even when the current gradient is negligible or zero. This momentum term is computed using an Exponential Moving Average of the gradients.
Usage
- As a standalone convergence algorithm, or
- In combination with gradient descent (Gradient Descent with Momentum), using the modified update step sketched below:
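A minimal sketch in Python, assuming the common formulation v_t = β·v_(t−1) + (1 − β)·∇w and w ← w − α·v_t (some texts write the momentum term without the (1 − β) factor; the function name, constants, and toy cost below are illustrative):

```python
import numpy as np

def gradient_descent_with_momentum(grad_fn, w, lr=0.1, beta=0.9, steps=200):
    """Gradient descent where the update direction is an EMA of past gradients (momentum)."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        v = beta * v + (1 - beta) * g   # accumulate gradients with an exponential moving average
        w = w - lr * v                  # move using the smoothed gradient instead of the raw one
    return w

# Toy example: minimise f(w) = ||w||^2, whose gradient is 2w
w_final = gradient_descent_with_momentum(lambda w: 2 * w, np.array([5.0, -3.0]))
print(w_final)  # converges towards [0, 0]
```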
Conclusion
When training a model, it is often beneficial to maintain moving averages of the trained parameters. Evaluations that use the averaged parameters sometimes produce significantly better results than the final trained values. In this case the EMA is initialised on the weight parameters immediately after the model weights are created, and then updated after each training step. Thus, inside the training loop, the EMA update is applied after the optimiser step.
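A minimal sketch of this pattern in Python, using plain gradient descent on a toy linear-regression problem (the data, decay value, and learning rate are illustrative assumptions, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 0.1 * rng.normal(size=200)

w = np.zeros(3)                  # model weights
ema_w = w.copy()                 # EMA initialised right after the weights are created
lr, decay = 0.1, 0.99

for step in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)     # gradient of the mean squared error
    w = w - lr * grad                          # optimiser step (plain gradient descent)
    ema_w = decay * ema_w + (1 - decay) * w    # EMA update applied after the optimiser step

print("final weights:", w)
print("EMA of weights:", ema_w)   # evaluation can use these instead of the final raw weights
```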
References:
https://towardsdatascience.com/gradient-descent-with-momentum-59420f626c8f