Exponentially Weighted Average
The Exponentially Weighted Moving Average (EWMA) is commonly used as a smoothing technique in time series. Thanks to several computational advantages (it is fast and has a low memory cost), the EWMA also works behind the scenes in many deep learning optimization algorithms, including Gradient Descent with Momentum, RMSprop, and Adam.
To compute the EWMA, you must choose a single parameter β. This parameter decides how much weight the current observation receives in the calculation of the EWMA.
Let's work through an example based on the daily temperatures of Paris, France, in 2019 ([1]).
Define θₜ as the temperature on day t, and Wₜ as the exponentially weighted average of the temperatures up to day t, starting from W₀ = 0.
For this example, suppose that β = 0.9, so the EWA combines the temperature of the current day with the previous temperatures.
In general, to compute the EWA for a given weight parameter β we use:

Wₜ = β·Wₜ₋₁ + (1−β)·θₜ
If we plot this in red, we can see that what we get is a moving average of the daily temperature: a smoother, less noisy curve.
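The recurrence above can be sketched in a few lines of Python. The temperature values here are synthetic placeholders, not the actual Paris data:

```python
def ewma(observations, beta=0.9):
    """Exponentially weighted moving average via the recurrence
    W_t = beta * W_{t-1} + (1 - beta) * theta_t, starting from W_0 = 0."""
    w = 0.0
    averages = []
    for theta in observations:
        w = beta * w + (1 - beta) * theta
        averages.append(w)
    return averages

# Synthetic daily temperatures (placeholder for the Paris 2019 data)
temps = [5, 7, 6, 8, 12, 11, 13, 15, 14, 16]
smoothed = ewma(temps, beta=0.9)
```

Plotting `smoothed` against `temps` reproduces the red curve described above.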
Let's unpack the general equation a bit more:
We can see that β determines how much weight the previous value (the trend) receives, and (1−β) how much weight the current value receives.
Take a value of β = 0.98 and plot it in green: the curve is smoother because the trend now carries more weight (and the current temperature value less), so it adapts more slowly when the temperature changes.
Let's try the other extreme and set β = 0.5. The graph you get this way is noisier, because it is more susceptible to the current temperature (including outliers), but it adapts more quickly to changes in temperature.
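A quick way to see this trade-off is to run the same recurrence with the three values of β used above on a series with a sudden jump. This is a sketch on synthetic data, not the Paris temperatures:

```python
def ewma(observations, beta):
    """W_t = beta * W_{t-1} + (1 - beta) * theta_t, with W_0 = 0."""
    w, out = 0.0, []
    for theta in observations:
        w = beta * w + (1 - beta) * theta
        out.append(w)
    return out

# A series with a sudden jump from 10 to 20 degrees at index 20.
temps = [10.0] * 20 + [20.0] * 20

# A few steps after the jump, smaller beta has moved much closer
# to the new level, while larger beta still lags behind.
after_jump = {beta: ewma(temps, beta)[25] for beta in (0.5, 0.9, 0.98)}
```

The dictionary shows the smoothed value five steps after the jump: it is highest for β = 0.5 and lowest for β = 0.98.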
If you want to understand the meaning of the parameter β, you can think of the value

1 / (1 − β)

as the approximate number of observations used to adapt your EWA.
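As a quick illustration of this rule of thumb for the three values of β used in this article:

```python
# Rule of thumb: an EWA with parameter beta averages over roughly
# 1 / (1 - beta) recent observations.
for beta in (0.5, 0.9, 0.98):
    window = 1 / (1 - beta)
    print(f"beta = {beta}: ~{window:.0f} observations")
# beta = 0.5: ~2 observations
# beta = 0.9: ~10 observations
# beta = 0.98: ~50 observations
```

This matches the plots: β = 0.98 behaves like an average over about 50 days, which is why its curve is so smooth and slow to adapt.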
To go a little deeper into the intuition of what this algorithm actually does, let's expand the 3rd term (W₃) using the main equation:

W₃ = β·W₂ + (1−β)·θ₃
Plugging W₂ = β·W₁ + (1−β)·θ₂ and W₁ = (1−β)·θ₁ (recall W₀ = 0) into W₃:

W₃ = β·(β·(1−β)·θ₁ + (1−β)·θ₂) + (1−β)·θ₃
Simplifying:

W₃ = (1−β)·θ₃ + β·(1−β)·θ₂ + β²·(1−β)·θ₁
Here the role of the β = 0.9 parameter in the EWA is quite clear: older observations are given lower weights. The weights fall off exponentially as a data point gets older, hence the name exponentially weighted.
In general, we have:

Wₜ = (1−β)·θₜ + β·(1−β)·θₜ₋₁ + β²·(1−β)·θₜ₋₂ + … + βᵗ⁻¹·(1−β)·θ₁
or the closed formula:

Wₜ = (1−β) · Σᵢ₌₀ᵗ⁻¹ βⁱ·θₜ₋ᵢ
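The closed formula can be checked numerically against the recurrence. A small sanity-check sketch, using arbitrary made-up temperatures:

```python
def ewma_recursive(thetas, beta):
    """Apply W_t = beta * W_{t-1} + (1 - beta) * theta_t and return the final W_t."""
    w = 0.0
    for theta in thetas:
        w = beta * w + (1 - beta) * theta
    return w

def ewma_closed(thetas, beta):
    """Closed formula: W_t = (1 - beta) * sum_{i=0}^{t-1} beta**i * theta_{t-i}."""
    t = len(thetas)
    return (1 - beta) * sum(beta ** i * thetas[t - 1 - i] for i in range(t))

temps = [5.0, 7.0, 6.0, 8.0, 12.0]
recursive = ewma_recursive(temps, 0.9)
closed = ewma_closed(temps, 0.9)
```

Both computations give the same value, confirming that unrolling the recurrence yields the closed formula.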
If you are a visual learner, this is another approach:
Rewrite Wₜ using the dot product (·):

Wₜ = [(1−β), β·(1−β), …, βᵗ⁻¹·(1−β)] · [θₜ, θₜ₋₁, …, θ₁]
And then think Wₜ as the product of the following two plots:
- Exponential Weights
- Temperature
In this plot we see how the weights decay as t increases:
Then plot the temperature:

Now, to get Wₜ, just multiply each pair of points (weight and temperature) and add them up; doing this for each t gives the magenta curve.
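This dot-product view can be sketched in plain Python (again with made-up temperature values):

```python
def ewma_dot(thetas, beta):
    """Compute W_t as a dot product of exponential weights and observations."""
    t = len(thetas)
    # Weights for theta_t, theta_{t-1}, ..., theta_1 (newest observation first).
    weights = [(1 - beta) * beta ** i for i in range(t)]
    newest_first = list(reversed(thetas))
    # Multiply each weight by its observation and add everything up.
    return sum(w * theta for w, theta in zip(weights, newest_first))

temps = [5.0, 7.0, 6.0, 8.0, 12.0]
w_t = ewma_dot(temps, 0.9)
```

This returns exactly the same value as running the recurrence Wₜ = β·Wₜ₋₁ + (1−β)·θₜ over the same five observations.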
References:
- [1] University of Dayton — Environmental Protection Agency Average Daily Temperature Archive.
- [2] Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization — online course by Andrew Ng: https://www.coursera.org/learn/neural-networks-deep-learning?specialization=deep-learning
- GitHub repository: https://github.com/tobias-chc/Exponentially-WA