Exponentially Weighted Average
The Exponentially Weighted Moving Average (EWMA) is commonly used as a smoothing technique in time series. Thanks to several computational advantages (it is fast and has a low memory cost), the EWMA also works behind the scenes in many deep learning optimization algorithms, including Gradient Descent with Momentum, RMSprop, and Adam.
To compute the EWMA, you must choose a single parameter β. This parameter decides how much weight the current observation receives in the calculation of the EWMA.
Let's work through an example based on the daily temperatures of Paris, France, in 2019 ([1]).
Define θₜ as the temperature on day t, and Wₜ as the exponentially weighted average of the temperatures up to day t, starting from W₀ = 0.
For this example, suppose that β = 0.9, so the EWA combines the temperature of the current day with the previous temperatures.
In general, to compute the EWA for a given weight parameter β we use:

Wₜ = β·Wₜ₋₁ + (1−β)·θₜ
If we plot this in red, we can see that what we get is a moving average of the daily temperature: a smoother, less noisy curve.
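The recurrence above can be sketched in a few lines of Python. The temperature values here are synthetic placeholders, not the actual Paris data:

```python
def ewma(observations, beta=0.9):
    """Exponentially weighted moving average via the recurrence
    W_t = beta * W_{t-1} + (1 - beta) * theta_t, starting from W_0 = 0."""
    w = 0.0
    averages = []
    for theta in observations:
        w = beta * w + (1 - beta) * theta
        averages.append(w)
    return averages

# Synthetic daily temperatures (placeholder for the Paris 2019 data)
temps = [5, 7, 6, 8, 12, 11, 13, 15, 14, 16]
smoothed = ewma(temps, beta=0.9)
```

Plotting `smoothed` against `temps` reproduces the red curve described above.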
Let's unpack the general equation a bit more:
We can see that β determines how much weight the previous value (the trend) receives, and (1−β) how much weight the current value receives.
Take a value of β = 0.98 and plot it in green: the curve is smoother because the trend now carries more weight (and the current temperature value less), so it adapts more slowly when the temperature changes.
Let's try the other extreme and set β = 0.5. The graph you get this way is noisier, because it is more susceptible to the current temperature (including outliers), but it adapts more quickly to changes in temperature.
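A quick way to see this trade-off is to run the same recurrence with the three values of β used above on a series with a sudden jump. This is a sketch on synthetic data, not the Paris temperatures:

```python
def ewma(observations, beta):
    """W_t = beta * W_{t-1} + (1 - beta) * theta_t, with W_0 = 0."""
    w, out = 0.0, []
    for theta in observations:
        w = beta * w + (1 - beta) * theta
        out.append(w)
    return out

# A series with a sudden jump from 10 to 20 degrees at index 20.
temps = [10.0] * 20 + [20.0] * 20

# A few steps after the jump, smaller beta has moved much closer
# to the new level, while larger beta still lags behind.
after_jump = {beta: ewma(temps, beta)[25] for beta in (0.5, 0.9, 0.98)}
```

The dictionary shows the smoothed value five steps after the jump: it is highest for β = 0.5 and lowest for β = 0.98.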
If you want to understand the meaning of the parameter β, you can think of the value

1 / (1 − β)

as the approximate number of observations used to adapt your EWA.
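As a quick illustration of this rule of thumb for the three values of β used in this article:

```python
# Rule of thumb: an EWA with parameter beta averages over roughly
# 1 / (1 - beta) recent observations.
for beta in (0.5, 0.9, 0.98):
    window = 1 / (1 - beta)
    print(f"beta = {beta}: ~{window:.0f} observations")
# beta = 0.5: ~2 observations
# beta = 0.9: ~10 observations
# beta = 0.98: ~50 observations
```

This matches the plots: β = 0.98 behaves like an average over about 50 days, which is why its curve is so smooth and slow to adapt.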
To go a little deeper into the intuition of what this algorithm actually does, let's expand the 3rd term (W₃) using the main equation:

W₃ = β·W₂ + (1−β)·θ₃
Plugging W₂ = β·W₁ + (1−β)·θ₂ and W₁ = (1−β)·θ₁ (recall W₀ = 0) into W₃:

W₃ = β·(β·(1−β)·θ₁ + (1−β)·θ₂) + (1−β)·θ₃
Simplifying:

W₃ = (1−β)·θ₃ + β·(1−β)·θ₂ + β²·(1−β)·θ₁
Here the role of the β = 0.9 parameter in the EWA is quite clear: older observations are given lower weights. The weights fall off exponentially as a data point gets older, hence the name exponentially weighted.
In general, we have:

Wₜ = (1−β)·θₜ + β·(1−β)·θₜ₋₁ + β²·(1−β)·θₜ₋₂ + … + βᵗ⁻¹·(1−β)·θ₁
or the closed formula:

Wₜ = (1−β) · Σᵢ₌₀ᵗ⁻¹ βⁱ·θₜ₋ᵢ
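The closed formula can be checked numerically against the recurrence. A small sanity-check sketch, using arbitrary made-up temperatures:

```python
def ewma_recursive(thetas, beta):
    """Apply W_t = beta * W_{t-1} + (1 - beta) * theta_t and return the final W_t."""
    w = 0.0
    for theta in thetas:
        w = beta * w + (1 - beta) * theta
    return w

def ewma_closed(thetas, beta):
    """Closed formula: W_t = (1 - beta) * sum_{i=0}^{t-1} beta**i * theta_{t-i}."""
    t = len(thetas)
    return (1 - beta) * sum(beta ** i * thetas[t - 1 - i] for i in range(t))

temps = [5.0, 7.0, 6.0, 8.0, 12.0]
recursive = ewma_recursive(temps, 0.9)
closed = ewma_closed(temps, 0.9)
```

Both computations give the same value, confirming that unrolling the recurrence yields the closed formula.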
If you are a visual learner, this is another approach:
Rewrite Wₜ using the dot product (·):

Wₜ = [(1−β), β·(1−β), …, βᵗ⁻¹·(1−β)] · [θₜ, θₜ₋₁, …, θ₁]
And then think Wₜ as the product of the following two plots:
- Exponential Weights
- Temperature
In this plot we see how the weights decay as t increases:
Then plot the temperature:

Now, to get Wₜ, just multiply each pair of points (weight and temperature) and add them up; doing this for each t gives the magenta curve.
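This dot-product view can be sketched in plain Python (again with made-up temperature values):

```python
def ewma_dot(thetas, beta):
    """Compute W_t as a dot product of exponential weights and observations."""
    t = len(thetas)
    # Weights for theta_t, theta_{t-1}, ..., theta_1 (newest observation first).
    weights = [(1 - beta) * beta ** i for i in range(t)]
    newest_first = list(reversed(thetas))
    # Multiply each weight by its observation and add everything up.
    return sum(w * theta for w, theta in zip(weights, newest_first))

temps = [5.0, 7.0, 6.0, 8.0, 12.0]
w_t = ewma_dot(temps, 0.9)
```

This returns exactly the same value as running the recurrence Wₜ = β·Wₜ₋₁ + (1−β)·θₜ over the same five observations.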
References:
- [1] University of Dayton — Environmental Protection Agency Average Daily Temperature Archive.
- [2] Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization — online course by Andrew Ng: https://www.coursera.org/learn/neural-networks-deep-learning?specialization=deep-learning
- GitHub repository: https://github.com/tobias-chc/Exponentially-WA