M4 Competition winner — Using ES with RNNs for time series forecasting

Aakash Gupta
3 min read · Jun 29, 2020

Link to the original article
GitHub link for the code

The winning solution was unique in its approach: it used a hybrid forecasting method that mixed exponential smoothing-inspired formulas, used for deseasonalizing and normalizing the series, with an advanced neural network.

The 3 main components of this solution were:

  1. De-seasonalization & Adaptive normalization
  2. Generation of forecasts using RNNs; and
  3. Ensembling

The data flow can be visualized as follows:

1. Exponential smoothing, de-seasonalization & adaptive normalization

  • All M4 series have positive values, so Holt and Holt-Winters models with multiplicative seasonality were used

  • These were simplified by removing the linear trend
  • Each series was pre-processed at each epoch, using a standard approach of constant-size, rolling input & output windows
  • The output window size equals the forecasting horizon; the input window size equals the size of one season
  • Input & output windows were normalized by dividing them by the last value of the level in the input window
  • Seasonal time series were then divided by the seasonality component
  • A squashing function, log(), was then applied to reduce the impact of outliers
  • The domain information was added to the time series features as a one-hot encoded vector (the preprocessing is sketched below)
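
A minimal NumPy sketch of this per-window preprocessing, assuming the exponential smoothing step has already produced per-step level and seasonality arrays; the function names and windowing layout are illustrative, not the author's actual code:

```python
import numpy as np

def preprocess_window(series, level, seasonality, in_size, out_size, start):
    """Sketch of the per-window preprocessing (hypothetical helper).

    series, level, seasonality: 1-D arrays of equal length from the ES step.
    Returns log-squashed, level-normalized, deseasonalized input/output windows.
    """
    x = series[start : start + in_size]
    y = series[start + in_size : start + in_size + out_size]

    # Normalize both windows by the last level value of the input window
    last_level = level[start + in_size - 1]
    x, y = x / last_level, y / last_level

    # Divide seasonal series by their seasonality component
    x = x / seasonality[start : start + in_size]
    y = y / seasonality[start + in_size : start + in_size + out_size]

    # log() squashing to reduce the impact of outliers
    return np.log(x), np.log(y)

def add_domain(x, domain_idx, n_domains=6):
    """Append the series' domain as a one-hot vector (illustrative)."""
    one_hot = np.zeros(n_domains)
    one_hot[domain_idx] = 1.0
    return np.concatenate([x, one_hot])
```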

2. Generating forecasts using RNNs

  • The neural networks are dilated LSTM-based stacks, sometimes followed by a non-linear layer & always followed by a linear adapter layer (see the sketch after this list)
  • The advantage is that instead of using only the previous hidden state, weights are applied to a number of past hidden states, akin to an attention mechanism
  • The figure below shows three example configurations: the first generates point forecasts (PFs) for the quarterly series; the second, PFs for the monthly series; and the third, prediction intervals (PIs) for the yearly series
  • One important implementation detail was the use of an adapted loss function
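
As a rough illustration of such a stack, here is a minimal PyTorch sketch. The dilation-by-subsampling trick, the layer sizes, the dilation factors, and the class name are all assumptions for illustration (and the attention over past hidden states is omitted); this is not the author's actual architecture:

```python
import torch
import torch.nn as nn

class DilatedLSTMStack(nn.Module):
    """Sketch of a dilated LSTM stack with a linear adapter head.

    Layer i sees every dilations[i]-th time step, a crude approximation
    of dilated recurrent connections.
    """
    def __init__(self, input_size, hidden_size, output_size, dilations=(1, 2, 4)):
        super().__init__()
        self.dilations = dilations
        self.cells = nn.ModuleList()
        for i in range(len(dilations)):
            in_size = input_size if i == 0 else hidden_size
            self.cells.append(nn.LSTM(in_size, hidden_size, batch_first=True))
        self.adapter = nn.Linear(hidden_size, output_size)  # linear adapter layer

    def forward(self, x):
        # x: (batch, time, features)
        out = x
        for lstm, d in zip(self.cells, self.dilations):
            sub = out[:, ::d, :]            # dilation via subsampling every d-th step
            h, _ = lstm(sub)
            # Upsample back by repetition so the layers can stack (sketch only)
            out = h.repeat_interleave(d, dim=1)[:, : out.size(1), :]
        return self.adapter(out[:, -1, :])  # forecast from the last hidden state
```

For example, `DilatedLSTMStack(input_size=1, hidden_size=32, output_size=8)` applied to a `(16, 24, 1)` batch of normalized windows yields a `(16, 8)` tensor of horizon-8 forecasts.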

Since the inputs of the NN were normalized, the author postulated that the loss function need not be normalized either. A pinball function was therefore used, with τ set to 0.5.

With the pinball function defined as

$$L_\tau(y, \hat{y}) = \begin{cases} \tau\,(y - \hat{y}), & y \ge \hat{y} \\ (1 - \tau)\,(\hat{y} - y), & \hat{y} > y \end{cases}$$

This makes the loss function asymmetric, penalizing values above & below the quantile differently, and thus overcoming the positive bias (which was probably due to the use of the squashing function).
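
A minimal NumPy sketch of this loss; the vectorized form and function name are illustrative, not the author's exact code:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau=0.5):
    """Pinball (quantile) loss.

    Under-predictions (y_true > y_pred) are weighted by tau,
    over-predictions by (1 - tau), so tau != 0.5 skews the penalty.
    """
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))
```

With τ = 0.5 this reduces to half the mean absolute error; a τ below 0.5 penalizes over-prediction more heavily, which is how a positive bias can be counteracted.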

3. Ensembling

  • Concurrently-trained models were used instead of a single model
  • This approach is known as an “ensemble of specialists”
  • A pool of models was created, and parts of the time series were randomly allocated to each model
  • Each model was trained on its allocated subset
  • Models were ranked by their performance on each series, and the top N best models were selected for each time series
  • These steps were repeated until the validation error started increasing (sketched below)
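
A schematic sketch of this training loop, assuming hypothetical `fit` and `error` methods on the model objects rather than the author's actual code:

```python
import random

def ensemble_of_specialists(models, series_pool, top_n, max_rounds=20):
    """Sketch of the "ensemble of specialists" procedure.

    models: list of independently trainable forecasting models.
    series_pool: list of time series.
    Returns, per series, its top-N best-performing models.
    """
    best_error = float("inf")
    # Randomly allocate parts of the series pool to each model
    assignments = {m: random.sample(series_pool, len(series_pool) // 2)
                   for m in models}

    for _ in range(max_rounds):
        for model, subset in assignments.items():
            model.fit(subset)                        # hypothetical API

        # Rank models by per-series validation error; keep the top N
        specialists = {s: sorted(models, key=lambda m: m.error(s))[:top_n]
                       for s in series_pool}

        error = sum(min(m.error(s) for m in specialists[s])
                    for s in series_pool)
        if error >= best_error:                      # stop when validation
            break                                    # error starts increasing
        best_error = error
    return specialists
```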
So… what didn’t work?

  • The forecasts for the monthly, quarterly & yearly series were more accurate, because the author had concentrated on improving these larger subsets
  • The daily & weekly forecasts were of lower quality. This was later improved by increasing the learning rate for the smoothing coefficients.


Aakash Gupta

AI/ML practitioner, cloud specialist & multiple hackathon winner. For consulting assignments, reach out to me — aakash@thinkevolveconsulting.com