M4 Competition winner — Using ES with RNNs for time series forecasting

Aakash Gupta
3 min read · Jun 29, 2020

Link to the original article
GitHub link for the code

The winning solution was unique in its approach: it used a hybrid forecasting method that mixed exponential smoothing-inspired formulas, used for deseasonalizing and normalizing the series, with an advanced neural network.

The 3 main components of this solution were:

  1. De-seasonalization & Adaptive normalization
  2. Generation of forecasts using RNNs; and
  3. Ensembling

The data flow can be visualized as follows:

1. Exponential smoothing, de-seasonalization & adaptive normalization

  • All M4 series have positive values, so Holt and Holt-Winters models with multiplicative seasonality were used

  • These were simplified by removing the linear trend
  • Each series was pre-processed at each epoch, using a standard approach of constant-size, rolling input & output windows
  • The output window size equals the forecasting horizon; the input window size equals the size of one season
  • Input & output windows were normalized by dividing them by the last value of the level in the input window
  • Seasonal time series were then divided by the seasonality component
  • A squashing function, log(), was then applied to reduce the impact of outliers
  • The domain information was added to the time series features as a one-hot encoded vector (the preprocessing is sketched below)
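
A minimal NumPy sketch of this per-window preprocessing, assuming the exponential smoothing step has already produced per-step level and seasonality arrays; the function names and windowing layout are illustrative, not the author's actual code:

```python
import numpy as np

def preprocess_window(series, level, seasonality, in_size, out_size, start):
    """Sketch of the per-window preprocessing (hypothetical helper).

    series, level, seasonality: 1-D arrays of equal length from the ES step.
    Returns log-squashed, level-normalized, deseasonalized input/output windows.
    """
    x = series[start : start + in_size]
    y = series[start + in_size : start + in_size + out_size]

    # Normalize both windows by the last level value of the input window
    last_level = level[start + in_size - 1]
    x, y = x / last_level, y / last_level

    # Divide seasonal series by their seasonality component
    x = x / seasonality[start : start + in_size]
    y = y / seasonality[start + in_size : start + in_size + out_size]

    # log() squashing to reduce the impact of outliers
    return np.log(x), np.log(y)

def add_domain(x, domain_idx, n_domains=6):
    """Append the series' domain as a one-hot vector (illustrative)."""
    one_hot = np.zeros(n_domains)
    one_hot[domain_idx] = 1.0
    return np.concatenate([x, one_hot])
```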

2. Generating forecasts using RNNs

  • The neural networks are dilated LSTM-based stacks, sometimes followed by a non-linear layer & always followed by a linear adapter layer (see the sketch after this list)
  • The advantage is that instead of using only the previous hidden state, weights are applied to a number of past hidden states, akin to an attention mechanism
  • The figure below shows three example configurations: the first generates point forecasts (PFs) for the quarterly series; the second, PFs for the monthly series; and the third, prediction intervals (PIs) for the yearly series
  • One important implementation detail was the use of an adapted loss function
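
As a rough illustration of such a stack, here is a minimal PyTorch sketch. The dilation-by-subsampling trick, the layer sizes, the dilation factors, and the class name are all assumptions for illustration (and the attention over past hidden states is omitted); this is not the author's actual architecture:

```python
import torch
import torch.nn as nn

class DilatedLSTMStack(nn.Module):
    """Sketch of a dilated LSTM stack with a linear adapter head.

    Layer i sees every dilations[i]-th time step, a crude approximation
    of dilated recurrent connections.
    """
    def __init__(self, input_size, hidden_size, output_size, dilations=(1, 2, 4)):
        super().__init__()
        self.dilations = dilations
        self.cells = nn.ModuleList()
        for i in range(len(dilations)):
            in_size = input_size if i == 0 else hidden_size
            self.cells.append(nn.LSTM(in_size, hidden_size, batch_first=True))
        self.adapter = nn.Linear(hidden_size, output_size)  # linear adapter layer

    def forward(self, x):
        # x: (batch, time, features)
        out = x
        for lstm, d in zip(self.cells, self.dilations):
            sub = out[:, ::d, :]            # dilation via subsampling every d-th step
            h, _ = lstm(sub)
            # Upsample back by repetition so the layers can stack (sketch only)
            out = h.repeat_interleave(d, dim=1)[:, : out.size(1), :]
        return self.adapter(out[:, -1, :])  # forecast from the last hidden state
```

For example, `DilatedLSTMStack(input_size=1, hidden_size=32, output_size=8)` applied to a `(16, 24, 1)` batch of normalized windows yields a `(16, 8)` tensor of horizon-8 forecasts.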

Since the inputs of the NN were normalized, the author postulated that the loss function need not be normalized either. A pinball function was therefore used, with τ set to 0.5.

With the pinball function defined as

$$L_\tau(y, \hat{y}) = \begin{cases} \tau\,(y - \hat{y}), & y \ge \hat{y} \\ (1 - \tau)\,(\hat{y} - y), & \hat{y} > y \end{cases}$$

This makes the loss function asymmetric, penalizing values above & below the quantile differently, and thus overcoming the positive bias (which was probably due to the use of the squashing function).
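
A minimal NumPy sketch of this loss; the vectorized form and function name are illustrative, not the author's exact code:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau=0.5):
    """Pinball (quantile) loss.

    Under-predictions (y_true > y_pred) are weighted by tau,
    over-predictions by (1 - tau), so tau != 0.5 skews the penalty.
    """
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))
```

With τ = 0.5 this reduces to half the mean absolute error; a τ below 0.5 penalizes over-prediction more heavily, which is how a positive bias can be counteracted.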

3. Ensembling

  • Concurrently-trained models were used instead of a single model
  • This approach is known as an “ensemble of specialists”
  • A pool of models was created, and parts of the time series were randomly allocated to each model
  • Each model was trained on its allocated subset
  • Models were ranked by their performance on each series, and the top N best models were selected for each time series
  • These steps were repeated until the validation error started increasing (sketched below)
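
A schematic sketch of this training loop, assuming hypothetical `fit` and `error` methods on the model objects rather than the author's actual code:

```python
import random

def ensemble_of_specialists(models, series_pool, top_n, max_rounds=20):
    """Sketch of the "ensemble of specialists" procedure.

    models: list of independently trainable forecasting models.
    series_pool: list of time series.
    Returns, per series, its top-N best-performing models.
    """
    best_error = float("inf")
    # Randomly allocate parts of the series pool to each model
    assignments = {m: random.sample(series_pool, len(series_pool) // 2)
                   for m in models}

    for _ in range(max_rounds):
        for model, subset in assignments.items():
            model.fit(subset)                        # hypothetical API

        # Rank models by per-series validation error; keep the top N
        specialists = {s: sorted(models, key=lambda m: m.error(s))[:top_n]
                       for s in series_pool}

        error = sum(min(m.error(s) for m in specialists[s])
                    for s in series_pool)
        if error >= best_error:                      # stop when validation
            break                                    # error starts increasing
        best_error = error
    return specialists
```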
So… what didn’t work?

  • The forecasts for the monthly, quarterly & yearly series were more accurate, because the author had concentrated on improving these larger subsets
  • The daily & weekly forecasts were of lower quality. This was later improved by increasing the learning rate for the smoothing coefficients.


Aakash Gupta

AI/ML practitioner, cloud specialist & multiple hackathon winner. For consulting assignments, reach out to me — aakash@thinkevolveconsulting.com