Forecasting in Python with ESRNN model

azul garza ramirez · Published in Analytics Vidhya · Jun 16, 2020

M4 Competition and Background

Deep Learning algorithms enjoy success in a variety of tasks ranging from image classification to natural language processing, and their use in time series forecasting has also begun to spread. In the recent M4 forecasting competition, a novel hybrid model combining Machine Learning (Deep Learning) and classic time series methods, the Exponential Smoothing Recurrent Neural Network (ESRNN), won by a large margin over baselines and complex time series ensembles.

In this post, we introduce the model and show how to use it through a PyTorch implementation that achieves state-of-the-art performance on the M4 competition:

  • The GPU implementation achieves a 300x speedup over Smyl's original C++ implementation, which uses the Dynet library.
  • The model can easily be used on new (non-M4) data, since our class was built to mimic scikit-learn models, with fit and predict methods.

For anyone interested in exploring the model further, the package is available at https://pypi.org/project/ESRNN/ and on the following GitHub page: https://github.com/kdgutier/esrnn_torch.

Model

The premise of this model is simple, yet intuitive and appealing. The model cleverly combines the classic Exponential Smoothing (ES) model and a Recurrent Neural Network (RNN). The ES decomposes the time series into level, trend and seasonality components. The RNN is trained on all the series with shared parameters and is used to learn common local trends among the series, while the ES parameters are specific to each time series. The two are combined by using the output of the RNN as the local trend component of the ES model.

One main challenge of this idea is that local trends are not directly observed. Also, for the output of the RNN to be meaningful, the trends must be comparable between series. The model addresses this by normalizing and deseasonalizing the series with the level and seasonality given by the ES decomposition. This preprocessing is therefore an integral part of the algorithm rather than a step that takes place before training. Another advantage of the RNN is that it allows for exogenous variables, which in the M4 example correspond to dummies of the series category.
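As a conceptual illustration only, not the package's implementation, a stripped-down version of this combination, with a multiplicative Holt-Winters-style level and seasonality and a placeholder rnn callable standing in for the shared network, might look like this:

import numpy as np

def es_decompose(y, m, alpha=0.5, gamma=0.5):
    # Multiplicative exponential smoothing on a strictly positive series (as in M4):
    # returns per-step levels and seasonal indices; m is the seasonal period.
    # In ESRNN, alpha and gamma are learned per series; here they are fixed.
    seas = list(np.ones(m))              # seasonal indices for the first m steps
    levels, prev_level = [], y[0]
    for t in range(len(y)):
        level = alpha * y[t] / seas[t] + (1 - alpha) * prev_level
        seas.append(gamma * y[t] / level + (1 - gamma) * seas[t])   # index for step t + m
        levels.append(level)
        prev_level = level
    return np.array(levels), np.array(seas)

def esrnn_forecast(y, m, horizon, rnn):
    # Sketch of the composition: deseasonalize and normalize with the ES components,
    # let the (shared) RNN model the remaining local trend in log space,
    # then rescale by the last level and the future seasonal indices.
    y = np.asarray(y, dtype=float)
    levels, seas = es_decompose(y, m)
    x = np.log(y / (levels * seas[:len(y)]))            # normalized, deseasonalized input
    z = rnn(x, horizon)                                 # placeholder for the shared RNN
    future_seas = np.tile(seas[len(y):], horizon // m + 1)[:horizon]
    return np.exp(z) * levels[-1] * future_seas

# With a dummy "RNN" that predicts no local trend, the forecast collapses to
# level times seasonality, i.e. a smoothed seasonal-naive baseline:
dummy_rnn = lambda x, horizon: np.zeros(horizon)
# e.g. esrnn_forecast(y, m=24, horizon=48, rnn=dummy_rnn)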

Regarding the architecture of the RNN, Smyl proposed using different architectures depending on the frequency of the data. The basic architecture is a dilated RNN with LSTM cells, which reduces the number of parameters while allowing more layers to be stacked. For series without obvious seasonality, such as the yearly data, an attention layer is added. More information on these architectures can be found in the references.
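Dilated RNNs are not a standard PyTorch layer; purely to illustrate the idea that the recurrent connection at dilation d reaches d steps back instead of one, a minimal cell-level sketch (again, not the package's implementation) could be:

import torch
import torch.nn as nn

class DilatedLSTMLayer(nn.Module):
    # One LSTM layer whose recurrent state comes from `dilation` steps back.
    def __init__(self, input_size, hidden_size, dilation):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.dilation = dilation
        self.hidden_size = hidden_size

    def forward(self, x):
        # x: (seq_len, batch, input_size)
        seq_len, batch, _ = x.shape
        zeros = x.new_zeros(batch, self.hidden_size)
        h_hist, c_hist, outputs = [], [], []
        for t in range(seq_len):
            h_prev = h_hist[t - self.dilation] if t >= self.dilation else zeros
            c_prev = c_hist[t - self.dilation] if t >= self.dilation else zeros
            h, c = self.cell(x[t], (h_prev, c_prev))
            h_hist.append(h)
            c_hist.append(c)
            outputs.append(h)
        return torch.stack(outputs)

# Stacking layers with increasing dilations (e.g. 1, 4, 24, 168 for hourly data)
# lets the network reach far into the past with few parameters.
layer = DilatedLSTMLayer(input_size=1, hidden_size=8, dilation=4)
out = layer(torch.randn(24, 16, 1))   # (seq_len=24, batch=16, features=1)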

Loss function

The ESRNN model optimizes two losses: first, the quantile (pinball) loss, whose minimizer is the corresponding quantile of the target variable, and second, a penalty on the variance, or wiggliness, of the predictions that acts as a regularizer. The quantile loss is given by:
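For a quantile level \tau \in (0, 1), actual value y and prediction \hat{y}, the quantile (pinball) loss can be written as:

L_{\tau}(y, \hat{y}) =
  \begin{cases}
    \tau \, (y - \hat{y})       & \text{if } y \geq \hat{y} \\
    (1 - \tau) \, (\hat{y} - y) & \text{if } y < \hat{y}
  \end{cases}

With \tau = 0.5 this reduces to half the absolute error, which is why training on the median quantile is the usual default.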

The quantile loss makes the model predict the conditional quantiles of the target distribution; it is robust and does not make distributional assumptions. Usually the model is trained to fit the median, but if the model consistently underestimates or overestimates the target values, the quantile can be changed accordingly.

Example on M4 data

Usage Example

The library can be installed from the Python Package Index with:

pip install ESRNN

The library also includes some utilities that allow us to easily experiment with the model. The prepare_m4_data function downloads data from the M4 competition and formats it so it can be used directly with the model. In particular, it also returns the predictions of the Naive2 benchmark; these predictions can be used to evaluate each iteration of the ESRNN through the Overall Weighted Average. Here we obtain the 414 hourly time series of the M4 data, which are stored in the './data' folder:
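A sketch of the call, following the package README (argument names may differ between versions):

from ESRNN.m4_data import prepare_m4_data

# download the M4 Hourly data into ./data and return long-format dataframes
X_train_df, y_train_df, X_test_df, y_test_df = prepare_m4_data(dataset_name='Hourly',
                                                               directory='./data',
                                                               num_obs=414)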

Successfully downloaded M4-info.csv 4335598 bytes.
Successfully downloaded Train/Daily-train.csv 95765153 bytes.
Successfully downloaded Train/Hourly-train.csv 2347115 bytes.
Successfully downloaded Train/Monthly-train.csv 91655432 bytes.
Successfully downloaded Train/Quarterly-train.csv 38788547 bytes.
Successfully downloaded Train/Weekly-train.csv 4015067 bytes.
Successfully downloaded Train/Yearly-train.csv 25355736 bytes.
Successfully downloaded Test/Daily-test.csv 576459 bytes.
Successfully downloaded Test/Hourly-test.csv 132820 bytes.
Successfully downloaded Test/Monthly-test.csv 7942698 bytes.
Successfully downloaded Test/Quarterly-test.csv 1971754 bytes.
Successfully downloaded Test/Weekly-test.csv 44247 bytes.
Successfully downloaded Test/Yearly-test.csv 1486434 bytes.


Preparing Hourly dataset
Preparing Naive2 Hourly dataset predictions

The model is built to function similarly to scikit-learn models. It is instantiated as follows (for a detailed description of the parameters, see the documentation):
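A minimal sketch using a handful of the documented hyperparameters; the values below are illustrative, not the tuned hourly configuration:

from ESRNN import ESRNN

model = ESRNN(max_epochs=5,              # matches the five epochs logged below
              batch_size=32,
              learning_rate=1e-3,
              seasonality=[24, 168],     # daily and weekly periods for hourly data
              input_size=24,
              output_size=48,            # M4 hourly forecast horizon
              device='cpu')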

The model is trained with the fit method. This method receives the training data as pandas dataframes X_df and y_df in long format and, optionally, X_test_df and y_test_df to compute out-of-sample performance; if the test set is passed, the method also reports the out-of-sample loss during training.

The 'X' and 'y' dataframes must contain the same values in the 'unique_id' and 'ds' columns and must be balanced, i.e. there can be no gaps between dates for the given frequency.

How often this test loss is computed and reported can be changed with the freq_of_test hyperparameter.

model.fit(X_train_df, y_train_df)

Infered frequency: H
=============== Training ESRNN ===============
========= Epoch 0 finished =========
Training time: 50.14884
Training loss (50 prc): 0.70241
========= Epoch 1 finished =========
Training time: 51.24384
Training loss (50 prc): 0.59290
========= Epoch 2 finished =========
Training time: 51.81561
Training loss (50 prc): 0.53481
========= Epoch 3 finished =========
Training time: 52.64761
Training loss (50 prc): 0.49683
========= Epoch 4 finished =========
Training time: 50.96984
Training loss (50 prc): 0.46950
Train finished!

Finally, the predictions are obtained with the predict method. Furthermore, the package has a dedicated function, evaluate_prediction_owa, to calculate the OWA of the predictions.
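A sketch of these calls; the evaluate_prediction_owa signature follows the package README and may differ between versions, and naive2_seasonality=24 is simply the choice for hourly data:

from ESRNN.utils_evaluation import evaluate_prediction_owa

# forecasts for the test horizon, returned as a long-format dataframe
y_hat_df = model.predict(X_test_df)

# OWA, MASE and sMAPE of the forecasts relative to the Naive2 benchmark
final_owa, final_mase, final_smape = evaluate_prediction_owa(y_hat_df, y_train_df,
                                                             X_test_df, y_test_df,
                                                             naive2_seasonality=24)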

===============  Model evaluation  ==============
OWA: 0.987
SMAPE: 15.623
MASE: 2.69

The package also includes a function to plot the predictions.
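As a generic stand-in for that built-in helper, and assuming the long-format columns unique_id, ds, y and y_hat used above, one series can be plotted with a few lines of matplotlib:

import matplotlib.pyplot as plt

def plot_series(y_df, y_hat_df, uid):
    # Plot observed values and ESRNN forecasts for a single series, given
    # long-format dataframes with columns ['unique_id', 'ds', 'y'] / ['unique_id', 'ds', 'y_hat'].
    actual = y_df[y_df['unique_id'] == uid]
    forecast = y_hat_df[y_hat_df['unique_id'] == uid]
    plt.figure(figsize=(10, 4))
    plt.plot(actual['ds'], actual['y'], label='observed')
    plt.plot(forecast['ds'], forecast['y_hat'], label='ESRNN forecast')
    plt.title(uid)
    plt.legend()
    plt.show()

plot_series(y_test_df, y_hat_df, 'H1')   # 'H1' is one of the 414 hourly series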

Comparison with M4 winning submission

Naive2 Forecast

The Naive2 model is a popular benchmark for time series forecasting that automatically adapts to the potential seasonality of a series based on an autocorrelation test. If the series is deemed seasonal, the model composes the Naive and Seasonal Naive predictions; otherwise it falls back to the simple Naive forecast. Following the M4 competition practice, we report the performance of the ESRNN relative to Naive2.
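A heavily simplified sketch of the idea; the official M4 benchmark uses an autocorrelation test at the seasonal lag together with a classical multiplicative decomposition, for which the crude seasonal indices below are only a stand-in:

import numpy as np

def is_seasonal(y, m, z=1.645):
    # 90% significance test on the autocorrelation at the seasonal lag m.
    y = np.asarray(y, dtype=float)
    n = len(y)
    if m <= 1 or n < 3 * m:
        return False
    y_c = y - y.mean()
    acf = lambda k: np.sum(y_c[:-k] * y_c[k:]) / np.sum(y_c ** 2)
    limit = z * np.sqrt((1 + 2 * sum(acf(k) ** 2 for k in range(1, m))) / n)
    return abs(acf(m)) > limit

def naive2(y, m, h):
    # Seasonal case: deseasonalize, carry the last level forward (Naive),
    # then reseasonalize; otherwise fall back to the plain Naive forecast.
    # Assumes a positive-valued series, as in the M4 data.
    y = np.asarray(y, dtype=float)
    if is_seasonal(y, m):
        idx = np.array([y[i::m].mean() for i in range(m)])
        idx = idx / idx.mean()                            # crude seasonal indices
        level = (y / np.tile(idx, len(y) // m + 1)[:len(y)])[-1]
        phases = np.tile(idx, h // m + 2)[len(y) % m : len(y) % m + h]
        return level * phases
    return np.repeat(y[-1], h)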

Overall Weighted Average

To quantify the aggregate errors we use the Overall Weighted Average (OWA) proposed for the M4 competition. This metric averages the symmetric mean absolute percentage error (sMAPE) and the mean absolute scaled error (MASE) over all the time series, each normalized by the corresponding error of the Naive2 predictions. Both sMAPE and MASE are scale independent. These measures are calculated as follows:
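With y_t the actual value, \hat{y}_t the forecast, h the forecast horizon, n the number of in-sample observations and m the seasonal period:

\text{sMAPE} = \frac{2}{h} \sum_{t=n+1}^{n+h} \frac{\left| y_t - \hat{y}_t \right|}{\left| y_t \right| + \left| \hat{y}_t \right|} \times 100

\text{MASE} = \frac{\frac{1}{h} \sum_{t=n+1}^{n+h} \left| y_t - \hat{y}_t \right|}{\frac{1}{n - m} \sum_{t=m+1}^{n} \left| y_t - y_{t-m} \right|}

\text{OWA} = \frac{1}{2} \left( \frac{\text{sMAPE}}{\text{sMAPE}_{\text{Naive2}}} + \frac{\text{MASE}}{\text{MASE}_{\text{Naive2}}} \right)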

The following table shows the OWA obtained by our implementation and by the original model. The results deviate slightly from the original implementation, but are still very competitive on the M4 leaderboard, placing the model among the top 5. Moreover, these results were achieved with a 300x speedup over Smyl's implementation, since we batch the time series during training and the model can be trained on a GPU.

How to contribute

The full code is publicly available on GitHub. To contribute, you can fork the repository and open a PR with your improvements. You can also create issues if you have problems running the model.

Authors

This repository was developed with joint efforts from AutonLab researchers at Carnegie Mellon University and Orax data scientists.

azul garza ramirez

CTO and Co-Founder of Nixtla. I’m interested in Deep Learning applied to time series forecasting. (she/her).