Time Series Forecasting Strategies in ETNA

Dmitry Bunin · Published in IT’s Tinkoff · Mar 24, 2023

Hello, I’m Dmitry and I’m one of the developers of ETNA, a package for time series forecasting. In this article I’ll explain what forecasting strategies are and how to use them in ETNA.

What is a forecasting strategy?

In contrast to solving a machine learning task on static data, time series prediction has a special property: previous values of the series are used as features. These features, which we discussed in a previous article, are often referred to as lags:

Let’s have a look at how the use of lags affects the forecasting process.

Let us imagine that we want to forecast the demand for apples for the next day. We have a table of sales data: the first row contains the day number, the second contains the apple sales.

Table of sales data

To make a forecast we gather sales for the last 5 days and aggregate them into a table for fitting an ML model:

Table with training data
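To make this concrete, here is a minimal sketch of how such a table can be built with pandas. Only the sales for days 6 to 10 (5, 10, 12, 12, 9) are given in the text; the values for days 1 to 5 are made up for illustration.

```python
import pandas as pd

# Hypothetical sales for days 1..10; only days 6..10 come from the text
sales = pd.Series([7, 8, 6, 9, 11, 5, 10, 12, 12, 9],
                  index=pd.RangeIndex(1, 11, name="day"))

# lag_k holds the sales value k days before the current day
df = pd.DataFrame({f"lag_{k}": sales.shift(k) for k in range(1, 6)})
df["target"] = sales

# Rows with a complete set of lags form the training table
train = df.dropna()
```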

Let’s create an algorithm to predict sales for the 11th day (a sketch follows the list):

  • At the end of the 10th day, we collect data for the last 5 days: [5, 10, 12, 12, 9].
  • Create a feature vector for our ML model.
  • Make a prediction for the 11th day.
  • Send the results to the client.
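Here is a minimal end-to-end sketch of this algorithm. LinearRegression is a stand-in: any scalar regressor would do, and the sales values are the same hypothetical ones as above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

sales = pd.Series([7, 8, 6, 9, 11, 5, 10, 12, 12, 9],
                  index=pd.RangeIndex(1, 11, name="day"))
df = pd.DataFrame({f"lag_{k}": sales.shift(k) for k in range(1, 6)})
df["target"] = sales
train = df.dropna()

# Fit the model on rows that have a complete set of lags
feature_cols = [f"lag_{k}" for k in range(1, 6)]
model = LinearRegression().fit(train[feature_cols], train["target"])

# At the end of day 10, the last 5 sales become the features for day 11:
# lag_1 = day 10, lag_2 = day 9, ..., lag_5 = day 6
features = pd.DataFrame([[9, 12, 12, 10, 5]], columns=feature_cols)
day_11_forecast = model.predict(features)[0]
```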

It looks very simple. However, in many cases it is not sufficient to make a prediction for the next day only. We can quickly decide to increase the supply of apples, but we can’t implement that decision immediately: there will be some delay.

Let’s say we can only increase supply in 3 days. A good solution is to change our target variable and predict the sales 3 days ahead of the current one. During training we should get rid of the columns lag_1 and lag_2, because they won’t be available. During the forecast, we collect the sales data of the last 3 days and put them into the columns lag_3, lag_4, lag_5.
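As a sketch, the training table for this shifted setup differs from the previous one only in the set of lags:

```python
import pandas as pd

sales = pd.Series([7, 8, 6, 9, 11, 5, 10, 12, 12, 9],
                  index=pd.RangeIndex(1, 11, name="day"))  # hypothetical values

# Only lag_3..lag_5 are usable: at the end of day 10 we know days 8..10,
# which are exactly lag_3..lag_5 of day 13
df3 = pd.DataFrame({f"lag_{k}": sales.shift(k) for k in range(3, 6)})
df3["target"] = sales
train3 = df3.dropna()
```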

But what if we want to forecast sales for all days from 11 to 13? In this case, our forecast horizon is 3 days. And our initial algorithm, which predicts only 1 value, isn’t up to the task.

This type of problem is usually called multi-step time series forecasting. And now we are ready to understand what a forecasting strategy is within the scope of this article and the ETNA library.

A forecasting strategy is an algorithm for applying an ML model to make a multi-step time series forecast.

Main forecasting strategies

Let’s start by looking at strategies that work with models predicting a scalar value. Almost all (if not all) models in the scikit-learn library fall into this category. To work with a multidimensional target, there is a special multioutput module.

The recursive strategy consists of training one model to make a one-step-ahead forecast and applying it step after step until the desired horizon is predicted. Starting from the second step, we use previous predictions of the model instead of true lag values. The data for prediction looks like this:

Prediction data for recursive strategy

The strategy is easy to understand and allows us to forecast over an unlimited horizon. A potential problem with this approach is that with each step an increasing proportion of features is occupied by predictions instead of true lags, which can lead to the accumulation of errors.
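A minimal sketch of the recursive loop, assuming a fitted one-step model with 5 lag features as in the apple example; recursive_forecast is a hypothetical helper name:

```python
import numpy as np

def recursive_forecast(model, history, horizon, n_lags=5):
    """Apply a one-step-ahead model recursively for `horizon` steps."""
    history = list(history)  # known target values, oldest first
    forecasts = []
    for _ in range(horizon):
        # lag_1 is the most recent value, lag_n the oldest of the window
        lags = history[-n_lags:][::-1]
        prediction = model.predict(np.array([lags]))[0]
        forecasts.append(prediction)
        history.append(prediction)  # becomes a lag for the following steps
    return forecasts
```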

The direct strategy consists of training an independent model for each horizon step and applying the models independently to form a forecast. In our example, this gives us 3 models for forecasting 1, 2 and 3 days ahead.

Every model uses only the lags that will be available during forecasting. For example, a model for forecasting two days ahead can’t use lag_1, but can use lag_2 and higher.
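A sketch of the direct strategy under the same assumptions; the helper names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_direct_models(values, horizon=3, max_lag=5):
    """One model per step h; the model for step h sees only lag_h..lag_max."""
    origins = range(max_lag - 1, len(values) - horizon)
    # column j of the base matrix holds the value j steps before origin T
    X_base = np.array([values[T - max_lag + 1: T + 1][::-1] for T in origins])
    models = {}
    for h in range(1, horizon + 1):
        y = np.array([values[T + h] for T in origins])
        X = X_base[:, : max_lag - h + 1]  # lag_h..lag_max of day T + h
        models[h] = LinearRegression().fit(X, y)
    return models

def direct_forecast(models, history, horizon=3, max_lag=5):
    """Every step is predicted independently, from true values only."""
    history = list(history)
    return [
        models[h].predict(np.array([history[-(max_lag - h + 1):][::-1]]))[0]
        for h in range(1, horizon + 1)
    ]
```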

It doesn’t seem difficult to train 3 models, but what can we do when the horizon consists of 366 steps? Let’s look closer at our example with 3 models. If we take the model trained without lag_1 and lag_2, it can predict all the days from the 11th to the 13th; it just ignores some of the data. This is what the data for prediction looks like:

Prediction data for one model

Let’s call this strategy Simple Direct. We can go even further with this idea and look for an intermediate approach between one model for all steps and a separate model for each step.

We can divide the forecast horizon into disjoint intervals, where each interval has its own forecast model. The closer the interval is to the end of the horizon, the fewer lags can be used by the corresponding model.

Since all predictions are made using true lags, this strategy does not lead to the accumulation of errors that the recursive strategy suffers from. The disadvantage of the direct strategy is that the independence of the predictions can lead to inconsistency between them. Let’s assume that there is some dependency between the predictions for the first and second steps. We cannot guarantee that this dependency will be preserved in the direct strategy forecast.

The DirRec strategy is a combination of the two previous ones. We still train one model for each step, but instead of ignoring unavailable lags, we use the predictions of the previous models. Let’s look at the prediction data for each model.

Prediction data for model 1
Prediction data for model 2
Prediction data for model 3

This approach solves the problem of inconsistent predictions in the direct strategy, because the models are now able to learn the dependencies between the predictions. What’s more, the problem of error accumulation is less pronounced if we keep passing predictions on to subsequent models during training. The disadvantage of this strategy is the complexity of the scheme and its implementation.
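A sketch of DirRec training under the same assumptions: the in-sample predictions of each model are appended to the feature matrix of the next one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_dirrec(values, horizon=3, max_lag=5):
    """Model h predicts y[T+h] from the lags that will be available plus
    the predictions of models 1..h-1, which stand in for the
    not-yet-observed values y[T+1]..y[T+h-1]."""
    origins = range(max_lag - 1, len(values) - horizon)
    X_base = np.array([values[T - max_lag + 1: T + 1][::-1] for T in origins])
    models = []
    extra = np.empty((len(X_base), 0))  # predictions of earlier models
    for h in range(1, horizon + 1):
        y = np.array([values[T + h] for T in origins])
        X = np.hstack([X_base[:, : max_lag - h + 1], extra])
        model = LinearRegression().fit(X, y)
        models.append(model)
        # in-sample predictions of model h feed the features of model h + 1
        extra = np.hstack([extra, model.predict(X).reshape(-1, 1)])
    return models
```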

So far we have looked at models that can only predict a scalar value. But there are models that can predict a whole vector at once, neural networks for example. In this case, the model can predict all the steps of the horizon in a single pass.

The MIMO (Multi-Input Multi-Output) strategy consists of training a model to predict the whole horizon as a single target vector. It doesn’t suffer from the accumulation of errors, because only one prediction step is used. However, if the horizon is large enough, forecast quality may deteriorate, because one model has to capture both short-term and long-term dependencies. Let’s look at an example of training data for such a model.

Training data for multidimensional model
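A minimal MIMO sketch: scikit-learn’s LinearRegression accepts a two-dimensional target, so a single fit covers the whole horizon. Calling predict on the last lag window then returns all the steps in one pass.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_mimo(values, horizon=3, max_lag=5):
    """One model maps the lag window to [y[T+1], ..., y[T+horizon]]."""
    origins = range(max_lag - 1, len(values) - horizon)
    X = np.array([values[T - max_lag + 1: T + 1][::-1] for T in origins])
    Y = np.array([values[T + 1: T + horizon + 1] for T in origins])
    # LinearRegression handles a 2D target natively; single-output models
    # could be wrapped with sklearn.multioutput instead
    return LinearRegression().fit(X, Y)
```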

We can also combine the discussed strategies to form new ones. For example, use the direct strategy for a fraction of the horizon and then continue recursively. Or train separate multidimensional models for different intervals of the horizon, as in the direct strategy.

Key points for the discussed strategies:

Comparison table of the discussed strategies

Forecasting strategies in ETNA

Available strategies in the ETNA library (a construction sketch follows the list):

  • Simple Direct, implemented by Pipeline. Fit one model that uses lags available for all steps in the horizon.
  • Recursive, implemented by AutoRegressivePipeline. Fit one model and apply it recursively.
  • Direct, implemented by DirectEnsemble. Build an ensemble of pipelines with different forecasting horizons. For each horizon step the prediction from the pipeline with the smallest horizon is taken.
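Here is a sketch of how the three strategies can be constructed, assuming a recent version of the ETNA API; the model choice and lag sets are illustrative, not the exact configuration used in the experiments below.

```python
from etna.ensembles import DirectEnsemble
from etna.models import CatBoostMultiSegmentModel
from etna.pipeline import AutoRegressivePipeline, Pipeline
from etna.transforms import LagTransform

HORIZON = 24

# Simple Direct: one model, only lags valid for every step of the horizon
simple_direct = Pipeline(
    model=CatBoostMultiSegmentModel(),
    transforms=[LagTransform(in_column="target", lags=list(range(HORIZON, HORIZON + 12)))],
    horizon=HORIZON,
)

# Recursive: one model applied step by step, lags recomputed from predictions
recursive = AutoRegressivePipeline(
    model=CatBoostMultiSegmentModel(),
    transforms=[LagTransform(in_column="target", lags=list(range(1, 13)))],
    horizon=HORIZON,
    step=1,
)

# Direct: an ensemble of pipelines with increasing horizons; each step is
# taken from the pipeline with the smallest sufficient horizon
direct = DirectEnsemble(
    pipelines=[
        Pipeline(
            model=CatBoostMultiSegmentModel(),
            transforms=[LagTransform(in_column="target", lags=list(range(h, h + 12)))],
            horizon=h,
        )
        for h in (6, 12, 18, 24)
    ]
)
```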

Let’s look at all of them in practice. We will use a variation of the dataset from “The tourism forecasting competition” paper, in which various methods for predicting the demand for tourism in Australia were studied. The resulting dataset is artificial and uses fake timestamps to align all time series.

Download the dataset:

Data description:

We have monthly frequency, 366 segments and 333 months of data. Let’s assemble a simple pipeline for a prediction horizon of 24 months. For evaluation we calculate the average value of the SMAPE metric across 3 folds.
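A sketch of the evaluation, assuming df is a long dataframe with timestamp, segment and target columns built from the downloaded data, and simple_direct is the pipeline from the sketch above:

```python
from etna.datasets import TSDataset
from etna.metrics import SMAPE

# Wrap the long dataframe into a TSDataset with monthly frequency
ts = TSDataset(TSDataset.to_dataset(df), freq="MS")

# 3 folds, each with a 24-month test period
metrics_df, forecast_df, fold_info_df = simple_direct.backtest(
    ts=ts,
    metrics=[SMAPE()],
    n_folds=3,
)
print(metrics_df["SMAPE"].mean())  # average SMAPE over segments and folds
```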

Let’s take a closer look at our evaluation procedure. If we reduce the horizon length by some factor and proportionally increase the number of folds, we cover exactly the same period of time, and therefore the average SMAPE across the new folds can be compared with the average SMAPE across the original folds. For example, 3 folds with a horizon of 24 cover the same 72 months as 6 folds with a horizon of 12.

Armed with this knowledge, let’s see what happens to the metric if we reduce the horizon length and use more recent lags. A horizon of 24 is very convenient for this, since it divides evenly into many shorter horizons.

SMAPE and evaluation time in seconds across different horizon lengths

As we can see, the metric value decreases monotonically. This is expected, because forecasting on a small horizon is easier. We can use these results as a reference point for the following experiments.

Let’s move on to the recursive strategy. In our implementation we introduced an additional parameter, step. To understand its meaning, recall that in a recursive strategy we iteratively predict one step ahead, and during each prediction we recalculate the lags to include previously predicted values.

But what if we could do this recalculation less often? This is where the step parameter comes into play. It defines how many points we forecast without recalculating the lags. For example, let’s assume that step=2. In this case the number of iterations is halved, and each iteration now consists of predicting 2 steps simultaneously. As a downside, we can’t use lag_1, because it won’t be available for the second point of an iteration.

Let’s run an experiment where we increase the step parameter from 1 to 24 to observe its influence. The smaller the step, the more iterations we need to forecast the whole horizon.
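A sketch of this experiment; the tested step values and lag sets are illustrative, and ts is the dataset from the sketch above:

```python
from etna.metrics import SMAPE
from etna.models import CatBoostMultiSegmentModel
from etna.pipeline import AutoRegressivePipeline
from etna.transforms import LagTransform

results = {}
for step in (1, 2, 3, 4, 6, 12, 24):
    pipeline = AutoRegressivePipeline(
        model=CatBoostMultiSegmentModel(),
        # lags smaller than `step` would be unavailable inside an iteration
        transforms=[LagTransform(in_column="target", lags=list(range(step, step + 12)))],
        horizon=24,
        step=step,
    )
    metrics_df, _, _ = pipeline.backtest(ts=ts, metrics=[SMAPE()], n_folds=3)
    results[step] = metrics_df["SMAPE"].mean()
```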

SMAPE and evaluation time in seconds for recursive strategy across different step values

With step=24 we get the Simple Direct strategy. The metric improves until step=12; after that the difference doesn’t look significant. Evaluation time increases as we decrease the step, since more iterations are needed.

Let’s compare it with the direct strategy. We will split the horizon into disjoint equal-sized intervals and train a separate model for each interval. The parameter horizon_step represents the size of an interval. The smaller the horizon_step, the more models we need to forecast the whole horizon.
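A sketch of this experiment with DirectEnsemble, under the same assumptions as above:

```python
from etna.ensembles import DirectEnsemble
from etna.metrics import SMAPE
from etna.models import CatBoostMultiSegmentModel
from etna.pipeline import Pipeline
from etna.transforms import LagTransform

HORIZON = 24
results = {}
for horizon_step in (1, 2, 3, 4, 6, 12, 24):
    # one pipeline per interval of `horizon_step` consecutive steps
    pipelines = [
        Pipeline(
            model=CatBoostMultiSegmentModel(),
            # a pipeline with horizon h may only use lag_h and deeper
            transforms=[LagTransform(in_column="target", lags=list(range(h, h + 12)))],
            horizon=h,
        )
        for h in range(horizon_step, HORIZON + 1, horizon_step)
    ]
    ensemble = DirectEnsemble(pipelines=pipelines)
    metrics_df, _, _ = ensemble.backtest(ts=ts, metrics=[SMAPE()], n_folds=3)
    results[horizon_step] = metrics_df["SMAPE"].mean()
```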

SMAPE and evaluation time in seconds for direct strategy across different horizon_step values

Improvements after horizon_step=12 are quite small. With horizon_step=1 we get the smallest metric value across all the tested strategies (the experiment with smaller horizon values doesn’t count), but it comes with the biggest time cost. Compared to the previous experiment, evaluation time grows more significantly: it is essentially proportional to the number of trained models.

The choice of a particular strategy depends on the client’s requirements. Let’s summarize the most interesting results in a table:

Conclusion

We learned what a forecasting strategy is, looked at the advantages and disadvantages of the main forecasting strategies, tested the strategies implemented in ETNA on the example dataset, and observed the improvement of the SMAPE metric.

If you want to suggest a new feature, ask a question or recommend a topic for an article, welcome to our GitHub — all contacts are there. Feel free to leave a star.

For those who want to dive deeper into the topic, here are a couple of papers:

  • Bontempi, Gianluca, Souhaib Ben Taieb, and Yann-Aël Le Borgne. “Machine learning strategies for time series forecasting.” Business Intelligence: Second European Summer School, eBISS 2012, Brussels, Belgium, July 15–21, 2012, Tutorial Lectures 2 (2013): 62–77.
  • Taieb, Souhaib Ben, et al. “A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition.” Expert systems with applications 39.8 (2012): 7067–7083.
