ETNA Regressors

Artem Makhin
Published in IT’s Tinkoff
5 min read · Jun 3, 2022

Hi there! I’m Artem, and I help develop ETNA, a time series forecasting package.

Often we need additional information to make predictions more accurate. A holiday calendar helps us predict the rise in champagne sales before the New Year, and knowing that there was a promotion for a strawberry milkshake helps us understand why demand changed and predict demand at the end of the promotion.

If such data are known for some period in the future, they are called regressors. In this tutorial, I will show how regressors can help in time series forecasting and how to work with them in the ETNA library. As an example, we will predict the sales of goods in a store.

Load the data and forecast

Let’s load the data and take a look at it. The data contains daily records of product sales for three years (product.csv) and the schedule of 1000 recurring promotions held by the store (promo.csv), where 1 means the promotion was active on that day and 0 means it was not. Promotion schedules are regressors that can help us predict product sales better. For clarity, we will consider only 6 products.

We will forecast 28 days ahead and measure the quality of predictions with the SMAPE metric. Let’s compare the forecasts with and without regressors to see how useful such data is.
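To make the metric concrete, here is a minimal sketch of SMAPE in plain Python. It follows the common definition on a 0 to 200 scale; the function name and the zero-handling rule are my assumptions for illustration, and ETNA's own SMAPE may differ in such details.

```python
def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for actual, forecast in zip(y_true, y_pred):
        denominator = abs(actual) + abs(forecast)
        # Assumption: a 0/0 term counts as a perfect match and adds nothing.
        if denominator > 0:
            total += abs(actual - forecast) / denominator
    return 200.0 * total / len(y_true)


perfect = smape([10, 20, 30], [10, 20, 30])  # identical series
off = smape([100, 100], [50, 150])           # one under-forecast, one over-forecast
print(perfect, round(off, 2))
```

Lower is better: identical series give 0, and the score grows as forecasts move away from the actual values.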

We will use CatBoostModelPerSegment as the model. PerSegment means that each product gets its own model, and CatBoost means that the model is CatBoost gradient boosting.

We need to generate the features for this model ourselves. Let’s make a list of transforms that generate date features (number of the day in the week and in the month; number of the week in the month and in the year; number of the month in the year) and 11 lag features, from 29 to 49 inclusive.
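In ETNA these features are normally produced by transforms (for example `LagTransform` and `DateFlagsTransform`). The sketch below only illustrates what such features look like, using plain Python. The step of 2 between lags is my assumption: it is one way to get exactly 11 lags between 29 and 49. The week-in-month rule is also a simplification, and the helper names are hypothetical.

```python
from datetime import date

# Assumption: every second lag, which yields exactly 11 lags from 29 to 49.
LAGS = list(range(29, 50, 2))


def date_features(day):
    """Calendar features of a single day."""
    _, iso_week, _ = day.isocalendar()
    return {
        "day_number_in_week": day.weekday(),             # 0 = Monday
        "day_number_in_month": day.day,
        "week_number_in_month": (day.day - 1) // 7 + 1,  # simplified rule
        "week_number_in_year": iso_week,
        "month_number_in_year": day.month,
    }


def lag_features(series, index, lags):
    """Values of `series` taken `lag` steps before `index`; None if history is too short."""
    return {
        f"lag_{lag}": series[index - lag] if index - lag >= 0 else None
        for lag in lags
    }


features = date_features(date(2022, 6, 3))
lagged = lag_features(list(range(60)), index=50, lags=LAGS)
print(LAGS)
print(features)
```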

Let’s run a backtest to calculate the metrics. I’ve already described lags, Pipeline, backtest, and why we can’t choose smaller lags.
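As a reminder of the idea, here is a small sketch of how a backtest with an expanding training window can split a series, and why lags smaller than the horizon are not usable. The number of points (about three years of daily data) and the number of folds are assumptions chosen for illustration, and the helper names are hypothetical rather than ETNA functions.

```python
def expanding_folds(n_points, horizon, n_folds):
    """Return (train_end, test_end) pairs.

    Train is [0, train_end) and test is [train_end, test_end), as positions in the series.
    """
    folds = []
    for fold in range(n_folds):
        test_end = n_points - (n_folds - 1 - fold) * horizon
        train_end = test_end - horizon
        if train_end <= 0:
            raise ValueError("not enough history for the requested folds")
        folds.append((train_end, test_end))
    return folds


def lags_are_usable(lags, horizon):
    """The last forecast day is `horizon` steps after the last known value,
    so a lag must be at least `horizon` to point at an observed value."""
    return all(lag >= horizon for lag in lags)


folds = expanding_folds(n_points=1096, horizon=28, n_folds=3)
print(folds)
print(lags_are_usable(range(29, 50, 2), horizon=28))
print(lags_are_usable([7, 29], horizon=28))
```

Each fold trains on everything before its test window, so later folds see more history than earlier ones.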

Let’s visualize the results. The model does not predict small changes in sales well.

Regressors

Regressors are additional time series whose values are known not only in the past but also in the future. Such series can be used to extract more information about the process and improve forecasts.

Regressors can be either predetermined or predictable. Holidays and planned promotions are predetermined: we always know exactly when and what event will take place. Predictable regressors include the company’s forecasts for the number of employees, the number of customers, or sales volumes. Here we do not know the exact future values, but we can use expert opinion to estimate them.
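The defining property of a regressor, being known for the future, can be checked directly. The sketch below is plain Python with a hypothetical helper name: it tests whether an additional series has a value for every day of the forecast horizon after the target ends. The dates are made up for illustration.

```python
from datetime import date, timedelta


def covers_horizon(last_target_day, regressor_days, horizon):
    """True if the series has a value for each of the `horizon` days after the target ends."""
    needed = {last_target_day + timedelta(days=step) for step in range(1, horizon + 1)}
    return needed.issubset(set(regressor_days))


start = date(2022, 1, 1)
history = [start + timedelta(days=i) for i in range(100)]       # days with known sales
promo_days = [start + timedelta(days=i) for i in range(128)]    # history + 28 future days
short_days = [start + timedelta(days=i) for i in range(110)]    # only 10 future days

full_ok = covers_horizon(history[-1], promo_days, horizon=28)
short_ok = covers_horizon(history[-1], short_days, horizon=28)
print(full_ok, short_ok)
```

A series that stops before the end of the horizon cannot serve as a regressor for a 28-day forecast.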

Let’s go back to our in-store sales forecasting problem. The schedule of promotions is already known, both in the past and in the future. We want to use this information to make a prediction. But we do not know whether a particular promotion affects any given product, and if it does, how exactly. Therefore, let’s use all 1000 promotions without selection.

To use promotions data in ETNA, it must be converted to the ETNA format.

When creating a TSDataset, we pass the additional data in df_exog, and in known_future we pass a list of the additional series that are known in the future (that is, the regressors). In this case all additional data are regressors, so “all” can be specified.

We now have a TSDataset that contains the original sales series and all additional data. Models that can work with additional data will automatically use all available information from the TSDataset.

Forecast with regressors

To make a forecast with regressors, no additional steps are needed: everything was already done when we created the TSDataset. We do everything exactly the same as in the case without regressors.

Let’s visualize our forecasts with regressors. In this case we got more accurate forecasts both metrically and visually. The model was able to extract the right information from our regressors and use it to make a more accurate prediction.

Metrics comparison:

Let’s look at feature importance. For this there is the function plot_feature_relevance, which takes as input a method for calculating importance. Here the method is ModelRelevanceTable, which means that the importance is calculated by a model, and we choose CatBoostRegressor as that model. The importance of each feature obtained from CatBoostRegressor is normalized so that the importances of all features sum to 100.
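The normalization itself is simple arithmetic. Here is a sketch in plain Python with made-up raw scores and hypothetical feature names; it shows only the rescaling to a total of 100, not how CatBoost computes the raw values.

```python
def normalize_to_100(importances):
    """Rescale non-negative importances so that they sum to 100."""
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importances must have a positive sum")
    return {name: 100.0 * value / total for name, value in importances.items()}


# Made-up raw scores, for illustration only.
raw = {"promo_438": 3.0, "promo_255": 1.0, "lag_29": 1.0}
normalized = normalize_to_100(raw)
print(normalized)
```

After rescaling, each value reads as a share of the total importance, which makes features comparable across products.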

We can get some insights from the feature importances. For example, item 0 is heavily affected by promotions 438 and 255, while item 7 is heavily influenced by the month and week numbers of the year (sales of this item probably have a strong annual seasonality).

Results

Now we’ve seen:

  • What regressors are
  • How to work with them within the framework of the ETNA library

In the next articles, we will discuss useful features of the ETNA library. If you’d like to propose a new feature or ask a question, you are welcome to visit our GitHub page.

How to get the data used (hidden section)

Sales and promotions data are taken from source. They are artificial. There are no dates in the data, so we set the dates for the series ourselves. Below is the code to get the promo.csv and products.csv files.
