Time series forecasting with ETNA: first steps

Andrey Alekseev
Published in IT’s Tinkoff · Apr 8, 2022 · 10 min read

Hi! My name is Andrey, and I’m the development lead of ETNA, a time series forecasting package. We already showed how to start forecasting time series with ETNA in our previous article here on Medium.

In this tutorial, I want to show how to use ETNA for simple time series analysis and introduce several feature engineering techniques already built into ETNA. As an example, I will use the Tabular Playground Series - Jan 2022 dataset from Kaggle.

In this Kaggle competition, participants must predict merchandise sales (mugs, hats, and stickers) at two imaginary store chains across three Scandinavian countries. The forecast must cover a full year ahead, and the quality of predictions is measured by the SMAPE metric.

Load the data

First of all, let’s load the data and take a look at it. I will use Pandas.
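A minimal loading sketch; the file path is an assumption, so adjust it to wherever the competition files live:

```python
import pandas as pd

# load the competition's training data (path is an assumption)
df = pd.read_csv("train.csv", parse_dates=["date"])
print(df.head())
```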

There are several columns: date, num_sold, country, store, and product. The competition description says that we need to forecast every country-store-product combination, so we should find out how many combinations we have. We also need to make our dataset suitable for the ETNA package: ETNA expects a special column that distinguishes the different time series. For that I will create a new column called “segment” (the name ETNA expects), built as the combination of the country, store, and product columns. This is exactly what we need to forecast. Finally, let’s check how many combinations we got. It turns out 18: there are 3 products in 2 store chains, each of which operates in 3 countries.
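Something along these lines should do it, assuming the raw column names from the competition files:

```python
# one segment per country-store-product triple: the column name ETNA expects
df["segment"] = df["country"] + "_" + df["store"] + "_" + df["product"]
print(df["segment"].nunique())  # 18
```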

Let us look at what we got:

Looks a lot better, but the dataset is still a few steps away from the ETNA format. I’m going to fix that:
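A sketch of the renaming step:

```python
# rename to the column names ETNA reserves and keep only what we need
df = df.rename(columns={"date": "timestamp", "num_sold": "target"})
df = df[["timestamp", "segment", "target"]]
```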

This is the format required by the ETNA package: target is the reserved name for the column we want to forecast, and timestamp is the column that stores the timestamps. We don’t call it “date” because ETNA knows how to work with hourly and weekly data as well.

To sum up, timestamp, segment, and target are compulsory columns for ETNA. However, ETNA can also work with exogenous data and we will show how to use it correctly in one of the next tutorials.

We are almost ready to analyze and forecast the data. To do that, we need to put it inside the TSDataset class. TSDataset expects data in a multi-index format, but we can easily reformat the data with the TSDataset.to_dataset() method.

In our practice, it is easier to store data in this wide, multi-index format.

Finally, we create a new TSDataset instance.
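A sketch of both steps:

```python
from etna.datasets import TSDataset

# convert the long dataframe to the wide multi-index format, then wrap it
df_wide = TSDataset.to_dataset(df)
ts = TSDataset(df_wide, freq="D")  # daily data
```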

It is fair to ask why we need to do so much work converting our data from one format to another. The answer is that TSDataset allows convenient indexing, validates the data, works with the other parts of the package, has built-in analysis methods, and generates the features necessary for prediction. I will show how TSDataset can be used in time series analysis and forecasting.

And here is the first essential TSDataset feature: indexing. I can easily index by timestamp, segment, and individual column name.
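For example (the segment name is a hypothetical instance of the country_store_product pattern, and the dates assume the competition’s 2015-2018 training span):

```python
# target values of one segment over the first ten days of the data
ts["2015-01-01":"2015-01-10", "Finland_KaggleMart_Kaggle Mug", "target"]
```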

Time series analysis

After I’ve put my data into a TSDataset, I’d like to look at it. First of all, I will run the describe method. It shows basic info about the dataset: how many time series there are, the start and end dates, the series length, and the number of missing values.
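For instance:

```python
# per-segment summary: start/end timestamps, length, number of missing values
print(ts.describe())
```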

As we can see, this dataset is perfect: there are no missing values, and all the series end on the same date.

We can also visualize the data using the plot method.
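For example, to draw a random sample of four segments:

```python
ts.plot(n_segments=4)
```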

The series has yearly seasonality, and the peaks are not arranged randomly: they could be holidays of some sort.

Let’s zoom in to see if the series has monthly or weekly seasonality.
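One way to zoom in is to slice the underlying wide dataframe and plot a couple of winter months for a few segments; the date range again assumes the 2015-2018 training span:

```python
import matplotlib.pyplot as plt

df_wide = ts.to_pandas()
for segment in ts.segments[:4]:
    df_wide.loc["2015-01-01":"2015-02-28", (segment, "target")].plot(label=segment)
plt.legend()
plt.show()
```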

By analyzing four segments, it is safe to say that weekly seasonality is present.

Feature generation

In this section I’d like to focus on features that could be applied to time series.

Lags

A lag is a previous value of the time series. For example, the first lag is yesterday’s value, and the fifth lag is the value from five days ago. Lags are essential for regression models, like linear regression or boosting, because they allow these models to grasp information about the past.

The model tries to predict the next value by looking at the lags — the previous values of the series.

Let’s try ETNA’s LagTransform on our data: we import it from the ETNA package and generate the first lag by setting the parameter lags=[1]. The result is a new column with the lag we wanted, and LagTransform is applied to all time series in our TSDataset.
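A minimal sketch (the out_column prefix is an arbitrary name):

```python
from etna.transforms import LagTransform

# a single one-day lag applied to every segment at once
lag = LagTransform(in_column="target", lags=[1], out_column="lag")
ts.fit_transform([lag])
```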

If we need to generate several lags, that is also easy with ETNA: all we need is to specify all the necessary lags in a list.

And for more advanced users, the lags parameter can be set using the range function or a list comprehension.
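Both variants side by side:

```python
from etna.transforms import LagTransform

# an explicit list of lags
lags = LagTransform(in_column="target", lags=[1, 2, 3, 7], out_column="lag")

# the same idea with range: lags 1 through 7
lags = LagTransform(in_column="target", lags=list(range(1, 8)), out_column="lag")
```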

Statistics

Statistics are another essential kind of feature. They are also useful for regression models because they let them look at information about the past in a different way than lags do. I’m talking about the mean, median, standard deviation, minimum, and maximum over an interval.

I will show how this works using MeanTransform, which is also easy to use: to get it up and running, we just need to specify one parameter, window. Let’s do that!
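A sketch (the out_column name is arbitrary):

```python
from etna.transforms import MeanTransform

# rolling mean over a window of 5 values
mean = MeanTransform(in_column="target", window=5, out_column="mean")
ts.fit_transform([mean])
```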

But what does this parameter mean? The window specifies how many previous values we want to average. Let’s go through it step by step.

Step 1. MeanTransform wants to average the 5 values up to and including the first value, 18, but there are no values before it, so the average is just 18.

Step 2. Before 26, including itself, we have only the values 26 and 18, so MeanTransform averages just those two. And so on, up until step 5.

Step 5. At this step MeanTransform averages all five values.

For all later steps, MeanTransform averages the five most recent values, including the current one. Why five? Because that is the window we chose.
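The same arithmetic can be reproduced with plain pandas; the first two values (18 and 26) come from the example above, the rest are made up for illustration:

```python
import pandas as pd

values = pd.Series([18, 26, 30, 22, 24, 28, 20])
# window of 5, but shorter windows are allowed at the start of the series
print(values.rolling(window=5, min_periods=1).mean())
# step 1 -> 18.0; step 2 -> (18 + 26) / 2 = 22.0; step 5 -> mean of the first five values
```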

This feature is useful when we want to pass information about the mean value over the last week or month. But we may also want the average of, say, every seventh day, which is useful if we expect weekly seasonality in our series. Fortunately, ETNA makes this easy too.
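Here is a sketch for every second day (the out_column name is arbitrary):

```python
from etna.transforms import MeanTransform

# average 2 points taken every 2nd day
mean_seasonal = MeanTransform(in_column="target", window=2, seasonality=2, out_column="mean_2_by_2")
ts.fit_transform([mean_seasonal])
```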

Here we indicated that we want to average 2 points that go in increments of 2.

Dates

The time series also has the timestamp column, which we have not used yet. But the day number within the week or month, the week number within the year, and a weekend flag can all be really useful for a machine learning model, and ETNA lets us extract this information with DateFlagsTransform.
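Let’s apply it to our dataset. A sketch with the four flags I chose (the out_column prefix is an arbitrary name):

```python
from etna.transforms import DateFlagsTransform

# four of the nine available flags; the others stay disabled
date_flags = DateFlagsTransform(
    day_number_in_week=True,
    day_number_in_month=True,
    week_number_in_year=True,
    is_weekend=True,
    out_column="date_flag",
)
ts.fit_transform([date_flags])
```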

We do not need to set the column name this transform applies to; setting the flags we are interested in is enough. There are 9 available flags, and I chose only 4 of them.

Holidays

HolidayTransform also works with the timestamp column. It uses the holidays package, which knows the public holidays of most countries. Users only need to specify the ISO code of a country, and HolidayTransform is good to go.
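A sketch for Finland, one of the competition’s countries; the alpha-3 ISO code and the out_column name are my assumptions:

```python
from etna.transforms import HolidayTransform

# binary flag marking Finnish public holidays
holiday_fin = HolidayTransform(iso_code="FIN", out_column="holiday_FIN")
ts.fit_transform([holiday_fin])
```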

Logarithm

I’ve already told you about transforms that use the target column and about transforms that use the timestamp column to create new feature columns. However, there is another type of transform that alters a column itself. We call it an “inplace transform”. The simplest one is LogTransform, which takes the logarithm of the values in a column.

We can use the inverse_transform method to undo this transform.

And if we want to get the logarithm of our values and still keep the original column, we can set the parameter inplace=False on the transform.
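All three variants in one sketch; the new column name is mine:

```python
from etna.transforms import LogTransform

# inplace (the default): replaces the target values with their logarithm
log = LogTransform(in_column="target")
ts.fit_transform([log])

# undo the transform
ts.inverse_transform()

# keep the original column and write the logarithm to a new column instead
log_feature = LogTransform(in_column="target", inplace=False, out_column="target_log")
ts.fit_transform([log_feature])
```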

Forecasting

We’ve learned a lot about transforms and how to use them. Let’s use this knowledge to predict sales of the merchandise from Kaggle Tabular Playground Series — Jan 2022.

I will use a linear regression model, and that is why I need to generate features. We will cover the difference between autoregressive and regression models in future tutorials.

As you know, we need to forecast sales a year in advance, so the horizon of our prediction will be 365 days. From the prior analysis we know the series has weekly seasonality, so I will use lags from 365 to 371. But why can’t we just use lags from 1 to 7? Because then the model would not be able to forecast anything: for every step of the horizon beyond 7, those lag features simply would not exist yet. Conclusion: the horizon defines the smallest lag we can use.

Let’s visualize this with a smaller horizon. Imagine the horizon is 3. The model trains on lags and tries to predict the series value. If we use lags 1 and 2, we get our series shifted by 1 and 2 steps. So at horizon step 2 we have no value for lag 1, and at step 3 we have values for neither lag 1 nor lag 2.

So the smallest usable lag is 3. The same logic applies to statistics, which is why I apply MeanTransform to the lag-365 column. I also want MeanTransform to account for weekly seasonality, so I set seasonality to 7 and window to 104. This means averaging the same weekday over the last 2 years: for Mondays it is the average of the 104 prior Mondays, for Tuesdays the average of the 104 prior Tuesdays, and so on.

Now let’s look at the holidays. The competition says the stores are located in three countries, so I add a holiday feature for each of them. However, I expect sales to behave differently around the holidays, so I apply lags to the holiday features as well.
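Putting the feature set together could look roughly like this; the holiday lag values, ISO codes, and out_column names are my assumptions for illustration:

```python
from etna.transforms import HolidayTransform, LagTransform, MeanTransform

HORIZON = 365

transforms = [
    # target lags 365..371: last year's values plus the week around them
    LagTransform(in_column="target", lags=list(range(HORIZON, HORIZON + 7)), out_column="lag"),
    # same-weekday average over the previous two years, built on the lag-365 column
    MeanTransform(in_column="lag_365", window=104, seasonality=7, out_column="mean"),
    # holiday flags for the three countries
    HolidayTransform(iso_code="FIN", out_column="holiday_FIN"),
    HolidayTransform(iso_code="NOR", out_column="holiday_NOR"),
    HolidayTransform(iso_code="SWE", out_column="holiday_SWE"),
    # lagged holiday flags so the model can see the days after a holiday
    LagTransform(in_column="holiday_FIN", lags=[1, 2], out_column="holiday_FIN_lag"),
    LagTransform(in_column="holiday_NOR", lags=[1, 2], out_column="holiday_NOR_lag"),
    LagTransform(in_column="holiday_SWE", lags=[1, 2], out_column="holiday_SWE_lag"),
]
```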

And finally, let’s move on to training.

LinearPerSegmentModel is the model I use. PerSegment means that a separate linear model is created and fitted for every segment.

Pipeline makes transforms and models work together. It also allows running a backtest on the series, which makes a researcher’s life easier.

SMAPE is our metric; we can interpret it as the percentage error of the model’s predictions.
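A sketch of the setup:

```python
from etna.models import LinearPerSegmentModel
from etna.pipeline import Pipeline

model = LinearPerSegmentModel()
pipeline = Pipeline(model=model, transforms=transforms, horizon=HORIZON)
```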

I run a backtest to check the model’s performance, calculate metrics and plot the forecasts.
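Roughly like this (the number of folds is my choice for illustration):

```python
from etna.metrics import SMAPE

metrics_df, forecast_df, fold_info_df = pipeline.backtest(ts=ts, metrics=[SMAPE()], n_folds=3)
print(metrics_df)
```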

I use the plot_backtest function, one of ETNA’s plotting functions, to compare real and predicted values. Our model seems okay, so let’s fit on the whole dataset and forecast the future! For that I use the already configured pipeline.
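A sketch of both steps (history_len only controls how much history is drawn; the value is arbitrary):

```python
from etna.analysis import plot_backtest

# compare real and predicted values on the backtest folds
plot_backtest(forecast_df, ts, history_len=70)

# fit on the whole dataset and forecast a year ahead
pipeline.fit(ts)
forecast_ts = pipeline.forecast()
```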

The final step is to convert the forecast to Kaggle submission format.
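One way to do the conversion, assuming the segment naming from earlier and the competition’s test.csv layout:

```python
import pandas as pd

# flatten the wide forecast back to long format
forecast_flat = forecast_ts.to_pandas(flatten=True)
forecast_flat[["country", "store", "product"]] = forecast_flat["segment"].str.split("_", expand=True)
forecast_flat = forecast_flat.rename(columns={"timestamp": "date", "target": "num_sold"})

# attach row ids from the test file and write the submission
test = pd.read_csv("test.csv", parse_dates=["date"])
submission = test.merge(forecast_flat, on=["date", "country", "store", "product"])
submission[["row_id", "num_sold"]].to_csv("submission.csv", index=False)
```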

Instead of a conclusion

We also created a robust Kaggle notebook for the Tabular Playground Series - Jan 2022 competition.

In this tutorial we have learned:

  • how to use TSDataset
  • how to use lags, statistics, date flags, holiday flags, and the logarithm transform
  • how to use models and pipelines
  • how to use backtest and plots
  • how to forecast a series

In future tutorials we will talk about more advanced transforms and time series analysis tools. Stay tuned :-)

If you’d like to propose a new feature or ask a question, you are welcome on our GitHub page!
