ETNA: Time Series Analysis. What, why and how?

Alexander Chikov
Published in IT’s Tinkoff · Jul 17, 2022

Hi there! My name is Sasha and I develop ETNA, a time series forecasting package. This article is dedicated to the EDA (exploratory data analysis) methods in the ETNA package. I will show how to find something interesting in your data with the help of these methods and explain how to use these findings to improve your predictive model.

About the dataset

Today we will try to find something interesting in the time series dataset from the Web Traffic Time Series Forecasting competition on Kaggle. The aim of the competition is to forecast future web traffic for approximately 145,000 Wikipedia articles. The quality of predictions is evaluated with SMAPE, so we will also focus on this metric.

If you can’t wait to build a baseline for this competition, you are welcome to read our previous article. There we explained how to forecast time series with our library and how to use additional data for this purpose. If the baseline has already been built, the metrics have stopped improving and there is no joy in your life, it’s time to take a step back and proceed to EDA.

Outliers, missing values and the other beasts

Missing values

By carefully reading the description of the competition, you can find out that there are missing values in the dataset. The caring organizers tell us that missing values may arise either from the absence of data for that day or simply from zero traffic. Let’s figure out how to deal with this unpleasant fact and visualize a few popular gap-filling strategies.

Filling all the gaps with zeros doesn’t look really sensible:

There might be zero visits on some days, but we cannot accurately determine such days, so filling in all the gaps with zeros is not the right strategy.

The forward-fill strategy looks more attractive; however, it may work poorly if the gap spans an entire interval:

By filling in the gaps with the last known value, we assume that the number of visits on the current day does not differ from the number of visits on the previous one. This assumption is crude, but it avoids the appearance of unexpected zeros. However, if the gap spans an entire interval, replacing it with a constant is implausible: we want to keep the trend.

The moving average of the last 3 days looks like a perfect candidate:

Filling in the gaps with the moving average of the last three days allows us to capture the current trend in the number of visits. The previous strategy did not take the current trend into account, which was a significant drawback.
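In ETNA, all of these strategies are available through the TimeSeriesImputerTransform. Below is a minimal sketch (ETNA ~1.x API, where transforms are applied via TSDataset.fit_transform; names may differ slightly in other versions), assuming df is a long dataframe with "timestamp", "segment" and "target" columns:

```python
from etna.datasets import TSDataset
from etna.transforms import TimeSeriesImputerTransform

# df is assumed to be a long dataframe with "timestamp", "segment" and "target" columns
ts = TSDataset(TSDataset.to_dataset(df), freq="D")

# The three gap-filling strategies discussed above
fill_zero = TimeSeriesImputerTransform(in_column="target", strategy="zero")
fill_forward = TimeSeriesImputerTransform(in_column="target", strategy="forward_fill")
fill_running_mean = TimeSeriesImputerTransform(in_column="target", strategy="running_mean", window=3)

# Apply the chosen strategy; in ETNA 1.x transforms are fitted through the dataset
ts.fit_transform([fill_running_mean])
```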

Outliers

So far we have dealt with the missing values; next we are going to focus on outliers and anomalies. We define outliers as unexplainable peaks at random time points.

Usually they occur as a result of data source failures or mistakes in dataset collection and do not reflect the general behavior of the time series. There is a vast number of strategies for handling outliers, to suit every taste: for example, you can add a special indicator feature, treat outliers like missing values, or simply drop them from the dataset. We will use the second strategy.

Our library contains a set of outlier detection methods, and curious readers are welcome to explore the corresponding notebook. Unfortunately, all the methods suffer from the same “ailment”: they require hyperparameter selection, so we cannot avoid this step either.
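For example, the density-based method used below can be run like this (a sketch; the hyperparameter values are illustrative and usually need tuning for your data):

```python
from etna.analysis import get_anomalies_density, plot_anomalies

# Points that have too few "close" neighbours inside a sliding window are flagged as outliers;
# the hyperparameter values here are illustrative, not tuned
anomaly_dict = get_anomalies_density(ts, window_size=45, distance_coef=1, n_neighbors=25)

# Plot the series with the detected outliers highlighted
plot_anomalies(ts, anomaly_dict)
```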

Visualization of the outliers found by the density-based method.

By the way, should we actually care about the outliers? (Spoiler: of course we should, what are you talking about?) Let’s predict our time series 6 weeks ahead, first leaving the outliers in place and then “correcting” them as discussed above.

SMAPE:

  • With outliers: 33.85
  • Without outliers: 28.88

Outlier processing improved the quality by about 15%! Seems like an important step, doesn’t it?
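The “correction” above boils down to turning the detected outliers into missing values and imputing them again. A sketch of such a pipeline, assuming the same density-based detector and running-mean imputer as before (the forecasting model chosen here is illustrative):

```python
from etna.metrics import SMAPE
from etna.models import ProphetModel  # the forecasting model here is an illustrative choice
from etna.pipeline import Pipeline
from etna.transforms import DensityOutliersTransform, TimeSeriesImputerTransform

HORIZON = 42  # 6 weeks ahead

# "Correct" the outliers: mark them as missing and impute them, as discussed above
transforms = [
    DensityOutliersTransform(in_column="target", window_size=45, distance_coef=1, n_neighbors=25),
    TimeSeriesImputerTransform(in_column="target", strategy="running_mean", window=3),
]

pipeline = Pipeline(model=ProphetModel(), transforms=transforms, horizon=HORIZON)
metrics_df, forecast_df, fold_info_df = pipeline.backtest(ts=ts, metrics=[SMAPE()], n_folds=3)
```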

Anomalies

We cannot predict the appearance of outliers in the time series due to their random nature; however, there are other anomalous areas in the series. Some of them do not differ from outliers at first glance (for example, peaks on holidays), but this behavior is expected, and we can try to teach the model to handle such occasions. Let’s figure out how to handle special events using the example of a time series associated with one TV show.

The number of visits to the show’s page per day. Peaks correspond to the releases of new seasons.

The abnormal behavior at the beginning of the year is nothing more than the release of a new season. Using the effect of the release of the first season in 2016, we want to estimate the page traffic during the release of the new season in 2017. The famous Prophet, for example, can handle the effect of such events out of the box, but once again we need to select the hyperparameters.

Visualization of the boundaries of the impact of the release of the new season on page traffic.

The release dates are known for the next year, and we estimated the width of the window in which traffic is affected by a release in the previous step, so we are ready to make forecasts for 2017. Let’s try to do it in three ways.

  • Default parameters: the model managed to catch the weekly seasonality, but it failed to capture the release of the new season in the forecast.
  • Annual seasonality included: the annual seasonality allowed us to catch the peak associated with the release of the new season, since the release is an annual event. However, it did not capture the increase in dispersion within the peak.
  • Annual seasonality and event capture included: information about the release of the new season noticeably improved the quality of the forecast.
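A sketch of the third setup, assuming ETNA’s ProphetModel forwards its arguments (including the holidays dataframe) to Prophet; the release dates and window widths below are illustrative placeholders for the known schedule and the window width estimated above:

```python
import pandas as pd

from etna.models import ProphetModel
from etna.pipeline import Pipeline

HORIZON = 90  # roughly cover the 2017 release window

# Describe the season releases as a Prophet "holiday" with a window around each date;
# the dates and window widths below are illustrative placeholders
releases = pd.DataFrame(
    {
        "holiday": "season_release",
        "ds": pd.to_datetime(["2016-01-15", "2017-01-13"]),
        "lower_window": -7,   # traffic starts to grow shortly before the release
        "upper_window": 30,   # and stays elevated for about a month afterwards
    }
)

model = ProphetModel(yearly_seasonality=True, holidays=releases)
pipeline = Pipeline(model=model, transforms=[], horizon=HORIZON)
pipeline.fit(ts)
forecast = pipeline.forecast()
```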

And now let’s compare SMAPE:

  • Baseline: 171.94
  • Annual seasonality: 41.27
  • Annual seasonality + event: 20.65

Without handling the effect of a new season release in the model, the forecasts look quite poor. Adding the annual seasonality into the model significantly improves the picture, but the new-season features improve the metric by another factor of two.

Trend and seasonality

Well, the outliers have been beaten and the effect of the special events has been handled; what’s next? Next, let’s look at global properties of time series, such as trend and seasonality.

  • Seasonality — periodic fluctuations in the values of the time series.

We have already witnessed the metric improvement after including annual seasonality in the model. However, there is a suspicion that smaller seasonalities exist in the series; let’s verify this hypothesis. To determine the seasonality of the series, let’s try a couple of tools from our library. Surely most novice time series analysts have come across an autocorrelation plot. As usual, we look at the neighboring significant peaks and see the weekly seasonality.

Now let’s take a look at a more exotic plot, the so-called periodogram. It shows the amplitudes of the Fourier components of the series, with frequencies measured in cycles per year. We see a peak at point 52, typical of weekly seasonality (52 weeks in a year), as well as peaks near 1 and 2, indicating annual and semi-annual seasonality.

Periodogram of the series.
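Both plots are available in etna.analysis; a sketch (function names vary slightly between ETNA versions, e.g. sample_acf_plot was later renamed to acf_plot):

```python
from etna.analysis import plot_periodogram, sample_acf_plot

# Autocorrelation: significant peaks at lags 7, 14, 21, ... hint at weekly seasonality
sample_acf_plot(ts, n_lags=21)

# Periodogram with a one-year base period: peaks near 1, 2 and 52 correspond
# to annual, semi-annual and weekly seasonality
plot_periodogram(ts, period=365.25)
```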

This time, for the sake of diversity, let’s try out the CatBoost model. We will model the weekly seasonality with a weekday label and the annual/semi-annual seasonalities with Fourier features.
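A sketch of such a pipeline (class names follow ETNA ~1.x; in later versions the model is called CatBoostMultiSegmentModel):

```python
from etna.models import CatBoostModelMultiSegment
from etna.pipeline import Pipeline
from etna.transforms import DateFlagsTransform, FourierTransform

HORIZON = 42

transforms = [
    # Weekday label for the weekly seasonality
    DateFlagsTransform(day_number_in_week=True, day_number_in_month=False, is_weekend=False, out_column="flag"),
    # Two harmonics of a one-year period: annual and semi-annual seasonality
    FourierTransform(period=365.25, order=2, out_column="fourier"),
]

pipeline = Pipeline(model=CatBoostModelMultiSegment(), transforms=transforms, horizon=HORIZON)
pipeline.fit(ts)
forecast = pipeline.forecast()
```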

CatBoost forecast based on seasonal features.

As we can see, the model has successfully captured the annual and weekly seasonality; now it’s time to deal with the trend.

  • Trend — a global tendency in the change of time series characteristics.

Our library contains several trend modeling methods, which can be conveniently visualized (and, of course, included into the forecasting pipeline).
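For instance, a linear and a piecewise linear trend can be drawn with plot_trend; a sketch with ETNA ~1.x-style transforms, where the change-point detector and the number of breakpoints are illustrative choices:

```python
from ruptures import Binseg
from sklearn.linear_model import LinearRegression

from etna.analysis import plot_trend
from etna.transforms import ChangePointsTrendTransform, LinearTrendTransform

# An ordinary linear trend
linear_trend = LinearTrendTransform(in_column="target")

# A piecewise linear trend: change points are found by ruptures' Binseg,
# and a separate linear model is fitted on each interval
piecewise_trend = ChangePointsTrendTransform(
    in_column="target",
    change_point_model=Binseg(),
    detrend_model=LinearRegression(),
    n_bkps=5,
)

plot_trend(ts=ts, trend_transform=linear_trend)
plot_trend(ts=ts, trend_transform=piecewise_trend)
```

The same transforms can be added to the transforms list of the forecasting pipeline, which is exactly what we do in the comparison below.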

Visualization of a linear and piecewise linear trend. The graph shows that the trend of the series has changed several times, so it is better to model it with a piecewise linear function.

The visualization tells us that the trend of the series is best modeled by a piecewise linear function. Let’s check whether this assumption holds in practice. We will try out three strategies of capturing the trend: no trend modeling, trend modeling with a linear function, and trend modeling with a piecewise linear function. As before, we will compare the strategies with SMAPE:

  • Without trend modeling: 16.77
  • Linear trend: 19.10
  • Piecewise linear trend: 10.73

Using a piecewise linear function for trend modeling improves the quality of the forecast by 37% compared to the forecasting pipeline without trend modeling. At the same time, modeling the trend with a linear function even degrades the quality.

Model Quality Analysis

So far we have explored various features on individual segments, but it is worth recalling that the dataset consists of about 145k time series. Let’s take pity on our computational environment and make forecasts for a random sample of 100 of them.

Imagine that the data preprocessing stage is already over, you have built some kind of forecasting pipeline and want to evaluate how good it is. Let’s perform a backtest and visualize the results as we did before… although I’m not quite sure we really want to look at plots with 100 segments. Instead, let’s take a look at the distribution of the target metric.
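A sketch of this step, assuming pipeline is the forecasting pipeline built earlier and using the plotting helpers from etna.analysis:

```python
from etna.analysis import metric_per_segment_distribution_plot, plot_backtest
from etna.metrics import SMAPE

# Cross-validation over the history: per-segment metrics and out-of-fold forecasts
metrics_df, forecast_df, fold_info_df = pipeline.backtest(ts=ts, metrics=[SMAPE()], n_folds=3)

# Forecasts vs. actuals for every segment (hard to read with 100 of them)
plot_backtest(forecast_df, ts)

# The distribution of SMAPE over segments, split by fold
metric_per_segment_distribution_plot(metrics_df=metrics_df, metric_name="SMAPE")
```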

The metric is distributed in a similar way on different folds, which means that the model is stable. The graph also shows that the distribution of the metric has a wide spread, which means that the quality of the forecast varies from segment to segment.

Wow, nice spread. Now let’s visualize the metric for the best and the worst segments:

Ranking from best to worst segment in terms of SMAPE helps to select segments with high forecasting quality.
The 10 worst segments in terms of forecasting quality should be identified in order to analyze them separately.
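Both rankings can be built from the per-segment metrics dataframe returned by the backtest; a sketch:

```python
from etna.analysis import plot_metric_per_segment

# 10 segments with the lowest (best) SMAPE
plot_metric_per_segment(metrics_df=metrics_df, metric_name="SMAPE", ascending=True, top_k=10)

# 10 segments with the highest (worst) SMAPE
plot_metric_per_segment(metrics_df=metrics_df, metric_name="SMAPE", ascending=False, top_k=10)
```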

We can also easily examine the feature importance:

The most important features for the model are Fourier features.
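One way to obtain such a plot is plot_feature_relevance; a sketch under the assumption that ts already contains the feature columns produced by the transforms and that an auxiliary random forest is used to estimate relevance:

```python
from sklearn.ensemble import RandomForestRegressor

from etna.analysis import ModelRelevanceTable, plot_feature_relevance

# Relevance of each feature column with respect to the target, aggregated over segments;
# the auxiliary model is an illustrative choice
plot_feature_relevance(
    ts=ts,
    relevance_table=ModelRelevanceTable(),
    relevance_params={"model": RandomForestRegressor(n_estimators=100)},
    top_k=10,
)
```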

Improving the current pipeline is left as an exercise to the reader ;)

All in all

In this article, we examined various data features that can significantly affect forecasting quality, including anomalous behavior, outliers, seasonality, and trend. EDA methods help us detect these features; what is more, most of the interesting findings can be easily included into the forecasting pipeline and cheaply improve the quality of our model. I hope I have managed to convince you that EDA is a useful step in building a pipeline.

If you enjoyed the article, feel free to leave stars on our GitHub. For those who want to dig into the data on their own and try out all our EDA methods in action, I am attaching the code for downloading the data.
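A minimal sketch of such loading code, assuming the Kaggle CLI is configured and using the train_1.csv file from the competition’s data page:

```python
# Download the data first, e.g. with the Kaggle CLI:
#   kaggle competitions download -c web-traffic-time-series-forecasting
import pandas as pd

from etna.datasets import TSDataset

# train_1.csv is wide: one row per article ("Page"), one column per date
df = pd.read_csv("train_1.csv")
df = df.melt(id_vars="Page", var_name="timestamp", value_name="target")
df = df.rename(columns={"Page": "segment"})

# Build the ETNA dataset with daily frequency
ts = TSDataset(TSDataset.to_dataset(df), freq="D")
```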
