Scalable time series forecasting

Moid Hassan
Data Science at Microsoft
10 min read · Jun 21, 2022

By Moid Hassan and Sourav Khemka

In an earlier article one of us showed how Prophet can be used to forecast at scale and detect anomalies. Because that article was well received and scalable time series forecasting has many applications, we decided to write additional articles that present and compare different ways to solve this problem. This is the first of them.

Of course, forecasting-based anomaly detection means that the crux of the problem is about forecasting, and so moving forward we drop anomaly detection from our problem statement. Also, to make the problem more generalizable, instead of doing single horizon forecasts we will make multi-horizon forecasts.

Mathematically, we can now define the problem as:

Given N univariate time series representing daily data, generate forecasts for days T_i, T_{i+1}, T_{i+2}, …, T_{i+n}.

We use Temporal Fusion Transformers (TFT) as the forecasting model and show how they outperform the approach discussed in the previous article.

Introduction to TFT

A Temporal Fusion Transformer (TFT) is an attention-based Deep Neural Network, optimized for superior performance in multi-horizon time series forecasting while producing interpretable insights into temporal dynamics. In benchmark testing, TFT has outperformed traditional statistical models such as Auto Regressive Integrated Moving Average (ARIMA) as well as Deep Neural Network (DNN)–based models such as DeepAR, MQRNN, and Deep State-Space Models.

Multi-horizon forecasting is the prediction of variables of interest at multiple future time steps, and it is a crucial problem within time series Machine Learning. For example, retailers can use demand prediction to optimize their supply chain; investment managers can forecast future prices of financial instruments to maximize the performance of their portfolios; and cloud companies can use multi-horizon forecasts of cloud usage to anticipate future demand or surges and strengthen their infrastructure to mitigate outages. In contrast to "one step ahead" prediction (i.e., traditional time series forecasting), multi-horizon forecasting provides users with estimates across an entire forecast path, allowing the optimization of actions at multiple steps in the future. This is useful to retailers, for example, for optimizing inventory for an entire upcoming season.

Another salient feature of TFT is its interpretability, which is generally absent in most traditional time series and DNN-based forecasting methods. Yet another is its provision of prediction intervals that can be used for optimizing decisions by yielding an indication of the likely worst-case values (lower and upper bounds) that the target can take.

TFT uses quantile regression to produce a quantile forecast for each time step. By default, TFT's PyTorch implementation provides forecasts for the second, tenth, twenty-fifth, fiftieth, seventy-fifth, ninetieth, and ninety-eighth percentiles at each time step.
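For instance, in the pytorch_forecasting library these defaults can be inspected or overridden when constructing the loss. A minimal sketch, assuming pytorch_forecasting is installed:

```python
# Configuring quantile outputs with pytorch_forecasting; the list below
# matches the library defaults described above.
from pytorch_forecasting.metrics import QuantileLoss

# 2nd, 10th, 25th, 50th, 75th, 90th, and 98th percentiles.
loss = QuantileLoss(quantiles=[0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98])

# A TemporalFusionTransformer trained with this loss emits one value per
# quantile at each forecast step: index 0 is the 2nd percentile and
# index 3 is the median.
```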

The prediction intervals can further be applied to anomaly detection: a user defines lower and upper limits of the forecast, and when the actual value of a forecasted point lies outside this range, the point can be considered anomalous. To scope this article and continue the practice of the first article, we use only univariate time series for forecasting, but please note that TFT also works with multivariate time series.
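To make the interval-based anomaly check concrete, here is a minimal sketch with hypothetical arrays for the actuals and the model's lower and upper quantile forecasts:

```python
import numpy as np

# Hypothetical actuals and 2nd/98th percentile forecasts for seven days.
actuals = np.array([102.0, 98.0, 250.0, 101.0, 97.0, 99.0, 103.0])
lower = np.array([90.0, 88.0, 89.0, 91.0, 87.0, 88.0, 90.0])
upper = np.array([115.0, 112.0, 114.0, 116.0, 111.0, 113.0, 115.0])

# A point is anomalous when the actual falls outside the forecast band.
is_anomaly = (actuals < lower) | (actuals > upper)
print(np.where(is_anomaly)[0])  # day 2 (the value 250.0) is flagged
```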

TFT architecture

There are multiple building blocks (components) in TFT that specialize in finding patterns in univariate (as well as in multivariate) time series.

TFT architecture. Source: https://arxiv.org/pdf/1912.09363.pdf on arxiv.org.

There are five major building blocks of TFT, which we review in turn:

Gating mechanisms

  • TFT proposes a unit called a Gated Residual Network (GRN) that skips over any unused component of the model (as learned from the data), providing adaptive depth and network complexity to accommodate a wide range of datasets and variations.
  • The amount of non-linear processing needed depends on the data: some datasets require complex modeling of the input relationships, while for small or noisy datasets simpler models can be beneficial. GRN gives the model the flexibility to apply non-linear processing only where needed: it learns from the data whether to pass inputs through the dense layers or to use the residual connection to skip them.
  • GRN has two dense layers and two types of activation functions: the Exponential Linear Unit (ELU) and the Gated Linear Unit (GLU). ELU acts as an identity function for large positive inputs and produces a constant output for large negative ones, resulting in linear layer behavior, while GLU allows TFT to control the extent to which the GRN contributes to its original input, potentially skipping over the entire GRN layer if necessary (see the sketch after this list).
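To make the mechanism concrete, here is a minimal PyTorch sketch of a GRN. It is simplified from the paper: it omits the optional context input and assumes equal input and output dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualNetwork(nn.Module):
    """Simplified GRN: dense -> ELU -> dense -> GLU gate -> residual + LayerNorm."""

    def __init__(self, d_model: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_model)
        self.fc2 = nn.Linear(d_model, d_model)
        # The GLU halves its input width, so the gate layer doubles it first.
        self.gate = nn.Linear(d_model, 2 * d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.elu(self.fc1(x))
        h = self.fc2(h)
        # When the GLU gate output is near zero, the residual connection
        # dominates and the non-linear block is effectively skipped.
        h = F.glu(self.gate(h), dim=-1)
        return self.norm(x + h)

grn = GatedResidualNetwork(d_model=16)
out = grn(torch.randn(8, 16))  # shape: (8, 16)
```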

Variable Selection Networks (VSNs)

  • VSNs are used to select relevant input variables at each time step, removing any unnecessary noisy inputs that might have a negative impact on performance. Most real-world datasets contain many variables that might not have predictive power, and so removing them helps the model depend only on variables with more predictive power.
  • VSNs encode the variables based on their types, categorical or numerical. For categorical variables entity embeddings are used, while numerical variables are put through linear transformations. Each input variable is transformed into a d-dimensional vector that matches the dimensions in subsequent layers for skip connections (a simplified sketch follows this list).
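A minimal sketch of this per-variable encoding step, assuming one hypothetical categorical input (a store identifier) and one numerical input, both mapped to a shared dimension d:

```python
import torch
import torch.nn as nn

d = 16  # shared embedding dimension used throughout the network

# Categorical variable: entity embedding (hypothetical store id, 100 values).
store_embedding = nn.Embedding(num_embeddings=100, embedding_dim=d)

# Numerical variable: linear transformation of the scalar value.
value_projection = nn.Linear(1, d)

store_ids = torch.tensor([3, 41])      # batch of two categorical inputs
values = torch.tensor([[0.5], [1.2]])  # batch of two numerical inputs

store_vec = store_embedding(store_ids)  # shape: (2, 16)
value_vec = value_projection(values)    # shape: (2, 16)

# Both variables now live in the same d-dimensional space, so downstream
# layers and skip connections can treat them uniformly; the VSN then learns
# softmax weights over variables to decide how much each one contributes.
```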

Static Covariate Encoders

  • These components of TFT are designed to integrate the information from static metadata, using separate encoders to produce four different context vectors: Cs, Ce, Cc, and Ch. These vectors are then fed into the decoders.
  • Cs is the context vector for temporal variable selection, Cc and Ch are the context vectors for local processing of temporal features, and Ce is the context vector used to enrich the temporal features with static information (a simplified sketch follows this list).
  • Static features can have an important impact on forecasts. For example, various store locations could have different temporal dynamics for sales (a rural store might see higher weekend traffic, but a downtown store might see daily peaks after working hours).
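A simplified sketch of how one static representation yields the four context vectors; the paper uses GRNs for each encoder, while this sketch substitutes plain linear layers for brevity:

```python
import torch
import torch.nn as nn

d = 16
static = torch.randn(8, d)  # encoded static metadata for a batch of 8 series

# Four separate encoders (GRNs in the paper, linear layers here) produce
# the four context vectors from the same static representation.
encoders = nn.ModuleDict({name: nn.Linear(d, d) for name in ["c_s", "c_e", "c_c", "c_h"]})
contexts = {name: encoder(static) for name, encoder in encoders.items()}

# c_s conditions temporal variable selection, c_c and c_h initialize the
# seq2seq layer's cell and hidden states, and c_e enriches temporal features.
```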

Temporal processing

  • Temporal processing is used to learn both long- and short-term temporal relationships from both observed and known time-varying inputs.
  • A sequence-to-sequence (seq2seq) layer is used for locality enhancement, because surrounding values are often important for identifying events such as anomalies, change points, and cyclical patterns.
  • A multi-head attention block is used to learn long-term relationships across different time steps (a rough sketch of both stages follows this list).
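As a rough sketch of these two stages (not the full TFT wiring), an LSTM handles local patterns and multi-head self-attention captures long-range ones:

```python
import torch
import torch.nn as nn

d = 16
batch, seq_len = 8, 54  # e.g., 54 daily observations per series

x = torch.randn(batch, seq_len, d)  # encoded time-varying inputs

# Local (short-term) processing: a recurrent layer over the sequence.
lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)
local, _ = lstm(x)  # shape: (8, 54, 16)

# Long-term relationships: self-attention across all time steps.
attention = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
context, weights = attention(local, local, local)  # weights: (8, 54, 54)

# Each output step can now attend to any other step; the real TFT masks
# future positions in the decoder and uses an interpretable variant of
# multi-head attention that shares values across heads.
```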

Prediction intervals

  • In time series forecasting, predicting only the target is insufficient; it is also important to estimate the uncertainty of the predictions.
  • With prediction intervals enabled in the output of TFT, the standard regression MSE/MAE loss function is replaced by quantile loss, which is defined as:
  • Quantile loss: max(q * (y - y_pred), (1 - q) * (y_pred - y))
  • To understand quantile loss in detail, refer to this blog; a minimal implementation sketch follows this list.
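A minimal sketch of this pinball (quantile) loss, matching the formula above:

```python
import numpy as np

def quantile_loss(y: np.ndarray, y_pred: np.ndarray, q: float) -> np.ndarray:
    """Pinball loss: under-prediction is weighted by q, over-prediction by (1 - q)."""
    return np.maximum(q * (y - y_pred), (1 - q) * (y_pred - y))

y, y_pred = np.array([100.0]), np.array([90.0])
print(quantile_loss(y, y_pred, q=0.9))  # [9.] high quantile punishes under-prediction
print(quantile_loss(y, y_pred, q=0.1))  # [1.] low quantile tolerates it
```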

TFT interpretability

Regarding variable importance, we can observe how different variables affect the target by examining their learned weights, in a way similar to how feature importance is interpreted in tree-based ensemble methods. For example, when predicting retail sales for a given store, the largest weights among static variables were the specific store and item, while the largest weights among future variables were promotion period and national holiday.

Variable Importance. Source: https://arxiv.org/pdf/1912.09363.pdf on arxiv.org.

The illustration below shows how variables like holidays and store closings pull the predictions down relative to normal days, mirroring actual dips in sales.

Model Interpretation. Source: https://ai.googleblog.com/2021/12/interpretable-deep-learning-for-time.html on ai.googleblog.com.

Data

We have 140,169 univariate time series, with 54 days of historical data for training the model and seven days for forecasting and evaluating it. For training, the series were split 80 percent for training, 10 percent for validation, and 10 percent for test; for inference, all the time series were used.

Hardware

We use a virtual machine with a Tesla K80 GPU, six CPU cores, and 56 GB of RAM for training and inferencing.

TFT model training and evaluation structure

For training we split the 54 days of historical data into a 47-day encoder window and a seven-day decoder window. For inference/evaluation we shift both windows ahead by seven days, so the seven-day horizon is predicted from the preceding 47 days fed to the encoder.

Model training and inferencing structure. Illustration by author Moid Hassan.
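In the PyTorch implementation we used, this encoder/decoder split maps to the max_encoder_length and max_prediction_length arguments of pytorch_forecasting's TimeSeriesDataSet. A sketch, assuming a long-format pandas DataFrame df with hypothetical columns time_idx, series_id, and value:

```python
from pytorch_forecasting import TimeSeriesDataSet

# df is assumed to hold one row per (series, day): an integer time index,
# a series identifier, and the target value.
training = TimeSeriesDataSet(
    df,
    time_idx="time_idx",      # hypothetical integer day-index column
    target="value",           # hypothetical target column
    group_ids=["series_id"],  # hypothetical series-identifier column
    max_encoder_length=47,    # 47 days of history fed to the encoder
    max_prediction_length=7,  # seven-day horizon predicted by the decoder
    time_varying_unknown_reals=["value"],
)
```

From such a dataset, the library's TemporalFusionTransformer.from_dataset method can construct a matching model.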

Metrics for comparing TFT and Prophet

Useful comparison metrics include Mean Absolute Error, Symmetric Mean Absolute Percentage Error, standard deviation of SMAPE, accuracy, and compute cost. We briefly review each in turn.

Mean Absolute Error (MAE)

Mean Absolute Error is the average of all absolute errors. The formula is:

Absolute Error and Mean Absolute Error. Source: https://www.statisticshowto.com/absolute-error/ on statisticshowto.com.

Where:

  • n = number of observations in the data
  • ∑ = summation over all observations
  • |x_i - x| = the absolute errors

Symmetric Mean Absolute Percentage Error (SMAPE)

SMAPE is used as an alternative to MAPE (the latter penalizes over-forecasts more heavily than under-forecasts, because the percentage error of an under-forecast cannot exceed 100 percent while that of an over-forecast is unbounded) and has both a lower bound (0 percent) and an upper bound (200 percent); hence it's called symmetric MAPE. The formula is:

Definition of SMAPE. Source: How to find symmetric mean absolute error in python? on stackoverflow.com.
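For reference, minimal NumPy implementations of both metrics; this SMAPE variant divides by the mean of the absolute actual and forecast values, which gives the 0 to 200 percent bounds mentioned above:

```python
import numpy as np

def mae(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Mean Absolute Error: the average of the absolute errors."""
    return float(np.mean(np.abs(actual - forecast)))

def smape(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Symmetric MAPE in percent, bounded between 0 and 200."""
    denominator = (np.abs(actual) + np.abs(forecast)) / 2.0
    return float(np.mean(np.abs(actual - forecast) / denominator) * 100.0)

actual = np.array([100.0, 110.0, 120.0])
forecast = np.array([90.0, 115.0, 130.0])
print(mae(actual, forecast))    # ~8.33
print(smape(actual, forecast))  # ~7.66
```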

Standard deviation of SMAPE

The standard deviation of SMAPE measures how each series' SMAPE, averaged across horizons, is spread across all of the more than 140,000 time series.

Accuracy

The percent of accurate time series refers to the percentage of time series whose SMAPE, averaged across horizons, is below 10 percent, five percent, or one percent.
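A sketch of this computation, assuming a hypothetical array holding each series' SMAPE averaged across the seven horizons:

```python
import numpy as np

# Hypothetical per-series average SMAPE values (one entry per series;
# in practice there would be ~140K of them).
avg_smape = np.array([0.4, 3.2, 7.8, 12.5, 0.9, 4.6])

for threshold in (10, 5, 1):
    pct = float(np.mean(avg_smape < threshold) * 100.0)
    print(f"series with average SMAPE < {threshold}%: {pct:.1f}%")
```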

Compute cost

Because compute resources are not free, compute cost refers to the time taken to train the model and run inference, multiplied by the hourly cost of the virtual machine used for the work.

Comparative results

To understand performance, we look at overall results along with some alternatives.

Overall results

In looking at all 140,169 time series with seven-day forecasting, the overall numbers, with errors calculated across all horizons without segregation, are as follows:

Comparing SMAPE, MAE and Cost.

From this information, we see the following improvements:

  • TFT improved the MAE by approximately 38 percent.
  • TFT improved the SMAPE by 6.5 percent.
  • TFT cut the compute cost by 80 percent.

Results across horizons (days)

We see the following results across horizons pertaining to SMAPE and MAE:

From this information, we see the following:

  • SMAPE for Prophet is better when the forecast horizon is small, with the error increasing as the horizon (in days) grows. This is not the case for TFT, where SMAPE is stable across all forecasted horizons; for the last four horizons, SMAPE is better for TFT than for Prophet.
  • MAE for TFT is much lower than for the Prophet model across all horizons. Although TFT's MAE also degrades as we move across horizons, the increase is much smaller, which gives us more confidence in TFT's forecasts over several horizons.

Standard deviation of SMAPE

A comparison of the standard deviation of SMAPE between Prophet and TFT shows a lower value for TFT:

Percent of accurate time series

A few accuracy measures are as follows:

From this information, we see the following:

  • The percentage of time series with average SMAPE (across horizons) below 10 percent is one percentage point higher for Prophet, while the percentage with average SMAPE below five percent is two percentage points lower for Prophet than for TFT.
  • The percentage of time series with average SMAPE below one percent is 24 percent higher for TFT than for Prophet. This tells us that TFT captures the patterns in the historical data more accurately than the Prophet model.

Conclusion

In this article, our aim has been to create forecasts for multiple univariate time series in a scalable manner, focusing on multi-horizon forecasts instead of single horizon forecasts. We used the Temporal Fusion Transformer (TFT) to solve this problem given its capability to generate accurate multi-horizon forecasts with computational efficiency.

Because traditional forecasting methods do not scale with an increase in the number of time series, we compared the results of TFT with the optimized Prophet approach discussed in the earlier article and showed that TFT outperforms it on almost all metrics when doing multi-horizon forecasts. Prophet's error also grows much faster than TFT's as the forecast horizon increases, so one might consider the optimized Prophet approach over TFT only for a horizon of one to two days, and even then TFT can be preferred given its cost and computational efficiency.

Thanks for reading this article. We welcome any constructive feedback you may have. Please leave your feedback in the Comments section below. We are also on LinkedIn, with links to our profiles in our byline at the beginning of this article.


Moid Hassan
Data Scientist II, Microsoft. Data Science and Analytics professional with 5+ years of experience.