Coffee Time Papers: Long-term Forecasting with Time-series Dense Encoder

Dagang Wei
Jun 19, 2024



This blog post is part of the series Coffee Time Papers.

Paper

https://arxiv.org/abs/2304.08424

Overview

The paper proposes TiDE (Time-series Dense Encoder), a new deep learning model for long-term time-series forecasting. The model is designed to address the limitations of existing methods, such as the suboptimal performance of Transformers in long-term forecasting and the inability of linear models to capture non-linear dependencies and incorporate covariates.

Key Features and Contributions

  • Architecture: TiDE is an encoder-decoder model based on Multi-layer Perceptrons (MLPs). It encodes past time-series data and covariates using dense MLPs, and then decodes the encoded information along with future covariates to generate predictions.
  • Theoretical Analysis: The authors provide a theoretical analysis of a simplified linear version of TiDE. They prove that this linear model can achieve near-optimal error rates for linear dynamical systems (LDS) under certain assumptions. This analysis helps to explain why simple linear models can sometimes outperform more complex Transformer-based models in long-term forecasting.
  • Empirical Evaluation: TiDE is evaluated on seven popular long-term forecasting benchmark datasets. It demonstrates superior or comparable performance to state-of-the-art Transformer-based models while being significantly faster (5–10x) in terms of both inference and training time.
  • Handling Covariates: Unlike linear models, TiDE can effectively incorporate both static and dynamic covariates, which are often crucial for accurate forecasting in real-world scenarios.
  • Temporal Decoder: A unique feature of TiDE is the temporal decoder, which allows for direct adaptation to future covariates at each time step. This is particularly useful when certain covariates have a strong and immediate impact on the predicted values.

Experimental Results

  • Long-term Forecasting Benchmarks: TiDE outperforms or matches the performance of existing methods on benchmark datasets, including Electricity, Traffic, Weather, and ETT. It is particularly effective on the largest dataset (Traffic), where it significantly outperforms the best Transformer-based model.
  • Demand Forecasting (M5 Competition): TiDE showcases its ability to handle complex covariates in the M5 forecasting competition. It outperforms both DeepAR (a model designed for handling covariates) and PatchTST (the best-performing model from the benchmarks) by a substantial margin.
  • Efficiency: TiDE is shown to be much more efficient than PatchTST in terms of both training and inference time, especially for long context lengths. This efficiency is attributed to the linear scaling of TiDE’s computational complexity with respect to context and horizon lengths.

Q & A

Q: What is the main objective of the paper?

A: The main objective of the paper is to introduce TiDE (Time-series Dense Encoder), a new deep learning model designed for long-term time-series forecasting. The model aims to address the limitations of existing methods, such as the suboptimal performance of Transformers in long-term forecasting and the inability of linear models to capture non-linear dependencies and incorporate covariates.

Q: What’s the related prior work?

A: The paper discusses several categories of prior work in long-term time-series forecasting:

1) Multivariate Models: These models predict the future of all time-series variables jointly, considering their interdependencies. Examples include classical VAR models and deep learning models like LongTrans, Informer, Autoformer, FEDformer, and Pyraformer. These deep learning models often employ variations of the Transformer architecture with modifications to handle long sequences efficiently.

2) Univariate Models: These models predict the future of each time-series variable independently, focusing on its own past values and potentially incorporating covariates.

  • Local Univariate Models: These are trained and used for inference on a per-time-series basis. Classical models like AR, ARIMA, and exponential smoothing fall into this category.
  • Global Univariate Models: These ignore variable-specific information and train a single shared model for all time series. Deep learning architectures like N-BEATS and its extension N-HiTS are examples of global univariate models.

3) Linear Models: Recent work has shown that simple linear models, such as DLinear, can outperform complex Transformer-based models in some long-term forecasting benchmarks. However, linear models have limitations in modeling non-linear dependencies and incorporating covariates.

4) Models for Long-Range Dependencies: Research has also focused on improving the ability of Recurrent Neural Networks (RNNs) and State Space Models (SSMs) to capture long-range dependencies in sequences. While these models have shown promise in other domains, their application to time-series forecasting is still being explored.

The paper positions TiDE as a new approach that aims to combine the strengths of linear models and deep learning architectures while addressing their limitations. It incorporates non-linearity through MLPs, effectively handles covariates, and introduces a novel temporal decoder for adapting to future covariates.

Q: Is TiDE a multivariate or univariate model?

A: TiDE is a global univariate model. Although it is trained on all time series in the dataset, at inference time it predicts the future of each time-series variable as a function only of that series' own past values and its covariate features.

Q: What are the key observations that led to TiDE?

A: The development of the TiDE model was motivated by several key observations from prior work in time-series forecasting:

  • Transformers’ Limitations: While Transformers have shown great success in various sequence modeling tasks, their performance in long-term time-series forecasting has been suboptimal. This observation suggests that the self-attention mechanism, a core component of Transformers, may not be the most effective approach for capturing long-range dependencies in time-series data.
  • Linear Models’ Strengths: Simple linear models have surprisingly outperformed complex Transformer-based models in some long-term forecasting benchmarks. This highlights the potential of linear models for capturing certain patterns in time-series data, such as trends and seasonality.
  • Need for Non-Linearity and Covariates: Linear models, however, have limitations in modeling non-linear dependencies and incorporating covariates, which are often essential for accurate forecasting. This observation indicates the need for a model that combines the simplicity and speed of linear models with the ability to handle non-linearity and covariates.

These observations led the authors to propose TiDE, a model that leverages the strengths of both linear models and deep learning architectures. By incorporating MLPs for non-linearity and explicitly handling covariates, TiDE aims to achieve superior performance in long-term time-series forecasting while maintaining efficiency and simplicity.

Q: What’s the architecture of TiDE?

A: The TiDE architecture is an encoder-decoder model based on Multi-layer Perceptrons (MLPs). It is designed to process time-series data and covariates to generate long-term forecasts. The architecture consists of the following key components:

Residual Block: This is the fundamental building block of the model. It comprises an MLP with one hidden layer and ReLU activation, along with a skip connection. Dropout is applied to the linear layer connecting the hidden layer to the output, and layer normalization is used at the output.
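As a rough illustration, here is a minimal PyTorch sketch of such a residual block. This is our reading of the description above, not the authors' code; the layer widths and dropout rate are placeholders.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of TiDE's residual block: an MLP with one hidden layer
    and ReLU, dropout on the output linear layer, a linear skip
    connection, and layer normalization at the output."""

    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int,
                 dropout: float = 0.1):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        self.skip = nn.Linear(input_dim, output_dim)  # linear skip connection
        self.norm = nn.LayerNorm(output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.hidden(x))
        # residual sum of the MLP path and the linear skip, then layer norm
        return self.norm(self.dropout(self.out(h)) + self.skip(x))
```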

Encoding:

  • Feature Projection: A residual block is employed to map the dynamic covariates at each time step into a lower-dimensional space. This dimensionality reduction step helps manage the potentially large size of the input vector.
  • Dense Encoder: The past and future projected covariates, along with static attributes and past time-series values, are stacked, flattened, and concatenated. This concatenated input is then passed through an encoder consisting of multiple residual blocks to generate a dense representation of the features.

Decoding:

  • Dense Decoder: The dense representation from the encoder is fed into a decoder composed of several residual blocks, similar to the encoder. The output of the decoder is a vector that is reshaped into a matrix, where each column represents the decoded vector for a specific time period in the horizon.
  • Temporal Decoder: This component generates the final predictions by combining the decoded vector for each time step with the projected covariates of that time step. It acts as a “highway” from future covariates to the prediction, allowing for direct adaptation and potentially improving accuracy when covariates have a strong immediate effect.

Additionally, a global residual connection linearly maps the look-back (past time-series values) to a vector the size of the horizon, which is then added to the final prediction. This ensures that a purely linear model is always a subclass of the TiDE model.
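Putting the pieces together, the sketch below (reusing the ResidualBlock and imports above) shows one hypothetical wiring of the forward pass. The paper stacks several residual blocks in the encoder and decoder and also consumes static attributes; both are omitted here for brevity, and all dimensions are illustrative.

```python
class TiDESketch(nn.Module):
    """Simplified TiDE-style forward pass (our sketch, not the paper's code)."""

    def __init__(self, lookback: int, horizon: int, cov_dim: int,
                 proj_dim: int = 4, hidden: int = 256, decoder_dim: int = 8):
        super().__init__()
        self.horizon, self.decoder_dim = horizon, decoder_dim
        # feature projection: per-step dynamic covariates -> low-dim space
        self.project = ResidualBlock(cov_dim, hidden, proj_dim)
        # dense encoder/decoder over the flattened, concatenated inputs
        enc_in = lookback + (lookback + horizon) * proj_dim
        self.encoder = ResidualBlock(enc_in, hidden, hidden)
        self.decoder = ResidualBlock(hidden, hidden, horizon * decoder_dim)
        # temporal decoder: per-step decoded vector + projected covariates
        self.temporal = ResidualBlock(decoder_dim + proj_dim, hidden, 1)
        # global linear residual from the look-back to the horizon
        self.global_skip = nn.Linear(lookback, horizon)

    def forward(self, past: torch.Tensor, covs: torch.Tensor) -> torch.Tensor:
        # past: (batch, lookback); covs: (batch, lookback + horizon, cov_dim)
        proj = self.project(covs)                       # (batch, L+H, proj_dim)
        flat = torch.cat([past, proj.flatten(1)], dim=1)
        dec = self.decoder(self.encoder(flat))
        dec = dec.view(-1, self.horizon, self.decoder_dim)
        future_proj = proj[:, -self.horizon:, :]        # projected future covariates
        out = self.temporal(torch.cat([dec, future_proj], dim=-1)).squeeze(-1)
        return out + self.global_skip(past)             # add the global linear residual
```

For example, `TiDESketch(lookback=96, horizon=24, cov_dim=7)` would map a batch of 96-step histories plus their covariates to 24-step forecasts, with the global skip guaranteeing that a purely linear mapping stays representable.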

The model is trained using mini-batch gradient descent with Mean Squared Error (MSE) as the loss function. Evaluation is performed using rolling validation, where the model is assessed on all possible look-back and horizon pairs in the test set.
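A minimal training step consistent with that description might look as follows; the model here is a stand-in MLP, and all shapes and hyperparameters are illustrative rather than the paper's settings.

```python
import torch
import torch.nn as nn

lookback, horizon = 96, 24
# stand-in for a TiDE-style network mapping a look-back window to a horizon
model = nn.Sequential(nn.Linear(lookback, 256), nn.ReLU(), nn.Linear(256, horizon))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # MSE loss, as in the paper

for step in range(100):
    # synthetic mini-batch: 32 windows of past values and future targets;
    # real training would slice these windows from the training series
    past = torch.randn(32, lookback)
    future = torch.randn(32, horizon)
    optimizer.zero_grad()
    loss = loss_fn(model(past), future)
    loss.backward()
    optimizer.step()

# Rolling validation would then score the model on every valid
# (look-back, horizon) window pair slid across the test set.
```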

Q: What covariates are mentioned in the paper?

A: In the context of the paper, covariates are additional pieces of information that can be used to improve the accuracy of time-series forecasting. The paper distinguishes between two types of covariates, illustrated in the short sketch after the lists below:

Dynamic Covariates: These are variables that change over time and are known in advance for both the look-back (past) and horizon (future) periods. Examples include:

  • Time-Derived Features: Day of the week, hour of the day, holidays, etc., which are common to all time-series.
  • Time-Series Specific Features: Discounts on a particular product in demand forecasting, or weather conditions for a specific location in energy forecasting.

Static Attributes: These are time-independent features of a time-series. Examples include:

  • Product Features: Brand, category, size, etc., in retail demand forecasting.
  • Location Features: Population density, climate zone, etc., in energy forecasting.
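To make the distinction concrete, here is a hypothetical construction of time-derived dynamic covariates with pandas; the column names are our own, and holiday flags would come from a separate calendar table.

```python
import pandas as pd

# hourly timestamps for one week (illustrative)
idx = pd.date_range("2024-01-01", periods=24 * 7, freq="h")

# dynamic covariates: known in advance for both look-back and horizon
covariates = pd.DataFrame({
    "day_of_week": idx.dayofweek,  # 0 = Monday ... 6 = Sunday
    "hour_of_day": idx.hour,
}, index=idx)

# static attributes are constant per series, e.g. {"brand": "acme",
# "category": "beverages"}, and would be stored in a separate table.
```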

The paper emphasizes that incorporating covariates is crucial for accurate forecasting, as they can provide valuable information about the factors influencing the time-series. The TiDE model is designed to effectively handle both types of covariates, allowing it to leverage this additional information to improve its predictions.

Specifically, the model’s architecture includes a feature projection step to reduce the dimensionality of dynamic covariates and a temporal decoder that enables direct adaptation to future covariates at each time step. This design allows TiDE to capture both the long-term dependencies in the time-series data and the immediate impact of covariates on the predicted values.

Q: What are the key features and contributions of TiDE?

A: The key features and contributions of TiDE include:

  • A novel architecture based on Multi-layer Perceptrons (MLPs) for encoding and decoding time-series data and covariates.
  • Theoretical analysis demonstrating the near-optimal error rate of a simplified linear version of TiDE for linear dynamical systems.
  • Empirical evaluation on benchmark datasets showing superior or comparable performance to state-of-the-art Transformer-based models, while being significantly faster.
  • Effective incorporation of both static and dynamic covariates, which are crucial for accurate forecasting in real-world scenarios.
  • A unique temporal decoder that allows for direct adaptation to future covariates at each time step.

Q: How does TiDE compare to other models in terms of performance?

A: TiDE outperforms or matches the performance of existing methods on benchmark datasets, including Electricity, Traffic, Weather, and ETT. It is particularly effective on the largest dataset (Traffic), where it significantly outperforms the best Transformer-based model (PatchTST). In the M5 forecasting competition, TiDE, utilizing all covariates, surpasses DeepAR by as much as 20%.

Q: What are the advantages of TiDE in terms of efficiency?

A: TiDE is much more efficient than PatchTST in terms of both training and inference time, especially for long context lengths. This efficiency is due to the linear scaling of TiDE’s computational complexity with respect to context and horizon lengths, whereas PatchTST has quadratic scaling.
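A back-of-the-envelope way to see this, in our notation rather than the paper's: with look-back length L, horizon H, and a fixed hidden width d,

```latex
% TiDE: dense layers over a flattened input of length O(L + H),
% so the per-series cost grows linearly in context and horizon:
\mathrm{cost}_{\mathrm{TiDE}} = O\big((L + H)\,d\big)
% Self-attention over n tokens (n grows linearly with L, e.g.
% PatchTST's patches) is quadratic in the context length:
\mathrm{cost}_{\mathrm{attn}} = O\big(n^{2}\,d\big), \qquad n \propto L
```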

Q: What is the significance of the theoretical analysis presented in the paper?

A: The theoretical analysis provides insights into why simple linear models like TiDE can sometimes outperform more complex Transformer-based models in long-term forecasting. It proves that a linear version of TiDE can achieve near-optimal error rates for linear dynamical systems under certain assumptions.

Q: What are the potential future research directions related to TiDE?

A: Future research directions could include a more rigorous theoretical analysis of MLPs and Transformers for time-series data, exploring the use of pre-trained models for forecasting, and investigating the applicability of TiDE to other domains and tasks beyond time-series forecasting.

Conclusion

The paper concludes by highlighting the potential of simple MLP-based models like TiDE for long-term time-series forecasting. The authors suggest that self-attention mechanisms, which are prevalent in Transformer models, may not be necessary for capturing periodicity and trend patterns in this context. They also propose future research directions, such as a more rigorous theoretical analysis of MLPs and Transformers for time-series data and the exploration of pre-trained models for forecasting.
