Transformers for Time-Series Data

By Lars Ødegaard Bentsen (Data Scientist) at BearingPoint

--

Since its introduction in 2017, the Transformer has brought about a mini-revolution in the field of AI. By removing the need for recurrence, the Transformer was found to be able to learn longer contexts than traditional LSTM/RNN-based architectures.

Nevertheless, the Transformer did come with some limitations. Since self-attention attends over all pairs of input positions, the model's memory and compute scale quadratically with the sequence length L, i.e. O(L²). Other sequence learners, such as LSTMs, do not have this drawback, as their memory scales linearly with sequence length (O(L)). The quadratic memory growth of the Transformer can restrict the maximum input sequence length in practice: as a rough example, a sequence of 10,000 time steps means an attention matrix of about 10⁸ weights per head per layer (roughly 400 MB in 32-bit floats), before gradients are even considered.

Furthermore, since the original Transformer (and much of the related work) focuses on natural language and computer vision applications, the original architecture does not necessarily work well on time-series data out-of-the-box.

In this article, we aim to provide a brief overview of some interesting work that adapts the Transformer architecture to time-series data (and in particular forecasting) or that aims to reduce the model complexity to below O(L²). Since only a brief overview of the different architectures will be provided here, we suggest that the interested reader checks out the full papers as well. Happy reading, and remember, Attention Is All You Need!

Attention mechanism and Transformer

Attention mechanisms were first introduced for machine translation, where the attention operation enabled a recurrent encoder-decoder architecture to retain long-context information better than architectures without attention. The nice thing about attention mechanisms is that they’re fairly intuitive and easy to understand. As an example, we could visualise the learned attention weights for an image-captioning model, to see which parts of the image the model learns to focus on when producing a particular word.

Figure 1: Attention weight illustration for image captioning. Taken from: Xu, Kelvin, et al. “Show, attend and tell: Neural image caption generation with visual attention.” International conference on machine learning. PMLR, 2015.

From the image, we can see that the attention mechanism learns to focus on the correct parts of the image for different words (e.g. focusing on the central parts when producing the words bird and flying). Pretty cool!

Moving on to the Transformer architecture, the authors' idea was to design an architecture that relies entirely on the attention mechanism, rather than combining attention with conventional convolutional or recurrent architectures. The overall architecture is visualised in the image below, along with an illustration showing how the scaled dot-product attention works.

Figure 2: The Transformer Architecture from: Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).
Figure 3: Illustration of the scaled dot-product attention. This great illustration is taken from another blog post: https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a

We're not going to dig deep into the Transformer architecture here. The interested reader may however check out the references at the end of this post or the Illustrated Self-Attention article here: https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a
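To make the core operation concrete, here is a minimal single-head sketch of scaled dot-product attention in PyTorch. It is purely illustrative: the function name and tensor shapes are made up for this post, and it leaves out the multi-head projections, masking and dropout of the full architecture.

import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, L, L)
    weights = torch.softmax(scores, dim=-1)                   # each row sums to 1
    return weights @ v, weights                               # weighted sum of values

q = k = v = torch.randn(1, 16, 32)  # toy batch with one 16-step sequence
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)        # (1, 16, 32) and (1, 16, 16)

Note that the attention matrix attn is L×L, which is exactly the quadratic memory cost discussed above.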

Some Prominent Architectures

If we were to visualise the attention pattern for the vanilla (“basic/original” in deep learning lingo) Transformer, we would require an L×L matrix to store all the attention weights. This is illustrated in the top-left of the figure below and results in quadratic memory growth.

Figure 4: Illustration of different attention patterns.
Figure 5: Another visualisation that shows the difference between windowed (right) and full (left) attention for an NLP application with a sentence input.

The first few architectures that we will look at focus on alleviating the challenge of quadratic memory growth by introducing sparsity into the attention operation. This means that a given position no longer attends to all other sequence positions in an update (as shown by the three other attention patterns in the figure). We will now provide a summary of the key features of some prominent Transformer architectures designed for time series or long-sequence data:

Longformer (referring to figure 4)

  • This alteration of the Transformer was designed to be able to work on very long documents, which prohibits the use of full self-attention (due to memory constraints).
  • Dilated windowed attention: Each sequence position can only attend to nearby positions, possibly with gaps (dilation), as shown in the bottom left of the figure above. A toy mask-building sketch is given after this list.
  • To learn long-term context, the authors also introduced global attention, where certain sequence positions can be attended to by all other keys (see the full paper for details).
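As a toy illustration of the windowed and global patterns (and emphatically not the authors' implementation, which uses custom banded attention kernels precisely so that the full L×L mask is never materialised), one could build a boolean attention mask as follows; the function name and arguments are made up for this sketch:

import torch

def sliding_window_global_mask(seq_len, window, global_idx=(), dilation=1):
    # True = query position i may attend to key position j
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        for offset in range(-window, window + 1):
            j = i + offset * dilation          # dilation > 1 skips positions
            if 0 <= j < seq_len:
                mask[i, j] = True              # local (possibly dilated) window
    for g in global_idx:
        mask[g, :] = True                      # global position attends everywhere
        mask[:, g] = True                      # and is attended to by every position
    return mask

print(sliding_window_global_mask(seq_len=10, window=1, global_idx=[0], dilation=2).int())

With dilation greater than one, the local window skips positions, which extends the receptive field without increasing the number of allowed attention pairs.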

LogSparse Transformer (referring to figure 6)

  • Convolutional Attention: The full self-attention of the Transformer uses position-wise linear transforms to compute the key/query representations, which makes them insensitive to local context. By leveraging 1D convolutions, the key/query representations can instead carry local information, such as the mean/std of a local time interval, or pick up increasing/decreasing trends. This can be very useful for time-series applications with high volatility and more chaotic signals than natural language: one isolated time step typically provides less information than a single word does within a sentence, due to the noisy nature of time series signals. This is the motivation behind convolutional attention, which helps capture local context and relationships between neighbouring time steps. A rough sketch of convolutional query/key projections is given below the figure.
  • LogSparse Attention: Some of the memory constraints are alleviated by introducing sparsity into the attention pattern (as for the Longformer), here based on an exponential step schedule. The authors also introduce variants with a local attention range and restarts (as visualised in the figure below).
Figure 6: Illustration of full and LogSparse attention patterns on time series data. Taken from: Li, Shiyang, et al. “Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting.” Advances in neural information processing systems 32 (2019).
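A rough sketch of the convolutional query/key idea is given below. The module name and kernel size are illustrative choices for this post, and the left-only padding keeps the convolution causal so that no future time steps leak into the representations; setting kernel_size=1 recovers the usual position-wise linear projections.

import torch
import torch.nn as nn

class ConvQK(nn.Module):
    # Causal 1D-convolutional query/key projections (toy module for this post).
    def __init__(self, d_model, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1                        # pad on the left only
        self.q_conv = nn.Conv1d(d_model, d_model, kernel_size)
        self.k_conv = nn.Conv1d(d_model, d_model, kernel_size)

    def forward(self, x):                                 # x: (batch, seq_len, d_model)
        x = nn.functional.pad(x.transpose(1, 2), (self.pad, 0))  # causal padding
        q = self.q_conv(x).transpose(1, 2)                # queries see a local window
        k = self.k_conv(x).transpose(1, 2)                # keys see a local window
        return q, k

q, k = ConvQK(d_model=32, kernel_size=3)(torch.randn(1, 48, 32))
print(q.shape, k.shape)                                   # both (1, 48, 32)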

Informer (referring to figure 7)

  • Sparsity: Instead of having sparse attention patterns decided by (simple) fixed heuristics, the Informer introduces ProbSparse attention. ProbSparse attention locates the most dominant queries, based on the Kullback-Leibler (KL) divergence of their attention distribution from a uniform one, and only computes attention for these. In essence, this way of introducing sparsity should not overlook potentially important time steps, which might happen with fixed sparsity patterns. A toy sketch of the query selection follows this list.
  • Self-attention distilling: Highlights dominant attention by halving the cascading layer inputs through 1D convolution and max pooling. In short, the distilling makes the Informer much more efficient for very long sequences.
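A toy version of the query selection could look as follows. Note that this naive sketch still computes the full score matrix; the actual Informer approximates the max-minus-mean measurement by sampling a subset of keys, which is what brings the complexity down to roughly O(L log L). The function name and sizes are made up for illustration.

import math
import torch

def top_queries_by_sparsity(q, k, n_top):
    # Score each query by how far its attention distribution is from uniform
    # (max score minus mean score) and keep the n_top most dominant queries.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # (batch, L, L)
    sparsity = scores.max(dim=-1).values - scores.mean(dim=-1)  # (batch, L)
    return sparsity.topk(n_top, dim=-1).indices                 # dominant query indices

q = k = torch.randn(1, 96, 32)
print(top_queries_by_sparsity(q, k, n_top=12))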

Autoformer (referring to figures 7 and 8)

  • Auto-Correlation: The Autoformer replaces the original scaled dot-product attention mechanism with the Auto-Correlation module (!). This operation uses keys and queries to decide on the most important time-delay similarities through autocorrelation and time-delay aggregation. This, again, is expected to perform better for many time series applications, as it is specifically designed around time series similarity.
  • Signal decomposition: The Autoformer also fundamentally changes the vanilla Transformer architecture. We will not dive into the details here, but a key takeaway is the use of signal decomposition. Series decomposition is used to separate the input signal into trend and seasonal components (important!), and the Autoformer then performs its computations on the periodic (seasonal) signals to extrapolate into the future. A minimal decomposition sketch is given after the figures below.
  • Overall, the Autoformer presented a shift in Transformers for time-series analysis, as it directly changed the Transformer architecture to facilitate time-series data, based on established concepts within signal processing.
Figure 7: Attention patterns for (a) — Vanilla Transformer, (b) — any sparse attention, e.g. ProbSparse from Informer, (c) — LogSparse Transformer attention, (d) — Auto-Correlation from Autoformer. Taken from: Wu, Haixu, et al. “Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting.” Advances in Neural Information Processing Systems 34 (2021): 22419–22430.
Figure 8: The Autoformer Architecture. Taken from: Wu, Haixu, et al. “Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting.” Advances in Neural Information Processing Systems 34 (2021): 22419–22430.
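A minimal, moving-average version of such a series-decomposition block could look like the sketch below. The kernel size and the replicate padding at the series ends are illustrative choices rather than the paper's exact settings.

import torch
import torch.nn as nn

def series_decomp(x, kernel_size=25):
    # x: (batch, seq_len, channels); trend = moving average, seasonal = residual
    pad_front = (kernel_size - 1) // 2
    pad_back = kernel_size - 1 - pad_front
    front = x[:, :1, :].repeat(1, pad_front, 1)           # repeat the first value
    back = x[:, -1:, :].repeat(1, pad_back, 1)            # repeat the last value
    padded = torch.cat([front, x, back], dim=1).transpose(1, 2)
    trend = nn.functional.avg_pool1d(padded, kernel_size, stride=1).transpose(1, 2)
    return x - trend, trend                               # (seasonal, trend)

t = torch.linspace(0, 12, 200).reshape(1, 200, 1)
x = torch.sin(2 * t) + 0.1 * t                            # seasonality plus a slow trend
seasonal, trend = series_decomp(x, kernel_size=25)
print(seasonal.shape, trend.shape)                        # both (1, 200, 1)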

FFTransformer (referring to figure 9)

  • Decomposition and separate streams: The architecture uses wavelet decomposition to partition the original time series into trend and periodic components. The architecture then performs computations on the two types of signals separately.
  • Trend Stream: Computation on trend (blue) signals is similar to the original Transformer.
  • Periodic stream: A fast Fourier transform (FFT) is applied to the periodic (green) signals before an attention mechanism updates the signals in the frequency domain. A small frequency-domain sketch is given after the figure below.
  • Output: Latent representations from the two streams are combined and passed through a linear layer to produce the final outputs.
  • Sparsity: Although the FFTransformer does not introduce any new methods for reducing the memory complexity, the architecture allows for any attention operation to be used (e.g. ProbSparse, LogSparse, windowed attention).
Figure 9: Example of decomposing a signal (black) into periodic (green) and trend (blue) components. Taken from: Bentsen, Lars Ødegaard, et al. “Spatio-temporal wind speed forecasting using graph networks and novel Transformer architectures.” Applied Energy 333 (2023): 120565.
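To give a feel for the frequency-domain step of the periodic stream, the snippet below simply moves a toy periodic component into the frequency domain with a real FFT and back again. In the actual architecture the attention updates are applied to representations of these frequency components, and the decomposition itself is done with wavelets rather than the moving average used in the earlier sketch.

import torch

# Toy periodic component; in practice this would come from the wavelet decomposition.
seasonal = torch.sin(torch.linspace(0, 40, 256)).reshape(1, 256, 1)

freq = torch.fft.rfft(seasonal, dim=1)            # (1, 129, 1) complex coefficients
print(freq.shape, freq.dtype)                     # torch.complex64
recon = torch.fft.irfft(freq, n=256, dim=1)       # back to the time domain
print(torch.allclose(recon, seasonal, atol=1e-5)) # True: the round trip is lossless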

Key takeaways

You have now learned some key concepts related to the Transformer architecture and things that you might consider when aiming to use Transformers for time-series data (or maybe other applications?). As a few final remarks, we can summarise what we’ve learned:

Introducing sparsity into the attention mechanism can help alleviate memory constraints. Sparsity could be based either on fixed heuristics (LogSparse, Windowed, Global + Dilated Windowed Attention) or on locating the dominant attention pairs (ProbSparse Attention, Auto-Correlation).

Time series are often fundamentally different from text-based data and one can potentially see huge improvements by leveraging traditional concepts from signal theory (Don’t forget your good old maths/engineering concepts!). For example, by decomposing a signal into trend and periodic components, it can be easier for a model to learn overall characteristics and not be confused by high volatility/noisy signals.

Specialized attention operations that operate in the frequency domain (Autoformer, FFTransformer, FEDformer) can help boost performance. These, as well as other methods such as Convolutional Attention, can help provide local context to the model. Such alterations might be important for time-series signals, where individual recordings are not necessarily very informative in isolation.

Get to know your time series data and see if there are any traditional concepts (e.g. signal decomposition) that can be used by the model to potentially boost performance! Furthermore, it is always good to do a sanity check before applying any deep learning model to see if it makes sense for the model to learn the desired characteristics, e.g. convolutional attention (from LogSparse Transformer) can help with local context.

About the author

Lars Ødegaard Bentsen is a Consultant in the Data Science and AI team at BearingPoint Oslo. He recently handed in his PhD in DL applied to offshore wind energy, where he mainly focused on spatio-temporal time series forecasting. Email address: lars.bentsen@bearingpoint.com

Side Note/Disclaimer: You might also check out some of the references for additional architectures that we excluded from this blog post for brevity. The field is also moving quite rapidly, so do not use this post as an exhaustive list of architectures, but rather as a soft guide into Transformers for time series!

References

  • Original Attention Paper: Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014).
  • Image Captioning with Attention: Xu, Kelvin, et al. “Show, attend and tell: Neural image caption generation with visual attention.” International conference on machine learning. PMLR, 2015.
  • Transformer Paper: Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).
  • Longformer Paper: Beltagy, Iz, Matthew E. Peters, and Arman Cohan. “Longformer: The long-document transformer.” arXiv preprint arXiv:2004.05150 (2020).
  • LogSparse Paper: Li, Shiyang, et al. “Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting.” Advances in neural information processing systems 32 (2019).
  • Informer Paper: Zhou, Haoyi, et al. “Informer: Beyond efficient transformer for long sequence time-series forecasting.” Proceedings of the AAAI conference on artificial intelligence. Vol. 35. No. 12. 2021.
  • Autoformer Paper: Wu, Haixu, et al. “Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting.” Advances in Neural Information Processing Systems 34 (2021): 22419–22430.
  • FFTransformer Paper: Bentsen, Lars Ødegaard, et al. “Spatio-temporal wind speed forecasting using graph networks and novel Transformer architectures.” Applied Energy 333 (2023): 120565.
  • Reformer Paper: Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya. “Reformer: The efficient transformer.” arXiv preprint arXiv:2001.04451 (2020).
  • FedFormer Paper: Zhou, Tian, et al. “Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting.” International Conference on Machine Learning. PMLR, 2022.

