How We Used Transformers Neural Networks to Improve Time Series Predictions

Roman S · Exness Tech Blog · Jan 24, 2023

In this article, we discuss the use of transformer models for working with time series data. The material is based on a talk I gave at the Linq conference. I will share the results of our experiments with transformer models, compare them to gradient boosting techniques, and offer key findings and tips for using transformers to improve performance on time series tasks.

My name is Roman Smirnov, and I am a machine learning engineer at Exness. Prior to joining Exness, I worked extensively with natural language processing and audio processing, which are fields where transformer models are currently the state-of-the-art. I attempted to apply transformer-based models to sequences of data, specifically time series user data.

Tasks

To begin with, I’ll discuss the business case. Let’s consider solving some basic tasks, with a focus on predicting client value and identifying potential client churn.

Typically, the client distribution is imbalanced: a large share of clients churn, and client value follows a long-tailed distribution.

There are traditional methods, also known as baselines, for tackling these tasks. Generally, they involve gradient boosting on decision trees. We tried these methods and trained gradient boosting models on our data for both regression and classification tasks. However, the initial results were not satisfactory. For example, in the regression task, the model overpredicted value in half of the cases for clients whose real value over 90 days was zero.

We also attempted classification but encountered another problem. We built a churn-prediction classifier using gradient boosting on decision trees, and approximately 22% of high-value users were incorrectly predicted as zero-value clients or potential churns.

Our question was: can we use one model to predict both client churn and client value with a combined loss (MAE plus binary cross-entropy) while keeping the value predictions accurate in terms of mean absolute error (MAE)?

To achieve this, we decided to use neural networks based on transformer architecture and saw promising results.
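To make the combined objective concrete, here is a minimal PyTorch sketch of such a loss. The two-headed output and the weighting factor `alpha` are assumptions for illustration, not our exact production setup.

```python
import torch
import torch.nn as nn

class ValueChurnLoss(nn.Module):
    """Combined objective: MAE for client value plus BCE for churn (illustrative)."""
    def __init__(self, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha                      # relative weight of the two terms (assumed)
        self.mae = nn.L1Loss()                  # mean absolute error for the value head
        self.bce = nn.BCEWithLogitsLoss()       # binary cross-entropy for the churn head

    def forward(self, value_pred, churn_logit, value_true, churn_true):
        return self.alpha * self.mae(value_pred, value_true) + \
               (1 - self.alpha) * self.bce(churn_logit, churn_true)
```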

We used techniques from various fields such as natural language processing, computer vision, recommendation systems, and even audio processing to build the transformer model.

Gradient boosting is commonly used for structured data, while neural networks are considered state-of-the-art solutions for unstructured data like images, texts, or audio. But can time series data be unstructured? Yes, it can.

In some cases, time series data is highly structured, such as events like “a client bought ten items” or “client spent five minutes online” that always have the same features and different values. But in other cases, the data can be unstructured, like observing that a client bought one car and two dogs on Thursday and five cookies the next day. These different categories, features, and values led us to conclude that neural networks can be used for this task.

There are various neural network architectures that can handle recurrence and time series data, such as recurrent neural networks or convolutional neural networks. However, recent studies have shown that transformer-based architecture often outperforms both.

It’s important to keep in mind that neural networks are not a one-size-fits-all solution. You should consider using them when:

  1. You have a complex loss function.
  2. Your data is unstructured.
  3. You have tried gradient boosting on decision trees as a baseline and are unable to improve it further.

Preprocessing

When using gradient boosting, many of us do not pay much attention to data pre-processing. However, when working with neural networks, pre-processing your data is crucial.

Standardization and min-max normalization are traditional methods, but they only rescale the data. Another technique that is gaining popularity is the Gauss rank transformation. It is explained in more detail in the Nvidia article "Gauss Rank Transformation Is 100x Faster with RAPIDS and CuPy".

Gauss rank transformation changes the distribution of the data, which is beneficial for neural networks. By using Gauss rank transformation instead of normalization or standardization, we were able to improve our model’s L1 score (MAE).
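For illustration, here is a minimal NumPy/SciPy sketch of the rank-based transform. The Nvidia article implements it on the GPU with CuPy; the epsilon clipping here is an assumption to keep the inverse error function finite.

```python
import numpy as np
from scipy.special import erfinv

def gauss_rank_transform(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Map a 1-D feature to an approximately standard normal distribution."""
    ranks = np.argsort(np.argsort(x))                     # ranks 0 .. n-1
    scaled = ranks / (len(x) - 1)                         # rescale to [0, 1]
    scaled = np.clip(scaled * 2 - 1, -1 + eps, 1 - eps)   # map to (-1, 1), avoid infinities
    return erfinv(scaled) * np.sqrt(2)                    # normal quantile of the rank
```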

Architecture

Transformer architecture is a complex and large concept. There are two types of transformer models: encoder and decoder.

(There is also a third type, encoder + decoder model, which is beyond the scope of this discussion.)

Encoder models, like BERT, are widely used in natural language processing; decoder models, like GPT, are used there just as widely.

According to studies, decoder models require less data to perform well on time series data. They also require a higher learning rate during training compared to encoder models. However, if you have a lot of data, it’s better to choose an encoder model.

Both encoder and decoder models consist of multiple blocks, take a sequence as input, and produce output logits (predictions). With time series data, the whole series becomes the input sequence: if we have 90 days of history, that is at least 90 tokens.

The problem with the output of this model is that if the length of our input is 90, the length of the output is also 90. To solve our regression or classification task, such as client value prediction or churn prediction, we need to do something with this output.

There are several techniques, mostly from natural language processing, to handle this problem. We can take the first token from this sequence, the last token, or take the mean of all 90 output tokens. During our experiment, we found that taking the mean of all tokens works the best.
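A small sketch of the three pooling options, assuming the backbone returns hidden states of shape (batch, sequence length, hidden dimension):

```python
import torch

def pool_sequence(hidden: torch.Tensor, mode: str = "mean") -> torch.Tensor:
    """Reduce (batch, seq_len, dim) transformer output to one vector per client."""
    if mode == "first":            # CLS-style: take the first token
        return hidden[:, 0]
    if mode == "last":             # GPT-style: take the last token
        return hidden[:, -1]
    return hidden.mean(dim=1)      # mean over all 90 output tokens (worked best for us)
```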

The transformer decoder or encoder model is an architecture that includes repeating transformer blocks.

Since we have repeating blocks, we don’t have to take the output only from the last transformer block. We can take the output from multiple blocks, for example, the last two or four blocks. The limit is the number of blocks available.

By taking the output from all these blocks, we will get not just one sequence of 90 tokens, but multiple sequences of 90 tokens. Averaging or weighted averaging these outputs will result in a single sequence of 90 tokens, which allows us to tune our model more accurately and focus on different parts of the encoder or decoder model.

Research shows that the lower and higher layers of a transformer capture different kinds of features, from more local to more general ones. Therefore, taking outputs from several blocks can greatly improve the results.
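As a hedged sketch, a learnable weighted average over the hidden states of the last few blocks might look like this; the number of blocks and the softmax weighting are assumptions.

```python
import torch
import torch.nn as nn

class MultiBlockPooler(nn.Module):
    """Weighted average over the hidden states of the last `n_blocks` transformer blocks."""
    def __init__(self, n_blocks: int = 4):
        super().__init__()
        self.n_blocks = n_blocks
        self.weights = nn.Parameter(torch.zeros(n_blocks))   # learned mixing weights

    def forward(self, hidden_states):
        # hidden_states: tuple of (batch, seq_len, dim) tensors, one per block
        stacked = torch.stack(hidden_states[-self.n_blocks:], dim=0)
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1, 1)
        return (w * stacked).sum(dim=0)                       # (batch, seq_len, dim)
```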

Activations

Typically, each transformer block includes multiple layers connected by non-linear activation functions. ReLU and GeLU are commonly used in many transformer models. Google researchers later introduced a variant called SwiGLU and claimed it was more effective than the popular GeLU or ReLU. But based on our experiments with various tasks and types of data, we found that it did not make a significant difference in most cases.
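For reference, here is a compact SwiGLU-style feed-forward block in PyTorch; the bias-free linear layers and the hidden size are illustrative choices, not a prescription.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with a SwiGLU gate instead of ReLU/GeLU."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # Swish(x W_gate) elementwise-multiplied by x W_up, then projected back down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```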

Ways to encode time series data

Time series data often comes in the form of tables. However, for our task, we needed to prepare the data differently.

When working with time series data, we usually have both static and dynamic data.

An example of static data would be a client’s age, as it doesn’t change daily. Similarly, a client’s country of residence would be static data. This data can be both categorical and numerical.

Dynamic data, on the other hand, can also be divided into several groups. For example, a client’s browsing history is categorical data, while the number of pages visited is numerical data.

To prepare the data for the transformer to process, we need to create an input sequence with a length and dimensions that the model can handle. For example, a vanilla BERT-sized transformer uses a 768-dimensional vector for every token in the input sequence. This can be challenging, as our data is diverse and may not fit into these dimensions easily.

There are two ways to prepare the input data for the transformer, but one aspect is the same for both methods. Consider the "static" block as an example. Static data doesn't need to be repeated every day; it can simply be placed at the beginning of the input sequence, and thanks to the attention mechanism this works. For categorical features, we use an embedding layer to convert them into numerical vectors. Numerical features can also be projected to the required dimension using linear layers.

The resulting dimensions of the categorical and numerical features can be the same or different, depending on what you plan to do with them next. You can sum or concatenate them, as we did in our case, so that they fit the model's dimension of 768. If they don't fit, a linear transformation can be applied again to adjust the dimensions to the desired size.
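A hedged sketch of how such a static block could be assembled; the feature names, cardinalities, and intermediate dimensions are illustrative, not our production schema.

```python
import torch
import torch.nn as nn

class StaticBlock(nn.Module):
    """Turn static categorical + numerical client features into one 768-dim token."""
    def __init__(self, n_countries: int = 200, n_numeric: int = 5, model_dim: int = 768):
        super().__init__()
        self.country_emb = nn.Embedding(n_countries, 64)    # categorical -> embedding
        self.numeric_proj = nn.Linear(n_numeric, 64)        # numerical -> linear projection
        self.to_model_dim = nn.Linear(64 + 64, model_dim)   # concatenation -> model dimension

    def forward(self, country_id, numeric_feats):
        cat = self.country_emb(country_id)                  # (batch, 64)
        num = self.numeric_proj(numeric_feats)              # (batch, 64)
        token = self.to_model_dim(torch.cat([cat, num], dim=-1))
        return token.unsqueeze(1)                           # (batch, 1, 768): first element of the sequence
```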

As for dynamic data, we can follow a similar process as we did with static data. We fill the sequence with information in a “day-by-day” manner. We use embedding layers for categorical features and linear projections for numerical features. Then we put all the groups together into one element of the sequence, also known as a token.

Another approach is to organize the data in “group-by-group” format. For example, the first group can be browsing history and the second one can be purchase history. Each group is put into a separate element of the sequence. This will result in a longer input sequence.

It’s also important to add positional information to the transformer model to ensure that data in the first element of our sequence is processed differently compared to the data in the last element, even if the data is the same. If you have two groups, position encoding should be used twice, once for the first group and once for the second group.
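Here is a minimal sketch of adding positional information separately to two groups of dynamic tokens; learned positional embeddings are an assumption, and sinusoidal encodings would work the same way.

```python
import torch
import torch.nn as nn

class GroupPositionalEncoding(nn.Module):
    """Add day-position embeddings separately to each group of dynamic tokens."""
    def __init__(self, n_days: int = 90, model_dim: int = 768):
        super().__init__()
        self.pos_emb = nn.Embedding(n_days, model_dim)

    def forward(self, group_a: torch.Tensor, group_b: torch.Tensor) -> torch.Tensor:
        # group_a, group_b: (batch, n_days, model_dim), e.g. browsing and purchase history
        days = torch.arange(group_a.size(1), device=group_a.device)
        pos = self.pos_emb(days)                                    # (n_days, model_dim)
        # positions restart for the second group, then the groups are concatenated along time
        return torch.cat([group_a + pos, group_b + pos], dim=1)     # (batch, 2 * n_days, model_dim)
```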

In our case, we chose the group-by-group option as it gave better results.

Does size matter?

Our experiments have shown that the size of the transformer model does play a role in the performance. There are smaller transformer models with smaller dimensions and larger ones. It’s obvious that they require different amounts of time for training. The larger the model, the more hours it needs. However, we found that larger models perform better on outliers.

Pretrained hint

If you are familiar with transformer models, you may have heard of the concept of pretraining. Similar to natural language processing, we can train a model on a large dataset and then use that model to perform well on other data within the same domain. However, we took it even further.

A lot of the techniques that I discussed today are taken from natural language processing and specifically from Nvidia papers. For example, Nvidia suggests using transformer models to generate recommendations, and we applied this approach to work with time series data.

Nvidia suggests that we can use the architecture of popular transformer models like GPT, XLNet, or any other, to generate recommendations based on a large number of categorical and numerical features.

For our time series data, we even used the weights from a large transformer model that was pretrained on a huge dataset of text data and used them as the initial weights for fine-tuning our time series model. The surprising thing is that it worked quite well.
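As a hedged sketch of the idea with the Hugging Face transformers library: load a text-pretrained backbone and feed our own projected day tokens through inputs_embeds, bypassing the word-embedding layer. The choice of roberta-base and the tensor shapes are illustrative, not our exact setup.

```python
import torch
from transformers import RobertaModel

# Backbone pretrained on text; its transformer blocks are reused for time series tokens.
backbone = RobertaModel.from_pretrained("roberta-base")

# Our own pipeline produces one 768-dim token per day (batch, 90 days, 768),
# so we skip the tokenizer and word embeddings and pass embeddings directly.
day_tokens = torch.randn(8, 90, 768)            # placeholder for projected client features
outputs = backbone(inputs_embeds=day_tokens)
sequence_output = outputs.last_hidden_state     # (8, 90, 768), pooled afterwards as described above
```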

Key findings

We conducted a vast number of experiments using different combinations of architectures and pre-processing options. To summarize my talk, I would like to emphasize the key points discussed.

We compared the performance of neural networks to that of gradient boosting and found that using neural networks led to a significant reduction in mean absolute error for client value prediction. Specifically, our error dropped from 68 to 42 when using weights from the pretrained model RoBERTa.

To summarize, transformer models are a highly effective solution for working with time series data. We found that using neural networks, specifically transformer architectures, led to a significant reduction in mean absolute error for client value prediction. If you’re facing a similar task, we recommend giving transformer models a try as they have the potential to deliver impressive results. Thank you for reading.
