PatchTST for Time Series Forecasting: Original Results and My Single-Channel Experiments

Lalf · May 17, 2023 · 8 min read

Motivation

Transformers have become a popular choice for time series forecasting due to their ability to capture complex temporal relationships. However, recent studies have shown that simple deep neural network (DNN) models can outperform Transformer models such as FEDformer and Informer on certain time series forecasting tasks[1]. I previously wrote an article explaining the results of that experiment.

To address this issue, a new Transformer architecture called PatchTST was introduced in a recent paper, “A Time Series is Worth 64 Words: Long-term Forecasting with Transformers”[2]. The PatchTST model incorporates two key concepts for time series forecasting: channel independence and patching.

Channel independence splits a multichannel sequence into single-channel series before they are fed to the model, allowing greater flexibility and scalability in handling different types of data. Patching divides the input sequence into smaller segments, or patches, so that the model can focus on local patterns and correlations.

In the original PatchTST paper, the authors evaluated the effectiveness of the model on multichannel time series forecasting tasks where both the input and output data are multichannel. To demonstrate the superiority of PatchTST over other models, the authors compared its performance with that of DLinear, a simple DNN model that outperformed Transformer models in a previous study.

Inspired by this, I extended the application of PatchTST to a single-channel time series forecasting task with multichannel input data and single-channel output data. In this article, I will explore the PatchTST architecture and implementation of channel independence and patching. I will also discuss the results of my experiments and compare the performance of PatchTST with simple DNN models. My results demonstrate that PatchTST outperforms these models, further highlighting its potential for a wide range of time series forecasting applications.

Key concepts of PatchTST

There are two key concepts in the PatchTST model: “patching” and “channel independence”.

Patching

Patching is a technique to alleviate the computational burden of self-attention. Rather than attending to every position in the sequence, the input sequence is partitioned into smaller sub-sequences known as patches, and self-attention is then computed between the patches. This approach enables the model to handle longer sequences while avoiding memory constraints and facilitating quicker inference. Additionally, patching captures localized semantic information that is not available when individual time steps are used as point-wise input tokens.

Conceptual diagram of the entire PatchTST and patching
By setting patch length (P) = 4 and stride (S) = 2 with sequence length (L) = 8, four patches are generated, as shown in this figure (with the end padding used in the implementation below, the patch count is (L − P)/S + 1 + 1 = 4).
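
To make the figure concrete, here is a minimal sketch (my own addition, not the original code) that reproduces the patching step with torch; L = 8, P = 4, and S = 2 are the values from the figure, and the fourth patch comes from the replication padding at the end of the series.

# Toy illustration of patching (L=8, P=4, S=2, with end padding)
import torch

L, P, S = 8, 4, 2                                         # sequence length, patch length, stride
z = torch.arange(L, dtype=torch.float).reshape(1, 1, L)   # [bs x nvars x seq_len]

# replicate the last value S times, as padding_patch="end" does in PatchTST
z = torch.nn.ReplicationPad1d((0, S))(z)                  # [1 x 1 x 10]

# slide a window of length P with step S over the time dimension
patches = z.unfold(dimension=-1, size=P, step=S)          # [bs x nvars x patch_num x patch_len]
print(patches.shape)                                      # torch.Size([1, 1, 4, 4]) -> 4 patches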

Channel independence

Channel independence, on the other hand, is a property of the PatchTST model that allows different channels of the input to be processed independently. In conventional Transformer forecasters, the channels at each time step are mixed into a single token embedding, which limits the model’s ability to capture fine-grained information in each channel. In contrast, the PatchTST model passes each channel through the backbone separately (with shared weights), allowing it to better capture the unique features and patterns of each channel.
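
As a rough sketch of what this means in practice (my own illustration, not the actual PatchTST code), each channel can be folded into the batch dimension so that a single shared backbone processes every univariate series separately.

# Rough sketch of channel independence: fold channels into the batch dimension
import torch

bs, nvars, seq_len, d_model = 32, 7, 336, 16
x = torch.randn(bs, nvars, seq_len)                   # multichannel input

shared_backbone = torch.nn.Linear(seq_len, d_model)   # stand-in for the Transformer encoder

x = x.reshape(bs * nvars, seq_len)                    # each channel becomes its own sample
h = shared_backbone(x)                                # [bs * nvars x d_model]
h = h.reshape(bs, nvars, d_model)                     # restore the channel dimension
print(h.shape)                                        # torch.Size([32, 7, 16])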

Together, patching and channel independence make the PatchTST model a powerful tool for processing long sequences with multiple channels, such as long time series forecasting tasks. By dividing the input into patches and processing each channel independently, the model can efficiently capture complex patterns and relationships across the entire sequence.

Results in the original paper

In the original paper, the authors conducted an extensive experiment to evaluate the performance of the PatchTST model on several datasets. They experimented with four output sequence lengths and compared the results with those of DLinear.

The best results are in bold and the second best are underlined.

The results showed that the PatchTST model outperformed the DLinear model significantly in terms of accuracy. Additionally, PatchTST was found to be superior to other Transformer models as well. The authors observed that the PatchTST model’s ability to capture long-range dependencies and its channel independence property played a critical role in achieving better results. These findings suggest that PatchTST is a promising model for time series forecasting tasks, and its application can lead to significant improvements in prediction accuracy.

My experiments with a single output channel

In this part, I will discuss my own experiments with PatchTST. For this, I chose the ETT dataset[3] and focused on a single output channel, specifically the Oil Temperature column. In contrast to the original paper, where the output sequences were multichannel, I modified PatchTST to output a single-channel sequence. The modification was straightforward and only involved the final fully connected layer.

# Implementation of the original final layer
class Flatten_Head(torch.nn.Module):
    def __init__(self, n_vars, nf, target_window, head_dropout=0):
        super().__init__()
        self.n_vars = n_vars
        self.flatten = torch.nn.Flatten(start_dim=-2)
        self.linear = torch.nn.Linear(nf, target_window)
        self.dropout = torch.nn.Dropout(head_dropout)

    def forward(self, x):            # x: [bs x nvars x d_model x patch_num]
        x = self.flatten(x)          # x: [bs x nvars x d_model * patch_num]
        x = self.linear(x)           # x: [bs x nvars x target_window]
        x = self.dropout(x)
        return x

# Implementation for single channel output
class Flatten_Head_For_Single_Output(torch.nn.Module):
    def __init__(self, n_vars, nf, target_window, head_dropout=0):
        super().__init__()
        self.n_vars = n_vars
        self.flatten = torch.nn.Flatten(start_dim=-3)
        self.linear = torch.nn.Linear(nf * n_vars, target_window)
        self.dropout = torch.nn.Dropout(head_dropout)

    def forward(self, x):            # x: [bs x nvars x d_model x patch_num]
        x = self.flatten(x)          # x: [bs x nvars * d_model * patch_num]
        x = self.linear(x)           # x: [bs x target_window]
        x = self.dropout(x)
        return x
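
As a quick sanity check (my own addition, using arbitrary dummy sizes), the two heads can be compared on a random tensor shaped like the encoder output to confirm the shapes in the comments.

# Shape check for the two heads defined above (dummy sizes are my own choice)
import torch

bs, n_vars, d_model, patch_num, target_window = 32, 7, 16, 42, 96
x = torch.randn(bs, n_vars, d_model, patch_num)      # stand-in for the encoder output

multi_head = Flatten_Head(n_vars, d_model * patch_num, target_window)
single_head = Flatten_Head_For_Single_Output(n_vars, d_model * patch_num, target_window)

print(multi_head(x).shape)    # torch.Size([32, 7, 96]) -> one forecast per channel
print(single_head(x).shape)   # torch.Size([32, 96])    -> a single-channel forecast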

# Implementation of PatchTST
class PatchTST(torch.nn.Module):

    def __init__(self, c_in, context_window, target_window, patch_len, stride, max_seq_len=1024,
                 n_layers=3, d_model=16, n_heads=4, d_k=None, d_v=None,
                 d_ff=128, attn_dropout=0.0, dropout=0.3, key_padding_mask="auto",
                 padding_var=None, attn_mask=None, res_attention=True, pre_norm=False, store_attn=False,
                 head_dropout=0.0, padding_patch="end",
                 revin=True, affine=False, subtract_last=False,
                 verbose=False, target_idx=-1, **kwargs):
        super().__init__()

        self.revin = revin
        if revin:
            self.revin_layer = RevIN(c_in, affine=affine, subtract_last=subtract_last, target_idx=target_idx)

        self.patch_len = patch_len
        self.stride = stride
        self.padding_patch = padding_patch
        patch_num = int((context_window - patch_len) / stride + 1)

        if padding_patch == "end":
            self.padding_patch_layer = torch.nn.ReplicationPad1d((0, stride))
            patch_num += 1

        self.backbone = TSTiEncoder(c_in, patch_num=patch_num, patch_len=patch_len, max_seq_len=max_seq_len,
                                    n_layers=n_layers, d_model=d_model, n_heads=n_heads, d_k=d_k, d_v=d_v, d_ff=d_ff,
                                    attn_dropout=attn_dropout, dropout=dropout, key_padding_mask=key_padding_mask, padding_var=padding_var,
                                    attn_mask=attn_mask, res_attention=res_attention, pre_norm=pre_norm, store_attn=store_attn,
                                    verbose=verbose, **kwargs)

        self.head_nf = d_model * patch_num
        self.n_vars = c_in

        self.head = Flatten_Head_For_Single_Output(self.n_vars, self.head_nf, target_window, head_dropout=head_dropout)

    def forward(self, z):                  # z: [bs x seq_len x nvars]
        # instance norm
        if self.revin:
            z = self.revin_layer(z, "norm")
        z = z.permute(0, 2, 1)             # z: [bs x nvars x seq_len]

        # patching
        if self.padding_patch == "end":
            z = self.padding_patch_layer(z)
        z = z.unfold(dimension=-1, size=self.patch_len, step=self.stride)  # z: [bs x nvars x patch_num x patch_len]
        z = z.permute(0, 1, 3, 2)          # z: [bs x nvars x patch_len x patch_num]

        # model
        z = self.backbone(z)               # z: [bs x nvars x d_model x patch_num]
        z = self.head(z)                   # z: [bs x target_window]

        # denorm
        if self.revin:
            z = self.revin_layer(z, "denorm")
        return z
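
A usage sketch follows (my own illustration, not the article’s training code); it assumes that TSTiEncoder and a RevIN layer extended with the target_idx argument are available, for example adapted from the official PatchTST repository.

# Usage sketch (assumes TSTiEncoder and the extended RevIN are importable)
import torch

model = PatchTST(c_in=7, context_window=336, target_window=96,
                 patch_len=16, stride=8, target_idx=-1)
x = torch.randn(32, 336, 7)   # [bs x seq_len x nvars], e.g. the 7 ETT channels
y = model(x)                  # [bs x target_window]
print(y.shape)                # expected: torch.Size([32, 96])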

To evaluate the performance of the single channel PatchTST model, I compared it to both Linear and DLinear, which are implemented as follows.

# Linear model
class Linear(torch.nn.Module):
    def __init__(self, c_in, context_window, target_window):
        super().__init__()
        self.c_in = c_in
        self.context_window = context_window
        self.target_window = target_window

        self.flatten = torch.nn.Flatten(start_dim=-2)

        self.linear = torch.nn.Linear(c_in * context_window, target_window)

    def forward(self, x):                  # x: [bs x seq_len x nvars]
        x = self.flatten(x)                # x: [bs x seq_len * nvars]
        x = self.linear(x)                 # x: [bs x target_window]
        return x


class moving_avg(torch.nn.Module):
    """Moving average block to highlight the trend of a time series."""
    def __init__(self, kernel_size, stride):
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = torch.nn.AvgPool1d(kernel_size=kernel_size, stride=stride, padding=0)

    def forward(self, x):
        # padding on both ends of the time series
        front = x[:, 0:1, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        end = x[:, -1:, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        x = torch.cat([front, x, end], dim=1)
        x = self.avg(x.permute(0, 2, 1))
        x = x.permute(0, 2, 1)
        return x


class series_decomp(torch.nn.Module):
    """Decomposition into residual (seasonal) and moving-average (trend) parts."""
    def __init__(self, kernel_size):
        super().__init__()
        self.moving_avg = moving_avg(kernel_size, stride=1)

    def forward(self, x):
        moving_mean = self.moving_avg(x)
        res = x - moving_mean
        return res, moving_mean
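
As a small illustration (my own toy example, not from the article), the decomposition can be checked on a synthetic series with a seasonal component and a trend.

# Toy check of the series decomposition (synthetic data, my own example)
import math
import torch

t = torch.arange(96, dtype=torch.float)
x = (torch.sin(2 * math.pi * t / 24) + 0.01 * t).reshape(1, 96, 1)  # [bs x seq_len x nvars]

res, trend = series_decomp(kernel_size=25)(x)
print(res.shape, trend.shape)   # torch.Size([1, 96, 1]) torch.Size([1, 96, 1])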


# DLinear model
class DLinear(torch.nn.Module):
    def __init__(self, c_in, context_window, target_window):
        super().__init__()
        # decomposition kernel size
        kernel_size = 25
        self.decomposition = series_decomp(kernel_size)
        self.flatten_Seasonal = torch.nn.Flatten(start_dim=-2)
        self.flatten_Trend = torch.nn.Flatten(start_dim=-2)

        self.Linear_Seasonal = torch.nn.Linear(c_in * context_window, target_window)
        self.Linear_Trend = torch.nn.Linear(c_in * context_window, target_window)

    def forward(self, x):                                      # x: [bs x seq_len x nvars]
        seasonal_init, trend_init = self.decomposition(x)
        seasonal_init = self.flatten_Seasonal(seasonal_init)   # [bs x seq_len * nvars]
        trend_init = self.flatten_Trend(trend_init)            # [bs x seq_len * nvars]

        seasonal_output = self.Linear_Seasonal(seasonal_init)  # [bs x target_window]
        trend_output = self.Linear_Trend(trend_init)           # [bs x target_window]

        x = seasonal_output + trend_output
        return x

While I did not perform extensive hyperparameter tuning, I did search for a suitable learning rate for each model.
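
For reference, the learning-rate search can be sketched roughly as below (this is my own simplified illustration; the actual training script, data loaders, epoch counts, and candidate learning rates are not from the article).

# Rough sketch of the learning-rate search (loaders and candidates are assumptions)
import torch

def validate(model, loader, loss_fn=torch.nn.MSELoss()):
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            total += loss_fn(model(x), y).item() * len(x)
            n += len(x)
    return total / n

def search_lr(build_model, train_loader, val_loader, lrs=(1e-4, 3e-4, 1e-3), epochs=10):
    best_lr, best_mse = None, float("inf")
    loss_fn = torch.nn.MSELoss()
    for lr in lrs:
        model = build_model()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            model.train()
            for x, y in train_loader:
                optimizer.zero_grad()
                loss_fn(model(x), y).backward()
                optimizer.step()
        mse = validate(model, val_loader)
        if mse < best_mse:
            best_lr, best_mse = lr, mse
    return best_lr, best_mse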

The results of my experiments showed that even in the single output time series task, the single channel PatchTST model outperformed the Linear and DLinear models.

MSE for each model

In addition to comparing the performance of single channel PatchTST, Linear, and DLinear on the ETT dataset, I also analyzed the learning process of each model. Using the validation dataset, I plotted each model's validation MSE over the training epochs.

The plot shows that PatchTST is not only more accurate but also more stable during the learning process. This indicates that PatchTST is able to learn from the data more effectively and consistently than Linear and DLinear. Overall, these findings suggest that PatchTST is a powerful and reliable tool for time series forecasting tasks, even when applied to single channel sequences.
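
A minimal sketch of how such validation curves can be drawn is shown below (the history dictionary and function name are my own assumptions, not part of the original code).

# Sketch for plotting per-epoch validation MSE curves (names are assumptions)
import matplotlib.pyplot as plt

def plot_val_curves(history):
    # history: e.g. {"PatchTST": [...], "Linear": [...], "DLinear": [...]}
    for name, mses in history.items():
        plt.plot(range(1, len(mses) + 1), mses, label=name)
    plt.xlabel("Epoch")
    plt.ylabel("Validation MSE")
    plt.legend()
    plt.show()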

Conclusion

In conclusion, the PatchTST model offers a promising approach to time series forecasting tasks. By utilizing the concepts of patching and channel independence, it achieves better results than simple DNN models.

In this article, I reviewed the key concepts of PatchTST and its superiority in multichannel time series forecasting tasks. I also presented the results of the original paper and of my own experiments on the ETT dataset, demonstrating the effectiveness of PatchTST even in single-output time series tasks.

Overall, PatchTST has the potential to revolutionize time series forecasting and should be considered as an alternative to traditional DNN models and other transformer-based models. Further research and experimentation can continue to explore its capabilities and applications in real-world scenarios.

The entire code used in these experiments is shown below, and I encourage everyone to try it.

References

[1] A. Zeng, M. Chen, L. Zhang, and Q. Xu, “Are Transformers Effective for Time Series Forecasting?,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2023.

[2] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam, “A Time Series is Worth 64 Words: Long-term Forecasting with Transformers,” in Proceedings of the International Conference on Learning Representations, 2023.

[3] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting,” in Proceedings of the 35th AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, vol. 35, no. 12, pp. 11106–11115, 2021.
