How long dependencies can LSTM & T-CNN really remember?

Dwipam Katariya
Published in Analytics Vidhya · Nov 10, 2021 · 5 min read

Disclaimer: This article assumes that readers have a basic understanding of the intuition and architecture of LSTM and CNN neural networks.

LSTMs are widely used for sequential modeling tasks such as language modeling and time series forecasting. Such tasks usually contain both long-term and short-term patterns, so it’s important to learn both kinds of pattern for accurate predictions and estimations. Transformer-based techniques are on the rise and model long-term dependencies far better than LSTMs; however, transformers cannot be used for every application due to their data-hungry training requirements and deployment complexity. In this post, I compare LSTM and T-CNN in terms of how well they learn long-timescale information.

Let’s start….

Technical TLDR

LSTM is a Long Short-Term Memory neural network widely used to learn sequential data (NLP, time series forecasting, etc.). A plain Recurrent Neural Network (RNN) suffers from the vanishing gradient problem, which prevents it from learning long-timescale dependencies; LSTM reduces this problem by introducing a forget gate, an input gate, and an output gate. With these gates, it maintains a cell state that represents long-term memory, while the hidden state represents short-term memory. Sadly, LSTM is still not a perfect solution for retaining long-term information, as the forget gate tends to remove such patterns from prior steps (information decay): if a pattern hasn’t been important for 50 steps, why keep it?

Vanilla LSTM. Image by author
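
To make the gate mechanics concrete, here is a minimal NumPy sketch of a single LSTM step. It is illustrative only; the weight layout and shapes are my own convention, not code from the original post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the weights of the forget (f),
    input (i), output (o) and candidate (g) gates along the first axis."""
    z = W @ x_t + U @ h_prev + b              # pre-activations, shape (4*hidden,)
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g                  # cell state: long-term memory
    h_t = o * np.tanh(c_t)                    # hidden state: short-term memory
    return h_t, c_t

# Tiny usage example with random weights
hidden, input_dim = 4, 3
rng = np.random.default_rng(0)
W, U = rng.normal(size=(4 * hidden, input_dim)), rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
h, c = lstm_step(rng.normal(size=input_dim), np.zeros(hidden), np.zeros(hidden), W, U, b)
```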

Power Law Forget Gated LSTM (pLSTM) was recently developed by researchers at Intel and Johns Hopkins. Despite the improvements LSTM brings, its forgetting mechanism exhibits exponential decay of information, limiting its capacity to capture long-timescale information. You can read the paper for a detailed explanation. In summary, since information decay follows an exponential pattern in LSTM, pLSTM adds a decay factor p to the forget gate, which allows the network to control the information decay rate and better learn long-term dependencies.

pLSTM. Image src
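
To see why this matters, compare the two decay shapes directly. The toy snippet below only contrasts an exponential curve with a power-law curve; it is not the pLSTM gate formulation itself, and the values of f and p are arbitrary.

```python
import numpy as np

T = 200
steps = np.arange(1, T + 1)

# Standard LSTM: a (roughly constant) forget gate f < 1 is applied at every
# step, so the surviving fraction of old information decays exponentially.
f = 0.95
exp_decay = f ** steps

# Power-law decay: the shape behind pLSTM's decay factor p
# (illustrative curve only, not the paper's exact gate).
p = 0.5
power_decay = (1.0 + steps) ** (-p)

# After 200 steps the exponential curve has essentially vanished,
# while the power-law curve still retains a noticeable fraction.
print(f"exponential: {exp_decay[-1]:.2e}, power-law: {power_decay[-1]:.2e}")
```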

Temporal CNNs (T-CNN) are simple 1D convolutional networks applied to time-series data rather than image data. These layers are known to capture both global and local temporal patterns in the data. Convolution layers also improve model latency, since predictions can be computed in parallel rather than sequentially. The convolutions can be made causal, meaning each prediction depends only on previous time steps, so there is no leakage from the future. TCNs achieve very long effective history sizes by combining deep networks with dilated convolutions. There are several variations of T-CNN, such as attention-based CNNs, combinations of LSTM with CNNs, and fusions with other architectures; however, in this post I stick to the vanilla T-CNN to keep it simple for my readers, since the variations of T-CNN could fill a separate blog post. You can read the paper for a detailed explanation.

Causal vs Standard Convolutions that form the basis of T-CNN. Image by author
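
A quick way to see the difference in code (a sketch assuming Keras, not the author's original implementation): a standard convolution pads symmetrically and mixes past and future time steps, while a causal convolution only looks backward, and dilation widens its receptive field.

```python
import tensorflow as tf

x = tf.random.normal((1, 20, 1))  # (batch, time, channels)

# Standard ("same"-padded) convolution: each output mixes past and future steps.
standard = tf.keras.layers.Conv1D(8, kernel_size=3, padding="same")

# Causal convolution: the output at time t only sees steps <= t, so nothing
# leaks from the future; dilation widens the receptive field.
causal = tf.keras.layers.Conv1D(8, kernel_size=3, padding="causal", dilation_rate=2)

print(standard(x).shape, causal(x).shape)  # both (1, 20, 8)
```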

Synthetic data

Let’s start with a simple additive function:

y = f(x⁰) + f(xⁿ)

One can design their own function, but for now, let’s go with this one. Our hypothesis is that the longer the sequence, the harder it should be for LSTM to remember the value of x⁰: for example, a sequence of length 2 adds the 1st and 2nd elements, while a sequence of length 100 adds the 1st (0th index) and 100th (or -1st index) elements. The longer the sequence, the more steps in between across which LSTM needs to carry the information.

Data Generator by author
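
The original data-generator gist is embedded in the Medium post; a minimal sketch of such a generator (sample counts, value ranges, and the identity form of f are my assumptions) could look like this:

```python
import numpy as np

def generate_data(n_samples=10_000, seq_len=69, low=0.0, high=1.0, seed=42):
    """Toy generator for y = x[0] + x[-1] over random sequences.
    (Illustrative sketch; sizes and ranges are assumptions.)"""
    rng = np.random.default_rng(seed)
    X = rng.uniform(low, high, size=(n_samples, seq_len, 1))
    y = X[:, 0, 0] + X[:, -1, 0]   # target depends only on the first and last step
    return X.astype("float32"), y.astype("float32")

X_train, y_train = generate_data()
print(X_train.shape, y_train.shape)  # (10000, 69, 1) (10000,)
```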

LSTM Model Architecture

I created a vanilla LSTM architecture and experimented with hyper-parameters as well as stacked LSTM layers to validate our hypothesis.

Vanilla LSTM architecture and training by author
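
The author’s exact architecture and training gist lives in the linked code; as a rough, hedged Keras sketch (units, epochs, and batch size are my guesses, not the post’s settings):

```python
import tensorflow as tf

def build_lstm(seq_len, units=64):
    """Minimal vanilla LSTM regressor (a sketch, not the exact setup used in the post)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(seq_len, 1)),
        tf.keras.layers.LSTM(units),   # stack additional LSTM layers here to experiment
        tf.keras.layers.Dense(1),      # regression head for y
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

lstm_model = build_lstm(seq_len=69)
# Assuming X_train, y_train come from the data-generator sketch above:
# lstm_model.fit(X_train, y_train, epochs=20, batch_size=128, validation_split=0.1)
```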

CNN Model Architecture

I created a vanilla T-CNN architecture and experimented with hyper-parameters such as kernel_size, filters, and the number of convolutions to validate our hypothesis.

Vanilla T-CNN architecture and training by author
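
Again, the real gist is in the linked code; a hedged Keras sketch of a vanilla T-CNN (filters, kernel size, dilation rates, and depth are assumptions) might look like:

```python
import tensorflow as tf

def build_tcnn(seq_len, filters=32, kernel_size=8):
    """Minimal causal, dilated Conv1D regressor (a sketch, not the post's exact setup).
    Choose kernel_size/dilations so the receptive field covers the whole sequence."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(seq_len, 1)),
        tf.keras.layers.Conv1D(filters, kernel_size, padding="causal",
                               dilation_rate=1, activation="relu"),
        tf.keras.layers.Conv1D(filters, kernel_size, padding="causal",
                               dilation_rate=2, activation="relu"),
        tf.keras.layers.Conv1D(filters, kernel_size, padding="causal",
                               dilation_rate=4, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

tcnn_model = build_tcnn(seq_len=69)
```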

Model Performance Results

MSE is a great measure for evaluating non-skewed data; however, it is not very straightforward to interpret (except perhaps for mathematicians!). To simplify, we can look at the % change in Y when one of the Xs is changed. Since the synthetic function is additive, changing one of the Xs by N percent should produce roughly an N/2 percent change in Y (each input contributes about half of Y). Second, we can look at blinding: when we set either of the Xs to 0, we can measure, in terms of MSE, whether the predicted Y equals the remaining non-zero X. Theoretically, the MSE should be 0.0.

metrics by author
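
The metrics gist from the post is not reproduced here; the two checks described above could be sketched with hypothetical helpers like these (function names and details are mine, and they assume a trained Keras model and data shaped as in the generator sketch):

```python
import numpy as np

def pct_change_in_y(model, X, pct=0.5, index=0):
    """Average % change in the prediction when X[:, index] is perturbed by `pct`.
    (Sketch of the perturbation check described above.)"""
    base = model.predict(X, verbose=0).ravel()
    X_perturbed = X.copy()
    X_perturbed[:, index, 0] *= (1.0 + pct)
    perturbed = model.predict(X_perturbed, verbose=0).ravel()
    return np.mean(np.abs(perturbed - base) / np.abs(base)) * 100

def blinding_mse(model, X, index=0):
    """MSE between predictions with X[:, index] zeroed out and the remaining
    non-zero input; ideally close to 0 if the model uses both ends of the series."""
    X_blind = X.copy()
    X_blind[:, index, 0] = 0.0
    preds = model.predict(X_blind, verbose=0).ravel()
    target = X[:, -1, 0] if index == 0 else X[:, 0, 0]
    return np.mean((preds - target) ** 2)
```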

Results and Conclusion

I applied a 50% change to X[0] and X[-1] independently to measure the change in Y and compute the 0th-change and -1st-change columns. Up to a sequence length of 68, LSTM was able to remember information from the start of the series, but it forgets at sequence length 69, while the CNN was still able to make accurate predictions simply because I increased the kernel size so that it has a chance to look at both the initial and final values of the series. The blinding results tell the same story: when the 0th value is set to 0 for sequence length 69, the MSE jumps to 0.251%. Overall, LSTM also seems more sensitive to the -1st change than to the 0th change. [These results would change with different data sizes and functions, but the message is clear: “LSTMs are not really Long!”]

Results by author

These results are in line with what the pLSTM authors describe in their information decay section.

Information Decay by pLSTM authors

In this article, I haven’t compared against pLSTM, since there is no open-source implementation available yet; however, I will follow up in the next blog post with my own pLSTM implementation.

Here, you can find the link to the complete code.

Cite at:

@article{dkatariya2021LSTM,
  title={How long dependencies can LSTM & T-CNN really remember?},
  author={Katariya, Dwipam},
  journal={Medium, Analytics Vidhya},
  volume={1},
  year={2021}
}
