Exploiting hidden vectors in long short-term memory (LSTM) networks for stock trading prediction

Haris (Chariton) Chalvatzis
Published in Analytics Vidhya · Aug 12, 2020

In this article we explain the second of the three novel contributions behind the paper High-performance stock index trading using neural networks and trees, which achieves very strong results when applied to major stock market indices such as the S&P 500, Dow Jones Industrial Average, Russell 2000 and NASDAQ. The first novel contribution, a trading strategy that utilizes the distribution of predicted returns, was covered here.

Here we introduce a variant of the long short-term memory (LSTM) architecture designed to extract as much information as possible from the input data and provide more accurate predictions. This variation is quick to implement and execute, and it has been shown to produce more accurate predictions (i.e. lower error) than more complex methods. The explanations below are rather high level, so some familiarity with LSTMs will help the reader follow what is going on. For the exact details we refer the curious reader to the paper. Let’s dive right into it!

A. Brief introduction to LSTMs

Long short-term memory is a specific type of neural network, namely a recurrent neural network (RNN), created to deal with sequence input. What makes sequence data interesting is the presence of the time dimension. Hence, RNNs (and consequently LSTMs) are able to incorporate time, which makes them suitable for time-series problems. LSTMs, first introduced by Sepp Hochreiter and Jürgen Schmidhuber (you can find their paper here), were (and to some extent still are) the driving force behind many great innovations and breakthroughs in Natural Language Processing (NLP), particularly machine translation (for example, have a look at Google’s neural machine translation system). A great introduction is given in Colah’s blog, which is worth checking out.

a. What makes LSTMs special?

In a nutshell, it is all about how the network processes data as they become available, having the “ability” to “remember” and “forget”. These abilities are implemented through specific mathematical transformations that maintain the hidden (h) and cell (c) states, which together make up the LSTM cell. They allow the network to retain information from previous time-steps and combine it with the current one. In other words, as data are fed into the network, some information is discarded while the information that is kept is combined with the new information entering the network.
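To make the “remember” and “forget” mechanics concrete, below is a minimal sketch of one step of a standard LSTM cell in PyTorch. This is the textbook gate arithmetic, not code from the paper, and the weight names (W_x, W_h, b) are purely illustrative.

```python
import torch

def lstm_cell_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One step of a standard LSTM cell.

    W_x: (input_size, 4*hidden_size), W_h: (hidden_size, 4*hidden_size),
    b: (4*hidden_size,). The four blocks correspond to the input (i),
    forget (f), candidate (g) and output (o) gates.
    """
    gates = x_t @ W_x + h_prev @ W_h + b
    i, f, g, o = gates.chunk(4, dim=-1)

    i = torch.sigmoid(i)       # how much new information to write
    f = torch.sigmoid(f)       # how much of the old memory to keep ("forget" gate)
    g = torch.tanh(g)          # candidate new information
    o = torch.sigmoid(o)       # how much of the memory to expose as output

    c_t = f * c_prev + i * g   # discard some old memory, add some new
    h_t = o * torch.tanh(c_t)  # hidden state: the cell's output at this step
    return h_t, c_t
```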

B. Proposed Architecture

At its core, the basic LSTM cell (whose mathematical description can be found, for example, here) consists of various (mainly) nonlinear transformations involving

  • its time-varying hidden state, h_t,
  • an internal memory state, c_t, and
  • a batch of input vectors, x_t,

which the cell receives at each time t = 0, 1, 2, …. The input x_t, together with the previous hidden state h_(t−1) and other internal cell data, produces the next instances of the hidden and memory states via both feed-forward and recurrent connections. So, at each time step, we have the following outputs:

  • h_t, which serves both as an output of the LSTM cell and input to the LSTM cell for the next time-step, as we have already mentioned, and
  • c_t, which is the input for the LSTM cell for the next time-step.

The output h_t can either be the final output of the network or, in the case of multiple layers, be fed to another LSTM cell (i.e. taking the place of x_t). Finally, the last hidden vector is (usually) fed through a linear fully connected layer to produce the desired predictions. In the case of sequence-to-sequence training, the output is a vector of values, one for each time step. Figure 1 below depicts all of the above and hopefully clarifies the process. Everything mentioned up to this point is standard: the common practice is to use the last hidden vector to produce the predictions.

Fig 1. LSTM architecture, showing how the inputs [x_t, h_t, c_t] are combined to produce the outputs [h_(t+1), c_(t+1)], while using only the last hidden state to produce the output sequence.
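To make the standard setup concrete, here is a minimal PyTorch sketch of a “last hidden state” model. The class name, layer sizes and single-value output are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class LastStateLSTM(nn.Module):
    """Standard setup: only the final hidden state feeds the prediction layer."""

    def __init__(self, n_features, hidden_size=64, n_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, n_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)  # single-value prediction

    def forward(self, x):                      # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)                  # out: (batch, seq_len, hidden_size)
        last_hidden = out[:, -1, :]            # keep only the last time step
        return self.head(last_hidden)          # (batch, 1)
```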

The question we asked is: why use only the last hidden vector and not all, or at least some, of them?

Sure, the hidden and cell states “remember” and “forget” information as it flows through the network; however, what information are we missing by not considering the intermediate hidden vectors? Using only the last state essentially implies an expectation that all “useful” information present in the sequence will be encoded into that last hidden state.

b. Utilizing the entire sequence of hidden states

Instead, we proposed using all hidden states to produce the desired predictions. The entire sequence of hidden states is stacked into one matrix, which is then fed to the fully connected layer; see Figure 2 below. Utilizing the entire sequence allows us to look for information in the time-evolution of the hidden state, which has proven to be beneficial. By doing so, we once again let the network decide which information is most relevant for producing our predictions. As an example, when applying this network to the S&P 500 over a period of ten years (2010-2019), the mean absolute percentage error of the predictions was close to 0.85% (as a side note, in a future article I will explain why error metrics should not monopolize our focus).

Fig 2. The LSTM network using all hidden vectors. The hidden vectors are reshaped and fed through a linear dense layer which produces the sequence predictions.
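Below is a minimal sketch of the variant, again in PyTorch with illustrative names and sizes (the paper contains the exact architecture and hyper-parameters): the full sequence of hidden states is flattened and passed through one dense layer that produces a prediction per time step.

```python
import torch
import torch.nn as nn

class AllStatesLSTM(nn.Module):
    """Variant: the whole sequence of hidden states feeds the prediction layer."""

    def __init__(self, n_features, seq_len, hidden_size=64, n_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, n_layers, batch_first=True)
        # The dense layer sees every hidden vector, flattened into one long vector,
        # and produces one prediction per time step (sequence-to-sequence output).
        self.head = nn.Linear(seq_len * hidden_size, seq_len)

    def forward(self, x):                   # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)               # out: (batch, seq_len, hidden_size)
        stacked = out.flatten(start_dim=1)  # (batch, seq_len * hidden_size)
        return self.head(stacked)           # (batch, seq_len)
```

Compared with the last-state sketch above, the only change is what the dense layer sees: its weights now decide how much each intermediate hidden vector contributes to each prediction, which is exactly the “let the network decide” behaviour described above.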

One of the benefits of this approach is that it needs far less input data to train the network. This is still part of future research; however, we are able to train our network using very small batch sizes. In particular, when compared with methods that do heavy pre-processing of the data, it was able to match or even beat their predictions. This result also highlights the importance of choosing the right architecture for the network, apart from tuning each hyper-parameter.

C. Conclusion

We presented a variation of the LSTM network that utilizes all hidden vectors to produce the desired predictions. This variation proved quite effective when tested against other, more complex approaches, providing the same or even better predictions as measured by standard error metrics (such as mean absolute error). Moreover, although this remains to be rigorously demonstrated, it appears to require far less data to train, a crucial benefit when working with stock-level data.

Future research includes looking into a quantitative way of determining how many hidden states should be used. Using all of them provides strong results, but perhaps a subset might yield even better ones!

Haris (Chariton) Chalvatzis is a research scientist interested in the intersection of equity investing and machine learning.