Understanding the flow of information through an LSTM cell
1. Introduction
Several posts discussing the architecture and benefits of LSTM networks are available, yet understanding the processing and flow of information through an LSTM network remains a complex task. This post discusses the flow of information through the LSTM network, with particular focus on the size of the vectors that flow through it.
This article covers the following:
- LSTM architecture
- Flow of information through the memory cell
2. LSTM
Long Short-Term Memory (LSTM) networks are a class of recurrent neural networks (RNNs) capable of learning long-distance temporal dependencies. They are well suited to problems involving time-series data such as stock market prediction and speech decoding, and more recently to the study of biological sequences such as DNA or protein sequences. The use of LSTM networks for modelling protein sequences is shown in Ranjan et al. (2019), cited below.
An LSTM is a long chain of repeating units called memory cells. A small buffer element called the cell state, C_t, is the key to the network and passes through each memory cell in the chain. As the cell state passes through a memory cell, its content can be modified by two special logics, the forget logic and the input logic. This modification depends on X, the concatenation of the previous hidden state h_(t-1) and the current input x_t:
X = concatenation[h_(t-1), x_t]
To understand the processing and flow of information through the LSTM cell, assume that the dimension of each input step x_t is 32, while that of the hidden state h_(t-1) is 70, as shown in Figure 1.


It is interesting to note that the size of the hidden state vector h_t and the cell state vector C_t at any time step t depends on a hyper-parameter, the "number of neurons" in each layer of the different gates. These layers are discussed next.
The concatenation of h_(t-1) and x_t generates a vector X of size 102 (70 + 32). This is the size of the vector fed to all the gates. Note that these values are chosen for demonstration purposes only.
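To make these sizes concrete, here is a minimal NumPy sketch of the concatenation step; the dimensions 70 and 32 are just the example values above, not anything prescribed by LSTMs:

```python
import numpy as np

# Example dimensions used throughout this post (for demonstration only).
hidden_size = 70   # size of h_(t-1) and C_(t-1)
input_size = 32    # size of each input step x_t

h_prev = np.zeros(hidden_size)   # previous hidden state h_(t-1)
x_t = np.zeros(input_size)       # current input x_t

# X = concatenation[h_(t-1), x_t]
X = np.concatenate([h_prev, x_t])
print(X.shape)  # (102,)
```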
2.1. Forget Gate
A forget gate is used to remove irrelevant information from the cell state C_(t-1) as the new input x_t is encountered at the t-th time step. The gate is composed of a single neural network layer with a sigmoid activation function, which acts as a filter and produces a value in the range [0, 1] for each element of the cell state. The "number of neurons" in the layer is equal to 70, the dimension of the hidden and cell state vectors at each time step.

This value represents the fraction of each element (i.e., how much of the information) to let through.
Mathematically, it is described as:
f_t = sigmoid(W_f * X + b_f)
Here, W_f denotes the weight matrix (70 x 102) and b_f the bias vector (70 x 1), with the subscript f indicating the forget gate.
The output from the forget gate is a vector of size 70.
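As an illustration only, a NumPy sketch of the forget gate with the assumed dimensions and random, untrained weights might look like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

X = rng.standard_normal(102)          # concatenation of h_(t-1) (70) and x_t (32)
W_f = rng.standard_normal((70, 102))  # forget gate weight matrix
b_f = np.zeros(70)                    # forget gate bias

f_t = sigmoid(W_f @ X + b_f)          # values in [0, 1], one per cell-state element
print(f_t.shape)  # (70,)
```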
2.2. Input Gate
The input logic adds new information to the cell state. It is composed of two parallel neural network layers with sigmoid and tanh activations, respectively, as shown in Figure 3. Note that the "number of neurons" in both layers is equal to 70. The sigmoid layer acts as a filter, while the tanh layer creates a vector of new candidate values, C~_t.

Mathematically, it is described as:
i_t = sigmoid(W_i * X + b_i)
C~_t = tanh(W_c * X + b_c)
where W_i and W_c denote weight matrices of size (70 x 102), and b_i and b_c denote bias vectors of size (70 x 1). The notation is similar to that of the forget gate.
The final output of the input gate is the element-wise multiplication of i_t and C~_t, which is again a vector of size 70.
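Again as a sketch with random, untrained weights, the two input-gate layers and their element-wise product can be written as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

X = rng.standard_normal(102)                      # concatenated input, size 102
W_i, b_i = rng.standard_normal((70, 102)), np.zeros(70)  # sigmoid layer parameters
W_c, b_c = rng.standard_normal((70, 102)), np.zeros(70)  # tanh layer parameters

i_t = sigmoid(W_i @ X + b_i)       # filter: how much of each new value to admit
C_tilde = np.tanh(W_c @ X + b_c)   # candidate values C~_t
new_info = i_t * C_tilde           # element-wise product
print(new_info.shape)  # (70,)
```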
2.3. Update Cell State
The following equation updates the cell state C_t by removing the information filtered out by the forget logic and adding the new information from the input logic. It is given as:
C_t = f_t * C_(t-1) + i_t * C~_t
The updated cell state C_t is a vector of size 70.
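A small NumPy sketch of this update, with placeholder random values standing in for the gate outputs of the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(2)
C_prev = rng.standard_normal(70)            # previous cell state C_(t-1)
f_t = rng.random(70)                        # forget gate output, values in [0, 1]
i_t = rng.random(70)                        # input gate filter, values in [0, 1]
C_tilde = np.tanh(rng.standard_normal(70))  # candidate cell state C~_t

# C_t = f_t * C_(t-1) + i_t * C~_t
C_t = f_t * C_prev + i_t * C_tilde
print(C_t.shape)  # (70,)
```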
2.4. Output Gate
The hidden state h_t of each memory cell is computed from the updated cell state C_t and the output vector o_t. As in the layers of the forget and input gates, the "number of neurons" here is also fixed to 70.

The output logic is composed of a single neural network layer with a sigmoid activation, as shown in Figure 4. The size of the output vector o_t is 70. It is described as:
o_t = sigmoid(W_o * X + b_o)
Here, W_o and b_o denote the weight matrix (70 x 102) and bias vector (70 x 1), respectively, with the subscript o indicating the output gate.
Finally, element-wise multiplication between the output o_t and the tanh of the cell state C_t gives the hidden state vector h_t:
h_t = o_t * tanh(C_t)
The hidden state h_t is a vector of size 70.
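A sketch of the output gate and hidden state computation, once more with random, untrained weights and the example dimensions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
X = rng.standard_normal(102)          # concatenated input, size 102
W_o, b_o = rng.standard_normal((70, 102)), np.zeros(70)  # output gate parameters
C_t = rng.standard_normal(70)         # updated cell state from the previous step

o_t = sigmoid(W_o @ X + b_o)          # output gate, shape (70,)
h_t = o_t * np.tanh(C_t)              # hidden state h_t = o_t * tanh(C_t)
print(h_t.shape)  # (70,)
```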
3. Python Implementation
The Keras implementation of an LSTM layer accepts the "number of neurons" hyper-parameter, which is referred to as the number of hidden units in the Keras documentation, as seen in the code below.
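A minimal Keras sketch of such a layer is shown below; the sequence length of 100 is an arbitrary assumption for illustration, while the 32 input features and 70 hidden units match the example dimensions used above:

```python
import tensorflow as tf

# Input sequences of 100 time steps with 32 features each (length 100 is arbitrary);
# the bidirectional LSTM uses 70 hidden units per direction.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(100, 32)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(70)),
])

model.summary()  # Bidirectional LSTM output shape: (None, 140)
```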

The model summary reports that the final output of the LSTM layer is a vector of size 140 due to its bidirectional implementation (70 units in each direction, concatenated).

4. Conclusion
This post discussed the flow of information, in terms of vector sizes, through the LSTM cell.
This post was inspired by the following references:
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Ranjan, A., Fahad, M.S., Fernandez-Baca, D., Deepak, A. and Tripathi, S., 2019. Deep Robust Framework for Protein Function Prediction using Variable-Length Protein Sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
