Demystifying LSTM Weights and Bias Dimensions.

Gursewak Singh · Analytics Vidhya · Mar 4, 2020

LSTM (Long Short-Term Memory) is a variant of the Recurrent Neural Network (RNN) architecture. LSTMs mitigate the problem of vanishing and exploding gradients during backpropagation by using a memory cell. In this post, we will discuss the weight and bias dimensions of an LSTM cell. For an introduction to LSTMs, you can refer here or here.

Architecture of LSTM

Figure 1: LSTM Cell

In the LSTM figure, we can see 8 different weight parameters (4 associated with the hidden state and 4 associated with the input vector) and 4 different bias parameters. To understand this better, we can look at the following equations, which describe the operations inside an LSTM cell.

Figure 2: LSTM equations
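Since the equation figure is not reproduced here, the standard LSTM gate equations are given below for reference (U denotes the recurrent weights applied to the previous hidden state, W the weights applied to the input, and b the biases; the notation in the original figure may differ slightly):

\begin{aligned}
f_t &= \sigma(h_{t-1} U_f + x_t W_f + b_f)\\
i_t &= \sigma(h_{t-1} U_i + x_t W_i + b_i)\\
\tilde{c}_t &= \tanh(h_{t-1} U_c + x_t W_c + b_c)\\
o_t &= \sigma(h_{t-1} U_o + x_t W_o + b_o)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}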

With the help of the above equations, we can clearly see a total of 4 biases and 8 weights. Let's take an example.

Seq_len of the input sentence (S) = 12
Embedding dimension (E) = 30
No. of LSTM cells (hidden units) (H) = 10
Batch_size (B) = 1

The input (x) at each time step has shape batch size × embedding dimension = B×E (1×30).
The previous hidden state has shape batch size × hidden units = B×H (1×10).

Equation 1: forget gate      = [(1×10)·(10×10) + (1×30)·(30×10) + (1×10)] = (1×10) = (B×H)
Equation 2: update gate      = [(1×10)·(10×10) + (1×30)·(30×10) + (1×10)] = (1×10) = (B×H)
Equation 3: candidate memory = [(1×10)·(10×10) + (1×30)·(30×10) + (1×10)] = (1×10) = (B×H)
Equation 4: output gate      = [(1×10)·(10×10) + (1×30)·(30×10) + (1×10)] = (1×10) = (B×H)
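To make this shape arithmetic concrete, here is a minimal NumPy sketch of the forget-gate computation (the variable names W_h, W_x, and b are my own; the same arithmetic applies to the other three gates):

import numpy as np

B, E, H = 1, 30, 10                 # batch size, embedding dimension, hidden units

x = np.random.randn(B, E)           # input at one time step: (1, 30)
h_prev = np.random.randn(B, H)      # previous hidden state: (1, 10)

W_h = np.random.randn(H, H)         # recurrent weights of one gate: (10, 10)
W_x = np.random.randn(E, H)         # input weights of one gate: (30, 10)
b = np.random.randn(H)              # bias of one gate: (10,)

f = 1 / (1 + np.exp(-(h_prev @ W_h + x @ W_x + b)))   # sigmoid of the gate pre-activation
print(f.shape)                      # (1, 10) -> (B, H)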

Since all four gates follow the same structure, their weights can be stacked together and multiplied with the respective inputs in a single operation. In TensorFlow/Keras terminology, the stacked weights associated with the input are called the kernel, and the stacked weights associated with the hidden state are called the recurrent kernel.
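You can see this stacking directly in Keras: a single LSTM layer stores just three arrays, with the four per-gate matrices concatenated along the last axis. A minimal sketch, assuming TensorFlow 2.x:

import tensorflow as tf

lstm = tf.keras.layers.LSTM(10)            # H = 10 hidden units
lstm.build(input_shape=(None, 12, 30))     # (batch, seq_len, embedding dimension)

kernel, recurrent_kernel, bias = lstm.get_weights()
print(kernel.shape)             # (30, 40) -> input weights,     E x 4H
print(recurrent_kernel.shape)   # (10, 40) -> recurrent weights, H x 4H
print(bias.shape)               # (40,)    -> biases,            4H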

Note:
1. The LSTM processes data sequentially: it receives one word at a time, and the same LSTM cell receives each subsequent word. The number of LSTM cells (hidden units) does not mean the LSTM is repeated that many times; rather, the cell is unrolled up to the sequence length, and in the actual computation the same cell receives all the words one by one.
2. The sequence length has no effect on the weight and bias dimensions, as can be clearly seen in the calculations above.
3. In practice the weight matrix is multiplied after taking its transpose, but I have rearranged the weights and inputs here for simplicity.

To see all the weight and bias dimensions at a glance, I have put them in a table and named them according to the equations.

Figure 3: Dimensions of each parameter in LSTM

Comparison with the TensorFlow implementation

In the following code snippet, I have stacked two LSTM layers. The first is the one we have already discussed, and the results below are for that first LSTM layer only. You can verify your understanding with the second LSTM layer. Homework :P

Figure 4: code snippet for LSTM in Tensorflow
Figure 5: Output of the code snippet
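The original snippet and its output are shown only as images, so here is a minimal sketch of the kind of code being described (the second layer's 32 units are my own choice for illustration; the original may have used a different size):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(12, 30)),                    # (seq_len, embedding dimension)
    tf.keras.layers.LSTM(10, return_sequences=True),   # first LSTM layer, H = 10
    tf.keras.layers.LSTM(32),                          # second LSTM layer (homework)
])

for w in model.layers[0].weights:                      # first LSTM layer only
    print(w.name, w.shape)   # kernel (30, 40), recurrent_kernel (10, 40), bias (40,)

model.summary()              # first LSTM layer has 1,640 parameters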

Here we can clearly see that each weight and bias has the same dimensions as derived above. We can now also relate this to the formula for the number of parameters in an LSTM cell:

No. of parameters = 4 × [h(h + e) + h] = 4 × (10 × (10 + 30) + 10) = 1640,
where h = number of hidden units in the LSTM
and e = embedding dimension of the input
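As a quick sanity check of the formula (plain Python; the variable names are mine):

H, E = 10, 30                    # hidden units, embedding dimension
print(4 * (H * (H + E) + H))     # 1640, matching the first LSTM layer above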


I hope you liked this short article. Follow me on GitHub and Kaggle. Feel free to reach out to me at gskdhiman@gmail.com. Suggestions for improving this article are highly welcome.
