Demystifying LSTM Weights and Bias Dimensions
LSTM (Long Short-Term Memory) is a variant of the Recurrent Neural Network (RNN) architecture. LSTM mitigates the problem of vanishing and exploding gradients during backpropagation by using a memory cell. In this post, we will discuss the weight and bias dimensions of the LSTM cell. For an introduction to LSTM itself, you can refer here or here.
Architecture of LSTM
In the LSTM figure, we can see 8 different weight parameters (4 associated with the previous hidden state and 4 associated with the input vector), along with 4 different bias parameters. The following equations describe the operations inside the LSTM cell and make these parameters explicit.
From the equations above, we can clearly see a total of 4 biases and 8 weights. Let's take an example.
Sequence length of the input sentence (S) = 12
Embedding dimension (E) = 30
No. of LSTM cells (hidden units) (H) = 10
Batch size (B) = 1
The input (x) will be batch size * embedding dimension = B*E
The previous hidden state will be batch size * hidden units = B*H
Equation 1: forget gate = [(1*10).(10*10) + (1*30).(30*10) + (1*10)] = (1*10) = (B*H)
Equation 2: update gate = [(1*10).(10*10) + (1*30).(30*10) + (1*10)] = (1*10) = (B*H)
Equation 3: candidate memory = [(1*10).(10*10) + (1*30).(30*10) + (1*10)] = (1*10) = (B*H)
Equation 4: output gate = [(1*10).(10*10) + (1*30).(30*10) + (1*10)] = (1*10) = (B*H)
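The gate computations above can be sketched in numpy. This is a minimal sketch of a single gate (the names W_h, W_x, and b are illustrative, not from any library), just to confirm the shapes:

```python
import numpy as np

B, E, H = 1, 30, 10  # batch size, embedding dimension, hidden units

h_prev = np.zeros((B, H))      # previous hidden state, shape (1, 10)
x = np.random.randn(B, E)      # current input word vector, shape (1, 30)

W_h = np.random.randn(H, H)    # hidden-to-gate weights, shape (10, 10)
W_x = np.random.randn(E, H)    # input-to-gate weights, shape (30, 10)
b = np.zeros((B, H))           # gate bias, shape (1, 10)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# forget gate: (1,10).(10,10) + (1,30).(30,10) + (1,10) -> (1,10)
f = sigmoid(h_prev @ W_h + x @ W_x + b)
print(f.shape)  # (1, 10)
```

The update (input) gate, candidate memory, and output gate follow the exact same pattern, each with its own W_h, W_x, and b of the same shapes.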
Since all four gates follow the same structure, their weights can be concatenated and multiplied with the respective inputs in a single operation. In Keras terminology, the weights associated with the input are called kernel weights, and the weights associated with the hidden state are called recurrent kernel weights.
Note:
1. LSTM processes data sequentially: the same LSTM cell receives one word at a time, then the next word, and so on. The number of LSTM cells does not mean the LSTM is repeated that many times; rather, a single cell is unfolded up to the sequence length, receiving all the words one by one.
2. Sequence length has no effect on the weight and bias dimensions, as the calculations above clearly show.
3. In actual implementations, the input is multiplied by the transpose of the weight matrix; here I have rearranged the weights and inputs for simplicity.
To see all the weight and bias dimensions at a glance, here they are per gate (forget, update, candidate memory, output), named as per the equations:
input weights: E * H = 30 * 10 (each gate)
hidden-state weights: H * H = 10 * 10 (each gate)
bias: 1 * H = 1 * 10 (each gate)
Comparing with the TensorFlow implementation
In the following code snippet, I have implemented two LSTM layers. We have already discussed the first one, and the results below are for the first LSTM layer only. You can verify your understanding with the second LSTM layer. Homework :P
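The shapes reported by TensorFlow can also be checked by hand. Keras stores an LSTM layer's parameters as three arrays: a kernel of shape (E, 4H) for the input, a recurrent kernel of shape (H, 4H) for the hidden state, and a bias of shape (4H,), with the four gates concatenated along the last axis in the order input, forget, candidate, output. A numpy sketch of that layout for our first layer:

```python
import numpy as np

E, H = 30, 10  # embedding dimension, hidden units of the first LSTM layer

# Keras concatenates the four gates along the last axis (order: i, f, c, o)
kernel = np.zeros((E, 4 * H))            # input weights,        shape (30, 40)
recurrent_kernel = np.zeros((H, 4 * H))  # hidden-state weights, shape (10, 40)
bias = np.zeros((4 * H,))                # biases,               shape (40,)

print(kernel.shape, recurrent_kernel.shape, bias.shape)

# slicing out the forget-gate block recovers the per-gate shapes from the equations
W_xf = kernel[:, H:2 * H]                # (30, 10)
W_hf = recurrent_kernel[:, H:2 * H]      # (10, 10)
```

In Keras itself, the same three arrays come back from `model.layers[0].get_weights()` for an `LSTM(10)` layer fed 30-dimensional embeddings.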
Here we can clearly see that we have the same dimensions for each weight and bias. So now we can also easily relate them to the formula for the number of parameters in an LSTM cell, i.e.
No of parameters = 4 × [h(h+e) + h] = 4(10(10+30)+10) = 1640.
where h = no. of hidden units in LSTM
and e = embedding dimension of Input
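The formula is easy to verify in code (the function name is mine, just for illustration):

```python
def lstm_param_count(h, e):
    """4 gates, each with an (h x h) recurrent matrix, an (e x h) input matrix and h biases."""
    return 4 * (h * (h + e) + h)

print(lstm_param_count(10, 30))  # 1640, matching model.summary() for the first layer
```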
I hope you liked this short article. Follow me on GitHub and Kaggle. Feel free to reach out to me at gskdhiman@gmail.com. Any suggestions for improving this article are most welcome.