And of course, LSTM — Part II

Deriving the gradients for backward propagation in LSTM

Eniola Alese
ExplainingML
4 min read · Jun 11, 2018


The first part of the post can be found here.

Back Propagation in the LSTM Unit

In practice, back propagation is handled automatically by deep learning libraries such as PyTorch and TensorFlow, but in order to gain a better understanding of the LSTM model we will derive its gradients analytically.
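For instance, a few lines of PyTorch are enough to have autograd compute every gradient we are about to derive by hand (a minimal sketch, not code from the original post; the layer sizes are arbitrary):

import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)  # single LSTM layer
classifier = nn.Linear(16, 4)                                   # 4-class output head
loss_fn = nn.CrossEntropyLoss()                                 # cross entropy, as in Step 1 below

x = torch.randn(2, 5, 8)               # (batch, time steps, features)
targets = torch.randint(0, 4, (2,))    # one class label per sequence

output, (h_n, c_n) = lstm(x)           # forward pass through the unrolled LSTM
logits = classifier(output[:, -1, :])  # predict from the last hidden state
loss = loss_fn(logits, targets)

loss.backward()                        # back propagation: autograd fills the .grad tensors
print(lstm.weight_ih_l0.grad.shape)    # gradient w.r.t. the input-to-hidden weights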

Backward pass in the LSTM unit

Step 1: We define the loss function used for the cost computation. The choice of loss function usually depends on the task at hand; in this case we use the cross-entropy loss L⟨t⟩ for a multi-class output.
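Written out, assuming a one-hot target vector y⟨t⟩ over K classes (the usual multi-class setup), this is:

L^{\langle t \rangle} = -\sum_{k=1}^{K} y_k^{\langle t \rangle} \log \hat{y}_k^{\langle t \rangle}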

Step 2: We work our way backwards, starting with the partial derivative of the loss L⟨t⟩ with respect to the predicted output activation (the softmax output) ŷ⟨t⟩.
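Differentiating the cross-entropy loss above with respect to each component of ŷ⟨t⟩ gives, element-wise:

\frac{\partial L^{\langle t \rangle}}{\partial \hat{y}^{\langle t \rangle}} = -\frac{y^{\langle t \rangle}}{\hat{y}^{\langle t \rangle}}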

Step 3: Compute the partial derivative of the loss L⟨t⟩ with respect to the pre-softmax output z⟨t⟩.
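Combining Step 2 with the derivative of the softmax, the terms collapse to the familiar softmax-plus-cross-entropy result:

\frac{\partial L^{\langle t \rangle}}{\partial z^{\langle t \rangle}} = \hat{y}^{\langle t \rangle} - y^{\langle t \rangle}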

Step 4: Compute the partial derivative of the loss L⟨t⟩ with respect to the hidden state activation a⟨t⟩.
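Assuming the output layer from Part I has the form z⟨t⟩ = W_y a⟨t⟩ + b_y (W_y and b_y are placeholder names here for the output-layer parameters), the chain rule gives:

\frac{\partial L^{\langle t \rangle}}{\partial a^{\langle t \rangle}} = W_y^{T} \left( \hat{y}^{\langle t \rangle} - y^{\langle t \rangle} \right)

When the full sequence is unrolled, the gradient arriving from a⟨t+1⟩ would be added to this term as well.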

Step 5: Compute the partial derivative of the loss L⟨t⟩ with respect to the output gate o⟨t⟩.
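Assuming the usual forward-pass definition a⟨t⟩ = o⟨t⟩ ⊙ tanh(c⟨t⟩), only the tanh(c⟨t⟩) factor survives:

\frac{\partial L^{\langle t \rangle}}{\partial o^{\langle t \rangle}} = \frac{\partial L^{\langle t \rangle}}{\partial a^{\langle t \rangle}} \odot \tanh\left(c^{\langle t \rangle}\right)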

Step 6: Compute the partial derivative of the loss L⟨t⟩ with respect to the output gate weights W_o.
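With the usual sigmoid gate o⟨t⟩ = σ(W_o [a⟨t-1⟩, x⟨t⟩] + b_o) and σ'(x) = σ(x)(1 - σ(x)), the chain rule yields an outer product with the concatenated input:

\frac{\partial L^{\langle t \rangle}}{\partial W_o} = \left[ \frac{\partial L^{\langle t \rangle}}{\partial o^{\langle t \rangle}} \odot o^{\langle t \rangle} \odot \left(1 - o^{\langle t \rangle}\right) \right] \left[ a^{\langle t-1 \rangle}, x^{\langle t \rangle} \right]^{T}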

Step 7: Compute the partial derivative of the loss L⟨t⟩ with respect to the memory cell c⟨t⟩.
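Differentiating a⟨t⟩ = o⟨t⟩ ⊙ tanh(c⟨t⟩) with respect to c⟨t⟩, and using tanh'(x) = 1 - tanh²(x):

\frac{\partial L^{\langle t \rangle}}{\partial c^{\langle t \rangle}} = \frac{\partial L^{\langle t \rangle}}{\partial a^{\langle t \rangle}} \odot o^{\langle t \rangle} \odot \left(1 - \tanh^{2}\left(c^{\langle t \rangle}\right)\right)

Over a full sequence, the term f⟨t+1⟩ ⊙ ∂L/∂c⟨t+1⟩ flowing back from the next time step would be added here as well.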

Step 8: Compute the partial derivative of the loss L⟨t⟩ with respect to the temporary (candidate) memory cell č⟨t⟩.
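Assuming the memory-cell update c⟨t⟩ = u⟨t⟩ ⊙ č⟨t⟩ + f⟨t⟩ ⊙ c⟨t-1⟩, only the update-gate factor remains:

\frac{\partial L^{\langle t \rangle}}{\partial \tilde{c}^{\langle t \rangle}} = \frac{\partial L^{\langle t \rangle}}{\partial c^{\langle t \rangle}} \odot u^{\langle t \rangle}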

Step 9: Compute the partial derivative of the loss L⟨t⟩ with respect to the temporary memory cell weights W_č.
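With the usual definition č⟨t⟩ = tanh(W_č [a⟨t-1⟩, x⟨t⟩] + b_č), the tanh derivative appears again:

\frac{\partial L^{\langle t \rangle}}{\partial W_{\tilde{c}}} = \left[ \frac{\partial L^{\langle t \rangle}}{\partial \tilde{c}^{\langle t \rangle}} \odot \left(1 - \left(\tilde{c}^{\langle t \rangle}\right)^{2}\right) \right] \left[ a^{\langle t-1 \rangle}, x^{\langle t \rangle} \right]^{T}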

Step 10: Compute the partial derivative of the loss L⟨t⟩ with respect to the update gate u⟨t⟩.
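Again from c⟨t⟩ = u⟨t⟩ ⊙ č⟨t⟩ + f⟨t⟩ ⊙ c⟨t-1⟩, this time keeping the candidate factor:

\frac{\partial L^{\langle t \rangle}}{\partial u^{\langle t \rangle}} = \frac{\partial L^{\langle t \rangle}}{\partial c^{\langle t \rangle}} \odot \tilde{c}^{\langle t \rangle}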

Step 11: Compute the partial derivative of the loss L⟨t⟩ with respect to the update gate weights W_u.
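With the usual sigmoid gate u⟨t⟩ = σ(W_u [a⟨t-1⟩, x⟨t⟩] + b_u), the sigmoid derivative gives:

\frac{\partial L^{\langle t \rangle}}{\partial W_u} = \left[ \frac{\partial L^{\langle t \rangle}}{\partial u^{\langle t \rangle}} \odot u^{\langle t \rangle} \odot \left(1 - u^{\langle t \rangle}\right) \right] \left[ a^{\langle t-1 \rangle}, x^{\langle t \rangle} \right]^{T}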

Step 12: Compute the partial derivative of the loss L⟨t⟩ with respect to the forget gate f⟨t⟩.
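The forget gate multiplies the previous memory cell in the update equation, so that factor survives:

\frac{\partial L^{\langle t \rangle}}{\partial f^{\langle t \rangle}} = \frac{\partial L^{\langle t \rangle}}{\partial c^{\langle t \rangle}} \odot c^{\langle t-1 \rangle}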

Step 13: Compute the partial derivative of the loss L⟨t⟩ with respect to the forget gate weights W_f.
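Finally, with the usual sigmoid gate f⟨t⟩ = σ(W_f [a⟨t-1⟩, x⟨t⟩] + b_f):

\frac{\partial L^{\langle t \rangle}}{\partial W_f} = \left[ \frac{\partial L^{\langle t \rangle}}{\partial f^{\langle t \rangle}} \odot f^{\langle t \rangle} \odot \left(1 - f^{\langle t \rangle}\right) \right] \left[ a^{\langle t-1 \rangle}, x^{\langle t \rangle} \right]^{T}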
