Gated Recurrent Units explained with matrices: Part 2, Training and Loss Function

Sparkle Russell-Puleri
Mar 6, 2019


by: Sparkle Russell-Puleri and Dorian Puleri

In part one of this tutorial series, we demonstrated the matrix operations used to estimate the hidden states and outputs for the forward pass of a GRU. Based on our poor results, we clearly need to optimize our algorithm and evaluate it on a test set to ensure generalizability. This is typically done using several steps and techniques. In this tutorial we will walk through what happens under the hood during optimization: specifically, calculating the loss function and performing backpropagation through time to update the weights over several epochs.

What’s happening here?

As we pointed out in the first tutorial, the first couple of strings generated are a bit erratic, but after a few passes the model seems to get at least the next two characters correct. However, to measure how inconsistent our predictions are versus the true labels, we need a metric. This metric is called the loss function, and it measures how well the model is performing. It is a positive value that decreases as the network becomes more confident in its predictions. For multi-class classification problems, the loss function is defined as:

Cross entropy equation
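For reference, a standard form of the multi-class cross entropy loss pictured above (assuming T time steps per sequence, C classes, one-hot true labels y, and softmax predictions ŷ) is:

```latex
L = -\frac{1}{T}\sum_{t=1}^{T}\sum_{c=1}^{C} y_{t,c}\,\log \hat{y}_{t,c}
```

The three steps the rest of this post works through (element-wise product, row-wise sum, mean and negation) correspond to evaluating this expression from the inside out.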

Recall our calculated hidden states and predicted outputs for the first batch? This picture seems a bit busy, but the goal here is to visualize what your outputs and hidden states actually look like under the hood. The predictions are probabilities, calculated using the Softmax activation function.

GRU outputs with matrices

Let’s re-run the training loop, storing the outputs (y_hat) and hidden states (h_(t-1), h_t, and h_(t+1)) for each sequence in batch 1.

Illustration in code:

To understand what is happening, notice that we work from the inside out before moving to functions. Here, we grab the outputs and hidden states calculated with just two loops.

The cross entropy loss is first calculated for each sequence in the batch, then averaged over all sequences. So, in this example we will calculate the cross entropy loss for each sequence from scratch. But first, let’s grab the predictions made on the first batch. To do this we will take the first element (index 0) from our ht_2 and outputs variables.

Model predictions and hidden states for the first batch

How well did we perform?

By looking at the output probabilities, we can tell that we did not do so well. However, let’s quantify it using the cross entropy equation! Here we will work our way from the inner term out on the first sequence in the batch. Note, the code will include all 3 sequences in batch 1.

First term: Element-wise multiplication of the true labels with the log of the predicted labels

Cross entropy term 1 calculation

Implementation in code:
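As a stand-in for the screenshot, here is a minimal NumPy sketch of this step, using a small hypothetical sequence (two time steps over a three-character vocabulary) rather than the tutorial's actual batch:

```python
import numpy as np

# Hypothetical softmax predictions for one sequence: 2 time steps, 3 classes.
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]])
# One-hot true labels for the same 2 time steps.
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

# First term: element-wise product of the true labels with the log of the
# predictions. Each row keeps exactly one non-zero entry: the log of the
# probability the model assigned to the correct character.
term1 = y_true * np.log(y_hat)
print(term1)
```

Because the labels are one-hot, everything except the correct-class log probability is zeroed out, which is what makes the row-wise reduction in the next step so simple.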

Second term: Summation of the remaining values within each sequence. In this step, it is key to note that the axis will be reduced row-wise, leaving only the non-zero terms. This will be done programmatically in a loop.

Cross entropy term 2 calculation

Implementation in code:

Third term: Mean of the reduced values for the first sequence within the batch, taken row-wise. This example calculation was done on the first sequence within batch 1; however, the code implementation covers all 3 sequences in batch 1.

Cross entropy term 3 calculation

Implementation in code:
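Finishing the same hypothetical example, the per-sequence loss is the negated mean of the reduced values:

```python
import numpy as np

# Same hypothetical sequence: 2 time steps, 3 classes.
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]])
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
term2 = np.sum(y_true * np.log(y_hat), axis=1)

# Third term: average the reduced values over the sequence's time steps and
# negate, giving the cross entropy loss for this sequence.
seq_loss = -np.mean(term2)
print(seq_loss)
```

The negation is what makes the loss positive: log probabilities are always negative, and the more confident the model is in the correct character, the closer this value gets to zero.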

Averaging the cross entropy losses of each sequence within batch 1

Note, in practice this step is done over each mini-batch by keeping a running average of the losses. It essentially sums the cross entropy losses we calculated for each sequence in batch 1 and divides by the number of sequences within the batch.
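The batch-averaging step itself is a one-liner; a sketch with hypothetical per-sequence losses (not the tutorial's actual numbers) for the 3 sequences in a batch:

```python
import numpy as np

# Hypothetical cross entropy losses for the 3 sequences in one batch.
seq_losses = np.array([1.30, 1.25, 1.23])

# Batch loss: sum the per-sequence losses and divide by the batch size.
batch_loss = seq_losses.sum() / len(seq_losses)
print(batch_loss)
```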

How did we do?

A batch loss of 1.2602 is high and means that we have plenty of room for improvement.

Explanation

So we optimized and reduced our loss, yet we are still not predicting well… why? Well, as mentioned in the first tutorial, this is an extremely small dataset; when training a neural net built from scratch, it is recommended that you do so with lots of data. However, the purpose of this tutorial is not to create a high-performance neural net, but to demonstrate what goes on under the hood.

Backpropagation

The final step involves a backward pass through the algorithm. This step is called backpropagation, and it involves understanding the impact of adjusting the weights on the cost function. This is done by calculating the error vectors (deltas) starting from the final layer and working backward, repeatedly applying the chain rule through each layer. For a more detailed proof of backpropagation through time, see: https://github.com/tianyic/LSTM-GRU/blob/master/MTwrtieup.pdf
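A full derivation of backpropagation through time is beyond this post, but the very first error vector is easy to state: for softmax outputs paired with a cross entropy loss, the gradient with respect to the pre-softmax activations simplifies to the prediction minus the one-hot label. A small sketch with hypothetical values:

```python
import numpy as np

# Hypothetical softmax prediction and one-hot label for a single time step.
y_hat = np.array([0.7, 0.2, 0.1])
y_true = np.array([1.0, 0.0, 0.0])

# For softmax + cross entropy, the output-layer error vector (delta) reduces
# to the difference between the prediction and the label.
delta = y_hat - y_true
print(delta)
```

This delta is the starting point that gets propagated backward through each layer (and each time step) via the chain rule.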

