Deep Learning: Recurrent Neural Networks

Pedro Borges
deeplearningbrasilia
4 min read · Oct 18, 2018

Recurrent Neural Networks (RNNs) are the Neural Network tools for problems that deal with sequential data.
They became increasingly popular due to their great results in Natural Language Processing (NLP).
Within NLP, they are used for the most varied tasks, such as translation, text classification, and automatic text generation.

Here, I briefly explain the basic structure of RNNs.

Basic Building Block

RNNs are normally drawn as in the picture below.

Representation of RNN both in folded and unfolded forms

In the picture, there are some distinct components, from which the most important are:

  • x: The input. It can be a word in a sentence or some other type of sequential data.
  • O: The output. For instance, what the network thinks the next word in a sentence should be, given the previous words.
  • h: The main block of the RNN. It contains the weights and the activation functions of the network.
  • V: Represents the communication from one time-step to the next.

The folded and unfolded representations of the network in the picture are equivalent. It is sometimes useful to unfold the network to get a better understanding of what is happening at each step.

One very important aspect to notice is that, even though the unfolded version shows several h blocks, the h block used is always the same one: its weights are shared across all time-steps.
The h block feeds its output back into itself, once per element of the input, until the sequence ends.

The simplest kind of h block is shown in the picture below.

Representation of simple RNN block

This block simply combines the input (a word, for instance) with its own output from the previous time-step and passes the result through a tanh activation to get the output of the current time-step.
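In code, that combination is essentially one line. Below is a minimal sketch of a single time-step in NumPy; the names W_xh, W_hh, and b and the dimensions are illustrative assumptions, not something defined in this article.

```python
import numpy as np

def simple_rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One time-step of the simplest RNN block: mix the current input
    with the previous output and squash the result with tanh."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)
```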

To illustrate how it works, we can think about the problem of predicting the next word of a phrase. Let’s take the phrase “Bananas are” as an example. The task is to predict the third word.

On the first time-step, we feed the network the word Bananas. That word goes through the h block. The output of the h block is then fed back into it along with the next word, are. At this point, the input sequence is exhausted, so the network stops feeding itself and gives us what it thinks the next word should be. In this case, it could say yellow.
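To make the unfolding concrete, here is a rough sketch of that loop; the word vectors are random placeholders and the sizes are arbitrary, chosen only to show that the very same weights are reused at each time-step.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 8, 16

# Placeholder word vectors for the phrase "Bananas are"
words = {"Bananas": rng.normal(size=embed_dim),
         "are": rng.normal(size=embed_dim)}

# A single set of weights, shared by every time-step
W_xh = 0.1 * rng.normal(size=(hidden_dim, embed_dim))
W_hh = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)            # initial state
for word in ["Bananas", "are"]:     # the unfolded loop: same block, output fed back in
    h = np.tanh(W_xh @ words[word] + W_hh @ h + b)

# h now summarizes the phrase; a prediction layer (not shown) would map it to a word
```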

Long short-term memory (LSTM) and Gated Recurrent Unit (GRU)

The basic building block for RNNs shown above suffers from some problems. One of the most important is the inability to retain information when the given sequence is long: it forgets information that was supplied several time-steps ago, which limits the learning performance. Several architectures were created to tackle this, and the most popular are the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU).

These two are sibling architectures that are used in the great majority of applications. What distinguishes them from the basic block is what the h block contains. I will not go into detail about each one, but the pictures below show the inner components of both the LSTM and the GRU.

LSTM block
GRU block
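In practice, these blocks are usually taken off the shelf rather than written by hand. A minimal PyTorch sketch is below; the sizes are arbitrary, and the snippet only shows that the two blocks expose almost the same interface.

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size = 4, 10, 32, 64
x = torch.randn(batch, seq_len, input_size)

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)

out_lstm, (h_n, c_n) = lstm(x)   # the LSTM keeps an extra cell state c_n
out_gru, h_gru = gru(x)          # the GRU keeps only a hidden state

print(out_lstm.shape, out_gru.shape)  # both: torch.Size([4, 10, 64])
```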

Deep RNN

It is possible to stack the building blocks we have seen previously to create a deeper network. The picture below shows the representation of what would be a network with two RNN layers.

2 layer RNN

Each of these layers will be unfolded several times when you train or run inference on the network. For this reason, it is not common to stack many RNN layers, since it would make the network too computationally demanding to be used in practice.

Architectures tend to use up to three stacked RNN layers; it is rare to use many more than that.
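For reference, stacking recurrent layers is usually just a constructor argument in modern frameworks. A PyTorch sketch of a two-layer network (sizes are again arbitrary assumptions):

```python
import torch
import torch.nn as nn

# Two stacked recurrent layers: the output sequence of the first layer
# becomes the input sequence of the second.
deep_rnn = nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)

x = torch.randn(4, 10, 32)        # (batch, time-steps, features)
out, (h_n, c_n) = deep_rnn(x)
print(out.shape)  # torch.Size([4, 10, 64]), the outputs of the last layer
print(h_n.shape)  # torch.Size([2, 4, 64]), the final hidden state of each layer
```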

Back Propagation Through Time (BPTT)

The Back Propagation algorithm used to train neural networks receives a special name in the case of RNNs. It works the same way: the chain rule is applied through the network and the weights are updated going from the last layer to the first.

It receives this special name because there are not really distinct layers being updated, but a single block whose weights are updated with contributions from every time-step.
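With an automatic-differentiation framework, BPTT falls out of simply unrolling the loop and calling backward on the final loss. A minimal PyTorch sketch follows; the toy target and sizes are made up for illustration.

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=8, hidden_size=16)   # one shared block
readout = nn.Linear(16, 1)

x = torch.randn(5, 3, 8)          # 5 time-steps, batch of 3, 8 features
target = torch.randn(3, 1)        # toy regression target

h = torch.zeros(3, 16)
for t in range(x.size(0)):        # unfold: the same cell is applied at every step
    h = cell(x[t], h)

loss = ((readout(h) - target) ** 2).mean()
loss.backward()                   # gradients flow back through every time-step

# The single shared weight matrix accumulates one gradient from all time-steps
print(cell.weight_hh.grad.shape)  # torch.Size([16, 16])
```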

Final Thoughts

Compared to other architectures, like the famous Convolutional Neural Networks (CNNs) widely used in computer vision, RNNs are simpler: they contain fewer blocks and layers. However, they might still require a lot of computational power because of the way they need to be unrolled.

Finally, even though RNNs have proven themselves by achieving incredible performance in sequence-related tasks, especially NLP applications, I believe we will still see some great improvements in the next couple of years.

If you liked the post, follow me and give it a clap. Let me know what else you would like me to write about.
