Disassembling Recurrent Neural Networks

Anish Agarwal
6 min read · Jul 10, 2018


Why RNNs?

Recurrent Neural Networks (RNNs) are neural networks which are good at modelling sequences of data. Sequences of data include things like music, voltage/current over time, the torque of a certain bolt on a manufacturing line measured across multiple components on that line, or vision data (image frames) over time. A recurrent neural network can model these sequences of data, predict their behavior and even generate new sequences that are similar to the original, but novel nonetheless.

On a more personal level, if you stare at a bright light for say 10 seconds and then suddenly close your eyes, you will see a residue of the illumination in your visual cortex. This is an example of a biological recurrent neural network in action within the human brain. The exact working principles (forward/back propagation) might differ, but the general effect is the same: sensory input is passed through from one propagation to the next. That is, we humans all, whether we are aware of it or not, experience and use RNNs on a daily basis! With this we can appreciate and acknowledge the fundamental relevance of RNNs in the field of artificial intelligence.

This article will attempt to give the reader a basic understanding of RNNs, covering both the basic RNN and the variant most commonly found in applied work, the Long Short-Term Memory (LSTM) network. Keep in mind this article assumes an understanding of basic deep learning concepts such as forward/back propagation, activation functions, probability theory, etc.

As a general rule, a good way to understand how something works is to take it apart and see how it operates. So we will do exactly that. We will take apart an RNN and inspect the insides, asking questions like: what is a cell? How are cells connected? What are the inputs and outputs? What are the internal matrices within an RNN? What kinds of RNN configurations are possible? With this I hope you can get a brief primer on RNNs, be amazed at their capability and look forward to their future potential. So let's get to it.

The Network Architecture

A basic RNN is shown in the image below. The top shows a “rolled” representation of an RNN. The arrow that loops back from the output to the input of the RNN is what allows data from one iteration (of forward/back propagation) to pass through to the next. In the first step of disassembly, we will “unroll” the RNN to produce the representation shown underneath.

In the unrolled representation we can observe 3 RNN “cells”, each of which is the same cell shown at a successive time step. Each cell has 2 inputs and 2 outputs, represented by the arrows. Any arrow whose head faces towards the cell is an input, and any arrow whose head faces away from the cell is an output. Arrows which are vertical represent external inputs (e.g. input data) and external outputs (e.g. predictions). Arrows which are horizontal represent internal inputs (e.g. the previous cell state) and internal outputs (e.g. the current cell state). In the image there are 3 cells; however, you can have more or fewer depending on the size of the data you are feeding into the RNN and the configuration of the RNN.

So how is this structure useful in modelling sequences of data? Well, each data point in the sequence is passed through the cell in turn, one per iteration. The cell state produced at each iteration is updated and carried over to the next. This means that when a data point is sent to the cell, the cell state at that moment has already been shaped by all prior data points, in the order they were sent. So at every iteration the entire sequence seen so far is taken into consideration. How far back in a sequence a cell can model patterns depends on the structure of both the cell and the network. As you will see further below, LSTMs are a variation on the basic RNN cell which can better model long-term patterns in data.
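
To make this flow concrete, here is a minimal sketch (in NumPy, with made-up dimensions and my own variable names) of an unrolled RNN processing a short sequence: the same cell function is applied at every time step, and the cell state it produces is fed back in at the next step. The internals of the cell are taken apart further below.

```python
import numpy as np

def rnn_cell(x_t, c_prev, Wx, Wc, b):
    # One forward step of a basic RNN cell (its internals are opened up later in the article).
    return np.tanh(Wx @ x_t + Wc @ c_prev + b)

# Made-up sizes: a sequence of 3 input vectors of size 5, a cell state of size 4.
input_size, state_size = 5, 4
rng = np.random.default_rng(0)
Wx = rng.normal(size=(state_size, input_size))
Wc = rng.normal(size=(state_size, state_size))
b = np.zeros(state_size)
sequence = [rng.normal(size=input_size) for _ in range(3)]

# "Unrolling": the same cell is applied at each time step, and the cell state
# produced at one step is passed back in as an input at the next step.
c = np.zeros(state_size)             # initial cell state
for x_t in sequence:
    c = rnn_cell(x_t, c, Wx, Wc, b)  # c now reflects every input seen so far, in order
```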

There are many different ways to structure an RNN, some of which are shown below along with an example of where each structure is used. Note that newer RNN structures are ever-evolving, but these are some of the basic ones.

This is a high-level picture of an RNN. Let's go further into our disassembly and take apart a cell.

The Cell

The internals of a cell look like the following. It might seem intimidating at first but it’s really quite simple.

The RNN cell receives 2 matrices as input: Xt (the input data) and Ct-1 (the previous cell state). These inputs are multiplied by weights (Wc and Wx) and added to a bias (bc). Then we apply an activation function, usually tanh, sigmoid or a rectified linear unit (ReLU). The matrix produced after activation is the new cell state, which will then be sent to the RNN cell in the next iteration. To get the output from the RNN we apply a softmax to the new cell state, which effectively converts it into a matrix of probabilities. To train the network, a variation of the backpropagation algorithm called backpropagation through time is used to adjust the weight and bias matrices so that the RNN behaves as intended.
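
Putting the description above into code, a single iteration of this basic cell might look like the following sketch. The names and dimensions are assumptions made only for illustration, and I follow the description above in applying the softmax directly to the new cell state (many practical implementations instead project the state through a separate output weight matrix first).

```python
import numpy as np

def softmax(z):
    z = z - z.max()   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def basic_rnn_cell(x_t, c_prev, Wx, Wc, b_c):
    # Multiply the inputs by their weights, add the bias, apply an activation (tanh here).
    c_t = np.tanh(Wx @ x_t + Wc @ c_prev + b_c)  # new cell state, sent to the next iteration
    y_t = softmax(c_t)                           # output, converted into probabilities
    return c_t, y_t

# Made-up dimensions, just to make the sketch runnable.
input_size, state_size = 5, 4
rng = np.random.default_rng(1)
Wx = rng.normal(size=(state_size, input_size))
Wc = rng.normal(size=(state_size, state_size))
b_c = np.zeros(state_size)

c_t, y_t = basic_rnn_cell(rng.normal(size=input_size), np.zeros(state_size), Wx, Wc, b_c)
```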

Now this is a basic version of an RNN cell. There are other types of cells, such as the Gated Recurrent Unit (GRU) and the aforementioned LSTM, each of which has certain advantages and disadvantages. We will take apart an LSTM cell, as it is one of the more commonly used RNN cells.

In disassembling an LSTM cell, we find the following. Whereas the basic RNN cell has only a single “gate” through which its inputs pass, the LSTM has 3 major gates. By gate I am referring to a matrix multiplication with a weight matrix, followed by the addition of a bias and, finally, the application of an activation function.
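
In code, a gate in this sense is only a few lines; here is a minimal sketch (the function name and signature are my own):

```python
import numpy as np

def gate(W, v, b, activation):
    # A "gate" in the sense used above: multiply by a weight matrix,
    # add a bias, then apply an activation function.
    return activation(W @ v + b)
```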

In the basic RNN cell, long-term patterns are not recognized as well because the cell does not consider which parts of its state are worth keeping; it simply overwrites the cell state based on whatever input it currently receives. In the LSTM, on the other hand, the gates allow the cell to reconsider exactly what it wants to keep and what it wants to update, with the result that long-term patterns are retained. It does this with 2 separate gates: the forget gate and the update gate. The forget gate determines which parts of the existing cell state are irrelevant to the long-term pattern in the data and can be erased. The update gate determines which parts of the current input are crucial to write into the cell state Ct for the long term. Only after the cell has decided what to forget and what to update does the output gate decide which parts of the new cell state to expose as the cell's output at the current step. These extra gates allow the LSTM, as the name implies, to remember both long- and short-term patterns in data.
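
To make the roles of the three gates concrete, here is a minimal sketch of one forward step of a standard LSTM cell, in which the cell state Ct carries the long-term memory and a separate output ht carries the short-term output. The weight names, the concatenation of the previous output with the current input, and the dimensions are my own choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    # W and b hold the parameters of the four gate-like transforms, each applied
    # to the previous output h_prev concatenated with the current input x_t.
    v = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ v + b["f"])      # forget gate: what to erase from c_prev
    i = sigmoid(W["i"] @ v + b["i"])      # update (input) gate: what to write
    c_hat = np.tanh(W["c"] @ v + b["c"])  # candidate values to write
    c_t = f * c_prev + i * c_hat          # new long-term cell state
    o = sigmoid(W["o"] @ v + b["o"])      # output gate: what to expose
    h_t = o * np.tanh(c_t)                # new output / short-term state
    return h_t, c_t

# Made-up sizes, just to make the sketch runnable.
input_size, state_size = 5, 4
rng = np.random.default_rng(2)
W = {k: rng.normal(size=(state_size, state_size + input_size)) for k in "fico"}
b = {k: np.zeros(state_size) for k in "fico"}

h, c = np.zeros(state_size), np.zeros(state_size)
h, c = lstm_cell(rng.normal(size=input_size), h, c, W, b)
```

Because the forget gate outputs values between 0 and 1, multiplying it elementwise with the previous cell state lets the cell keep some entries almost untouched across many iterations while erasing others, which is what gives the LSTM its long-term memory.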

What next?

So this completes our disassembly of an RNN, through which we can see its operational behavior at both the network and cell levels. It will be exciting to see how these networks are applied to more advanced systems. Take, for example, the concept of multi-layered RNNs, which generalize the long/short-term dependency modelling behavior of LSTMs even further, to the point where they are grouped together into their own sub-field of deep learning called Attention Models. If you think about it, finding long- and short-term dependencies in data is really about attention. The algorithm needs to ask: what part of the data should I pay attention to in the short term, and what part should I pay attention to in the long term? Multi-layered RNNs are a really promising method of achieving this. It will be interesting to see how RNN theory changes and develops over time, and in what form it will ultimately be applied.
