Sequence Modelling

Tarun Paparaju
Machine Learning Basics
6 min read · Dec 27, 2018

Sequence Modelling is the ability of a computer program to model, interpret, make predictions about or generate any type of sequential data, such as audio or text. For example, a program that takes a piece of text in English and translates it to French is a Sequence Modelling program, because the data being dealt with is text, which is sequential in nature. The Recurrent Neural Network (RNN) is a specialised form of the classic Artificial Neural Network (Multi-Layer Perceptron) that is used to solve Sequence Modelling problems. Recurrent Neural Networks are Artificial Neural Networks with loops in them. This means that the activation of each neuron or cell depends not only on its current input but also on its previous activation values.

The architecture of an RNN is also inspired by the human brain. As we read an essay, we interpret the sentence we are currently reading better because of the information we gained from the previous sentences. Similarly, we can understand the conclusion of a novel only if we have read its beginning and middle. The same logic applies to audio. On a basic level, interpreting a certain part of a sequence requires information gained from the previous parts of that sequence. In the human brain, the information that persists in our memory while interpreting sequential data is vital for understanding each part of the sequence. RNNs try to incorporate this capacity for memory by updating something called the “state” of their cells each time they move from one part of a sequence to the next. The state of a cell is essentially the total information it has gained so far from reading the sequence. So, the current state or knowledge of a cell in an RNN depends not only on the current word or sentence it is reading, but also on all the words or sentences it has read before the current one; hence the name Recurrent Neural Network. (Classic ANNs do not have this mechanism of memory: an ANN neuron’s current state depends only on the current input, as it discards information about previous inputs to the cell.)

  • The first image above illustrates a recurrent neuron or cell. It is a simple neuron that has a loop. It takes some input x and gives some output h.
  • This neuron can be thought of as multiple copies of the same unit or cell chained together. This is illustrated by the second image, which shows an “unrolled” form of the recurrent neuron. Each copy or unit passes a message (some information) to the next copy.

In Recurrent Neural Networks, there is a concept of time steps. This means that the recurrent cells or units take inputs from a sequence one by one. Each step at which the cell picks up an input is called a time step. For example, if we have a sequence of words that form a sentence, such as “It’s a sunny day.”, our recurrent cell will take the word “It’s” as its input at the first time step. It stores information about the word “It’s” in its memory and updates its state. Next, it takes the word “a” as its input at the second time step, incorporates information about the word “a” into its memory and updates its state once again. It repeats this process until the last word. Therefore, the cell state at the 1st time step depends only on the 1st input, the cell state at the 2nd time step depends on the 1st and 2nd inputs, the cell state at the 3rd time step depends on the 1st, 2nd and 3rd inputs, and so on. In this way the cell continuously updates its memory as time passes (similar to a human brain).

Relating what you learnt in the previous paragraph to the images above, we can say that $x_1$, $x_2$, $x_3$ and so on are the inputs to the recurrent cell at the 1st, 2nd, 3rd and subsequent time steps. At each time step, the recurrent cell updates its state based on the current input, gives an output vector h and then moves on to the next time step. This is demonstrated in the “unrolled” RNN diagram above.
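As a concrete toy sketch of what these time-step inputs might look like in code, the snippet below turns the earlier example sentence into a sequence of word vectors, one per time step. The vocabulary and the 8-dimensional random embedding here are purely illustrative assumptions, not part of any real model:

```python
import numpy as np

# Toy example: one token per time step (vocabulary and vectors are made up)
tokens = ["It's", "a", "sunny", "day", "."]
vocab = {word: idx for idx, word in enumerate(tokens)}
embedding = np.random.randn(len(vocab), 8)   # 8-dimensional word vectors

# x_1, x_2, x_3, ... : the vector fed to the recurrent cell at each time step
inputs = [embedding[vocab[word]] for word in tokens]
for t, x_t in enumerate(inputs, start=1):
    print(f"time step {t}: word '{tokens[t - 1]}', input vector of shape {x_t.shape}")
```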

To calculate the current state of the recurrent cell at each time step, we need two weight matrices: a matrix W, which is multiplied by the current input, and a matrix U, which is multiplied by the previous state of the cell (its state at the previous time step). The two products are added, and a bias vector b can be added to the sum. The whole sum is then passed through an activation function like ReLU, Tanh or Sigmoid to form the new, updated state of the cell. (The activation function is used to introduce non-linearity into the network so that it can fit more complex functions.) So, the update formula can be written as:

$h_t = f(W \cdot x_t + U \cdot h_{t-1} + b)$, where $h_t$ is the cell state at time step t, $h_{t-1}$ is the cell state at the previous time step, $x_t$ is the cell input at time step t, and $f$ is the activation function.
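Here is a minimal NumPy sketch of this update, assuming a 4-dimensional input, an 8-dimensional state and Tanh as the activation (all of these choices are arbitrary, for illustration only):

```python
import numpy as np

input_size, state_size = 4, 8
W = np.random.randn(state_size, input_size)   # multiplies the current input x_t
U = np.random.randn(state_size, state_size)   # multiplies the previous state h_{t-1}
b = np.zeros(state_size)                      # bias vector

def rnn_step(x_t, h_prev):
    """One time step: h_t = f(W·x_t + U·h_{t-1} + b), here with f = tanh."""
    return np.tanh(W @ x_t + U @ h_prev + b)

# Process a short random sequence, carrying the state from one step to the next
h = np.zeros(state_size)
for x_t in np.random.randn(5, input_size):
    h = rnn_step(x_t, h)
print(h.shape)   # (8,) — the final cell state after reading the whole sequence
```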

The RNN

Many such Recurrent Neurons stacked one on top of the other (possibly with some Densely Connected Layers at the end) form a Deep Recurrent Neural Network, or DRNN.

  • A Deep Recurrent Neural Network. The outputs of the lower layers are fed as inputs to the upper layers (at each time step). For example, in the above figure, the output of the lowest layer at time step $t-1$ is fed as the input to the middle layer at time step $t-1$. With multiple recurrent units stacked one on top of the other, a DRNN can learn more complex patterns in sequential data.

The outputs from one recurrent unit at each time step can be fed as input to the next unit at the same time step. This forms a deep sequential model that can model a larger range of more complex sequences than a single recurrent unit.
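One common way to build such a stack in practice is with Keras, where return_sequences=True makes a recurrent layer emit its output at every time step so that the layer above can consume it. This is only a sketch under assumed layer sizes and input shape, not code from the article:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

# Three stacked recurrent layers followed by a dense output layer.
# The layer widths, the 16-dimensional input features and the single
# output are illustrative assumptions.
model = Sequential([
    SimpleRNN(64, return_sequences=True, input_shape=(None, 16)),  # lowest layer
    SimpleRNN(64, return_sequences=True),                          # middle layer
    SimpleRNN(64),                                                 # top layer: final state only
    Dense(1),                                                      # densely connected output
])
model.summary()
```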

Long Term Dependencies

Recurrent Neural Networks very often face the problem of long term dependencies. In many sequence modelling problems, we need information from long ago to make predictions about the next term(s) in a sequence. For example, suppose we want to predict the next word in the sentence “I grew up in Spain and I am very familiar with the traditions and customs of …..”. To predict the next word (which seems to be “Spain”), we need information about the word “Spain”, which is only the 5th word in the sentence, yet we are predicting the 17th word. This is a large time gap, and RNNs are prone to losing information given to them many time steps earlier. In practice, RNNs are unable to capture these long term dependencies.

Long Short Term Memory Networks

A special type of RNN called an LSTM Network was created to solve the problem of long term dependencies. The constituent cells of an LSTM network each have their own system of gates that decides what information from the sequence (text or audio), and how much of it, is stored in the cell’s state and how much is discarded at each time step. These gates regulate the state of the cell more effectively and help the cell retain information that it gained long ago. The gates are parametrized by weight matrices and bias vectors, and these parameters are trained using the Back Propagation algorithm. I would suggest colah’s blog post, Understanding LSTMs, for a more in-depth look at how LSTMs work.
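To make the idea of gates concrete, here is a minimal NumPy sketch of a single LSTM time step (the standard formulation with forget, input and output gates; the parameter names and sizes are my own illustrative choices, not taken from the article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step. p holds a (W, U, b) triple per gate."""
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])      # forget gate: what to discard
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])      # input gate: what to store
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])      # output gate: what to expose
    c_hat = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # candidate new information
    c_t = f * c_prev + i * c_hat          # keep part of the old state, add part of the new
    h_t = o * np.tanh(c_t)                # hidden state read off the gated cell state
    return h_t, c_t

# Randomly initialised parameters for a 4-dimensional input and 8-dimensional state
n_in, n_state = 4, 8
p = {}
for g in ("f", "i", "o", "c"):
    p["W" + g] = np.random.randn(n_state, n_in)
    p["U" + g] = np.random.randn(n_state, n_state)
    p["b" + g] = np.zeros(n_state)

h, c = np.zeros(n_state), np.zeros(n_state)
for x_t in np.random.randn(5, n_in):      # read a short random sequence
    h, c = lstm_step(x_t, h, c, p)
```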

Sequence Modelling Applications

Some applications of sequence modelling are:

  • Image Captioning (with the help of computer vision): generating captions for images (captions are sequential data).
  • Video Frame Prediction (with the help of computer vision): predicting the subsequent frames of a video given the previous ones (the frames in a video are sequential in nature).
  • Genre Classification: classifying songs (audio) as Jazz, Rock, Pop etc. (audio is sequential in nature).
  • Composing Music (music is sequential in nature).

There are many more applications; these are just a few.

This concludes this article on Sequence Modelling, and RNNs in particular. I hope it helped you understand the basics of Sequence Modelling better.

Acknowledgements

  1. Colah Blog: Understanding LSTMs
