Language modeling using Recurrent Neural Networks, Part 1

Tushar Pawar · Published in Praemineo · 5 min read · Dec 4, 2017

This is a 3-part series in which I will cover:

  1. Introduction to RNNs and LSTMs. (This post)
  2. Building a character-by-character language model using TensorFlow.
  3. Building a word-by-word language model using Keras.

First of all, let's get motivated to learn about Recurrent Neural Networks (RNNs) by seeing what they can do and how robust, and sometimes surprisingly effective, they can be. This amazing blog post by Andrej Karpathy will help you get started.

Language modelling

Our aim here is to build a neural network that can learn the structure and syntax of language. We'll provide a huge set of dialogues from the scripts of two of my favorite shows, F.R.I.E.N.D.S and South Park, as training data and hope that our model learns to talk like the characters.
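As a small preview of what "providing dialogues as training data" looks like in code, here is a minimal sketch of encoding raw script text into integer sequences. The file name scripts.txt and the character-level encoding are my own assumptions for illustration; the real data pipeline comes in parts 2 and 3.

```python
# Minimal sketch: turn raw script text into integer sequences a network can consume.
# "scripts.txt" is a hypothetical file containing the concatenated show dialogues.
with open("scripts.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Build a character-level vocabulary: every distinct character gets an integer id.
chars = sorted(set(text))
char_to_id = {c: i for i, c in enumerate(chars)}
id_to_char = {i: c for c, i in char_to_id.items()}

# Encode the whole corpus as a list of ids; this is what the model will train on.
encoded = [char_to_id[c] for c in text]
print(len(chars), "unique characters,", len(encoded), "total characters")
```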

RNNs

Basic feed-forward neural networks are used when we have a set of distinct inputs and expect an output such as a class or a real number, under the assumption that all inputs are independent of each other. RNNs, on the other hand, are used when the inputs are sequential. This is perfect for modelling language, because language is a sequence of words and each word depends on the words that come before it. If we want to predict the next word in a sentence, we need to know the previous words. This is analogous to the fact that the human brain does not start thinking from scratch for every word we say. Our thoughts have persistence. We'll give this property of persistence to the neural network by using an RNN.

A very famous example is "I had a good time in France. I also learned to speak some _________". If we want to predict the word that goes in the blank, we have to go all the way back to the word France and then conclude that the most likely word is French. In other words, we need some memory of our previous outputs and must calculate new outputs based on it.

In practice, memory constraints limit RNNs to remembering only a few steps back. For language modelling this is usually fine, because the context of a word is largely captured by the 8–10 words before it; we don't need to remember 50 words of context. That said, RNNs can in principle handle arbitrarily long sequences if you have the memory at your disposal.

The general structure of an RNN looks like this:

Structure of basic RNN

While designing the RNN, we only have to care about the top structure. The structure at the bottom is the same RNN unrolled through time: if we want to remember 8 words in the past, it effectively becomes an 8-layer neural network. xₜ is the input to the network at time step t; for example, x₁ is the input at index 1 in the sequence. sₜ is the hidden state of the network at time t. It is the memory of the network. U, W, and V are the weight matrices, learned at training time just like the weights of a normal neural network. The hidden state is calculated from the current input (weighted by U) and the hidden state of the previous time step (weighted by W): sₜ = f(Uxₜ + Wsₜ₋₁), where f is a non-linearity, typically tanh or ReLU. The output at time t is then computed from the hidden state using V, typically as oₜ = softmax(Vsₜ). The fact that U, V, and W are the same at every time step means the network shares the same parameters across all layers: it performs the same operation at each step, just on different inputs.
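To make the parameter sharing concrete, here is a minimal NumPy sketch of the forward pass described above. The dimensions, initialization, and softmax output are my own illustrative assumptions, not the model we will build in the later parts.

```python
import numpy as np

# Dimensions are arbitrary choices for illustration.
vocab_size, hidden_size = 50, 16

# The SAME U, W, V are reused at every time step (parameter sharing).
U = np.random.randn(hidden_size, vocab_size) * 0.01   # input -> hidden
W = np.random.randn(hidden_size, hidden_size) * 0.01  # previous hidden -> hidden
V = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output

def rnn_forward(inputs):
    """inputs: list of one-hot column vectors x_t, each of shape (vocab_size, 1)."""
    s = np.zeros((hidden_size, 1))  # s_0: initial hidden state (the "memory")
    outputs = []
    for x_t in inputs:
        s = np.tanh(U @ x_t + W @ s)            # s_t = f(U x_t + W s_{t-1})
        o = V @ s                               # unnormalized scores over the vocabulary
        p = np.exp(o - o.max()); p /= p.sum()   # softmax -> next-symbol probabilities
        outputs.append(p)
    return outputs
```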

RNN Cons

So far we have discussed the things RNNs can do, but they are not perfect. As the context length increases, i.e. as our dependencies become longer, the number of layers in the unrolled RNN also increases. Consequently, the network suffers from the vanishing gradient problem: as the network becomes deeper, the gradients flowing back during backpropagation become smaller and smaller. As a result, learning becomes extremely slow, and it becomes infeasible for the network to capture the long-term dependencies of the language. LSTMs to the rescue!
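You can watch the gradient vanish numerically in a few lines of NumPy. The toy setup below is my own assumption, not part of the series' code: it chains the per-step Jacobians that backpropagation multiplies together and prints how quickly their norm shrinks.

```python
import numpy as np

np.random.seed(0)
hidden_size, steps = 16, 50
W = np.random.randn(hidden_size, hidden_size) * 0.1   # small recurrent weights
s = np.random.randn(hidden_size, 1)                   # initial hidden state

grad = np.eye(hidden_size)                             # d s_0 / d s_0
for t in range(steps):
    s = np.tanh(W @ s)                                 # one recurrent step
    jac = np.diag(1 - s.ravel() ** 2) @ W              # Jacobian d s_{t+1} / d s_t
    grad = jac @ grad                                  # chain rule across one more step
    if t % 10 == 9:
        print(f"after {t + 1} steps, gradient norm = {np.linalg.norm(grad):.2e}")
```

The norm decays exponentially with the number of steps, which is exactly why a plain RNN struggles to learn dependencies that span many words.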

Long Short-Term Memory (LSTM)

LSTM networks are an advanced version of the plain RNNs we discussed above. These networks are capable of remembering long-term dependencies: they are designed to retain information for long periods of time without suffering from the vanishing gradient problem. The structural difference between a plain RNN and an LSTM is as follows.

Basic RNN Structure

In the basic RNN, each layer consists of a simple function f(Uxₜ + Wsₜ₋₁), where f is an activation function such as tanh or ReLU. While training, the network backpropagates through all of these layers and consequently suffers from the vanishing gradient problem.

An LSTM cell adds a few extra layers to the basic RNN cell. These additional layers form what are called the memory and forget gates.

In the LSTM cell, the vanishing gradient problem is addressed by writing the current state into the memory of the network, a process regulated by the memory and forget gates. The intuition is the same as ours: if you want to remember something, you write it down. However, writing down every step is tedious and quickly gets out of hand if it is not regulated; if everything is written, the memory blows up.

To tackle this problem, LSTMs are designed to be selective about three things.

  • Writing selectively: only the important things should be written down. The network learns which features are important and writes those down.
  • Reading selectively: read only the features that are relevant to the current state.
  • Forgetting selectively: throw away the stuff we don't need anymore to make room for new stuff.

All these actions are performed by the memory and forget gates.
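To make the gating concrete, here is a minimal NumPy sketch of a single LSTM step, showing how the forget, input ("write"), and output ("read") gates regulate the cell memory. The weight names and shapes are illustrative assumptions; see the R2RT post linked below for the full derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step. params holds weight matrices Wf, Wi, Wo, Wc and biases
    (hypothetical names), each of shape (hidden_size, hidden_size + input_size)."""
    z = np.vstack([h_prev, x_t])                         # previous output concatenated with input
    f = sigmoid(params["Wf"] @ z + params["bf"])         # forget gate: what to erase from memory
    i = sigmoid(params["Wi"] @ z + params["bi"])         # input gate: what to write selectively
    o = sigmoid(params["Wo"] @ z + params["bo"])         # output gate: what to read selectively
    c_tilde = np.tanh(params["Wc"] @ z + params["bc"])   # candidate new memory content
    c = f * c_prev + i * c_tilde                         # updated cell state (the "memory")
    h = o * np.tanh(c)                                   # hidden state exposed to the next step
    return h, c
```

Because the cell state c is updated additively rather than being squashed through an activation at every step, gradients can flow through it over many time steps without vanishing.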

Check out this deeper explanation of the structure of LSTMs by R2RT.

In the next part, we'll use TensorFlow to build our own language model.

References:

  1. Colah’s blog — Understanding LSTM Networks
  2. Andrej Karpathy blog — The Unreasonable Effectiveness of Recurrent Neural Networks
  3. R2RT — Written Memories: Understanding, Deriving and Extending the LSTM
  4. WildML — Recurrent Neural Networks Tutorial
