Language modelling using Recurrent Neural Networks - Part 2

Tushar Pawar · Published in Praemineo · 7 min read · Dec 13, 2017

This is a 3-part series in which I will cover:

  1. Introduction to RNN and LSTM.
  2. Building a character by character language model using TensorFlow. (This post)
  3. Building a word by word language model using Keras.

In the previous part, we covered the basics of RNNs and LSTMs. In this part, we’ll use TensorFlow to build our own language model with LSTMs. All the code has been shared on GitHub.

The training data

Huge thanks to the folks at IMSDB for maintaining the scripts of a huge number of TV shows and movies. I used this code to download all the scripts; thanks to Jeremy Kun on GitHub. I have also included the scripts of my favourite TV show, FRIENDS, which I scraped from this site. Thanks to Nikki at LivesInABox.

After downloading all the scripts, I concatenated them into a single file which turned out to be 2.6MB. This is a small snippet of the training data.

The model

The size of the model depends on various factors, such as how much computing power you have, how much memory you have, and the size of your training data. We’ll be building a model with a single layer of 1024 LSTM cells.

preeminence_utils is a library we built as a collection of utility functions that save time on common tasks such as saving and restoring a model, model visualisation, etc. You can check it out at our GitHub repo.

vocab is a variable that holds all the characters that appear in the training data, including letters, numbers and punctuation marks. characters2id and id2characters are dictionaries that map each character in vocab to an index and each index back to a character, respectively. They are used to convert the training data into one-hot encodings for training and to convert predictions back into characters for testing. We choose section_length to be 50, so the model takes 50 characters as input and predicts the character most likely to come next. Fifty characters capture the context to an acceptable extent; a larger value can also be used, but at the cost of memory.
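Since the code itself lives on GitHub, here is a minimal sketch of this preprocessing step, assuming the concatenated scripts sit in a single file (the scripts.txt name is an assumption):

```python
# Minimal sketch of the vocabulary setup; the file name is an assumption.
corpus = open('scripts.txt').read()

vocab = sorted(set(corpus))            # every distinct character in the corpus
vocab_length = len(vocab)              # 111 for this dataset

characters2id = {c: i for i, c in enumerate(vocab)}
id2characters = {i: c for i, c in enumerate(vocab)}

section_length = 50                    # characters of context fed to the model
```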

From lines 11–15, the corpus is divided into chunks of 50 characters each, and each chunk is appended to the sections list. The chunk is the input text, and the character that follows the 50-character section is its output label, which is appended to the section_labels list. Also, notice that I’m skipping 10 (step) characters between chunks. This is done to save memory; a step size of 1 gives the model more training data and better performance.
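Roughly, that sectioning step looks like the following sketch:

```python
step = 10                              # skip 10 characters between chunks to save memory
sections = []
section_labels = []

# Slide a 50-character window over the corpus, stepping `step` characters at a time.
for i in range(0, len(corpus) - section_length, step):
    sections.append(corpus[i:i + section_length])       # input chunk
    section_labels.append(corpus[i + section_length])    # the character that follows it
```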

From lines 17–22, the data is one-hot encoded, which means we replace each character with an array of zeros the size of our vocabulary and set the index of that character to 1. The shapes of our input data and label data are (260539, 50, 111) and (260539, 111) respectively. Here 260539 is the number of samples and 111 is the size of the vocabulary. Each input sequence has shape (50, 111) and its corresponding label has shape (1, 111).
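A sketch of the one-hot encoding, using the sections and section_labels lists built above:

```python
import numpy as np

X_data = np.zeros((len(sections), section_length, vocab_length), dtype=np.float32)
y_data = np.zeros((len(sections), vocab_length), dtype=np.float32)

# Set a 1 at the index of each character; everything else stays 0.
for i, section in enumerate(sections):
    for j, char in enumerate(section):
        X_data[i, j, characters2id[char]] = 1.0
    y_data[i, characters2id[section_labels[i]]] = 1.0
```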

Inputs: RED and Labels: GREEN

We initialise a new model using the tf_utils library, which returns a model object. We get the graph for that object using the model.init() function and set it as the default graph so that we can add tensors to it. We also initialise some hyperparameters such as batch_size and hidden_nodes, i.e. the number of LSTM cells.

Here we define our input and label placeholders. Notice that the shape of X is [None, section_length, vocab_length], which is (batch_size, 50, 111), and the shape of Y is [None, vocab_length], which is (batch_size, 111).
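In TensorFlow 1.x terms, the placeholders and hyperparameters look roughly like this (the batch size value here is an assumption):

```python
import tensorflow as tf

batch_size = 512        # assumption: pick whatever fits in memory
hidden_nodes = 1024     # number of LSTM cells in the single layer

# None lets us feed batches of any size at run time.
X = tf.placeholder(tf.float32, shape=[None, section_length, vocab_length])
Y = tf.placeholder(tf.float32, shape=[None, vocab_length])
```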

LSTM model (r2rt.com)

Here W and b are the output weights and biases that are used to get a prediction from the LSTM model. We define an lstm function that takes input data, weights and biases and returns a prediction for those inputs. In the image above, n is our hidden_nodes variable. The list at the bottom, i.e. rnn_inputs, is our X placeholder, and the list at the top, i.e. predictions, is the collection of outputs of every cell in our model; this corresponds to the outputs variable on line 11. The states variable on line 11 is the list in the middle, i.e. rnn_outputs. We only care about the last output of our model, since it captures the context of the input so far. On line 12 we return the final prediction by multiplying the last output by the weights and adding the bias. Check out this article by r2rt for an in-depth explanation of the structure.
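A sketch of such an lstm function, written against TensorFlow 1.x’s static RNN API (the weight initialisers here are assumptions):

```python
# Output projection from the 1024-dimensional LSTM state to the 111-way vocabulary.
W = tf.Variable(tf.truncated_normal([hidden_nodes, vocab_length], stddev=0.1))
b = tf.Variable(tf.zeros([vocab_length]))

def lstm(x, weights, biases):
    # Split (batch, 50, 111) into a list of 50 tensors of shape (batch, 111),
    # one per time step, as tf.nn.static_rnn expects.
    x = tf.unstack(x, section_length, axis=1)
    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_nodes)
    outputs, states = tf.nn.static_rnn(cell, x, dtype=tf.float32)
    # Only the last output matters: it summarises the whole 50-character context.
    return tf.matmul(outputs[-1], weights) + biases
```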

Now that we have defined the structure of our model, we need to define some training and evaluation metrics.

On line 1, we call the lstm function to get our logits and pass them through a softmax function to get the predictions. We use cross-entropy loss to compute our gradients and a gradient descent optimizer to train our network. You can learn more about these in the previous series, Neural Networks.
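Sketched out (the learning rate is an assumption):

```python
logits = lstm(X, W, b)
predictions = tf.nn.softmax(logits)

# Cross-entropy loss against the one-hot labels, trained with plain gradient descent.
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)
```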

Let’s train our model.

A call to the model.session() function returns a TensorFlow session. If this is the first run, line 2 initialises all our variables. If we’re continuing training from a previous run, the model.restore_weights() function takes care of detecting the latest weights and restoring the model with them. Similarly, the model.train() function runs the training epochs on the data, taking care of batching and printing the progress, and the model.save() function saves the current weights to the given location.
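For reference, this is roughly the boilerplate those helpers wrap, written in plain TensorFlow 1.x; the checkpoint path, epoch count and batching scheme are illustrative assumptions, not the library’s actual internals:

```python
saver = tf.train.Saver()

with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint('checkpoints/')
    if ckpt:
        saver.restore(sess, ckpt)                       # continue a previous run
    else:
        sess.run(tf.global_variables_initializer())     # first run: fresh weights

    for epoch in range(310):
        for i in range(0, len(X_data) - batch_size, batch_size):
            batch_x = X_data[i:i + batch_size]
            batch_y = y_data[i:i + batch_size]
            _, batch_loss = sess.run([optimizer, loss],
                                     feed_dict={X: batch_x, Y: batch_y})
        print('epoch', epoch, 'loss', batch_loss)
        saver.save(sess, 'checkpoints/char_lstm', global_step=epoch)
```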

Using the preeminence_utils.tf_utils library is really helpful in reducing the hassle of restoring weights, printing training progress and saving weights. It cuts the training code down from 20–25 lines to just 6. It’s really easy to get started with; check out the library at our GitHub repo.

This function is used to pick a random character from the prediction according to its probabilities. The output of our network is an array giving the probability of each character in the vocabulary being the next character. This function adds a randomisation factor to these probabilities, controlled by the temperature, and returns the new probabilities.
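A common way to implement this kind of temperature sampling is the following sketch. A temperature of 1.0 leaves the distribution unchanged; values below 1 make it sharper and the output more conservative, while values above 1 flatten it and make the output more surprising.

```python
def sample(prediction, temperature=1.0):
    # Re-weight the softmax probabilities by the temperature and renormalise.
    prediction = np.asarray(prediction, dtype=np.float64)
    prediction = np.log(prediction + 1e-10) / temperature
    prediction = np.exp(prediction)
    return prediction / np.sum(prediction)
```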

It took me around 6–7 hours to train this network on AWS for 310 epochs with step_size=1. Now that we have our trained model, let’s test its output.

On line 3, we restore the weights into our model. start_index is a random index into the text that we use as our starting point. We feed this seed text into our model, continuously append the output character to the input, and keep generating new characters. Here’s what the output looks like over the course of training.
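A sketch of that generation loop, assuming the sample function above and 1000 generated characters:

```python
import random

saver = tf.train.Saver()

with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints/'))

    start_index = random.randint(0, len(corpus) - section_length - 1)
    seed = corpus[start_index:start_index + section_length]
    generated = seed

    for _ in range(1000):
        # One-hot encode the current 50-character window.
        x = np.zeros((1, section_length, vocab_length), dtype=np.float32)
        for j, char in enumerate(seed):
            x[0, j, characters2id[char]] = 1.0

        probs = sess.run(predictions, feed_dict={X: x})[0]
        next_id = np.random.choice(vocab_length, p=sample(probs, temperature=0.5))
        next_char = id2characters[next_id]

        generated += next_char
        seed = seed[1:] + next_char            # slide the window forward by one character

    print(generated)
```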

As you can see, the model learned the structure of the script and remembered to start a new line after a couple of words instead of producing one huge line. It also learned the names of the characters, and that every dialogue starts with a capitalised name with the dialogue on the next line.
