NLP using RNN — Can you be the next Shakespeare?

Shaan Kohli · Published in Analytics Vidhya · 8 min read · Apr 20, 2020

Co-author: Venkatesh Chandra

Predictive Keyboard on iOS device

Have you ever wondered how the predictive keyboard on your smartphone works? In this article, we explore the idea of generating text from prior information. Specifically, we will generate passages of 16th-century literature using Recurrent Neural Networks (RNNs) and Natural Language Processing (NLP) on Google Colab. The idea is simple: we feed a model a sample Shakespeare play and ask it to generate fake passages while maintaining the same vernacular. While a predictive keyboard suggests the single best next word for an incomplete sentence, we make the task a bit harder by using a single word to generate an entire section of a Shakespeare play.

A Portrait of William Shakespeare (Or is it?)

Understanding NLP and RNNs

Let’s first refresh the concept of RNNs for NLP. RNNs are widely used for forecasting; their main constraint is that the data must be a sequence, such as a time series. NLP is a field of Artificial Intelligence that gives machines the ability to read, understand, and find patterns in textual data.

Think of it this way: we can convert the letters in a text into numbers, feed them into an RNN model, and have it generate the most likely next characters (sounds like forecasting, right?).

Variations of RNNs

Diagram representing internal mechanisms for different RNNs

RNNs possess a looping mechanism that acts as a path to allow information to flow from one step to the next. This information is the hidden state, which is a representation of previous inputs.

RNNs come in many variations, most commonly LSTMs (Long Short-Term Memory networks). In this article, we will use a lesser-known variation called Gated Recurrent Units (GRUs). The key distinction between simple RNNs and GRUs is that the latter support gating of the hidden state. As mentioned previously, the hidden state carries information from prior timesteps; what differs is how that information is passed on. GRUs have dedicated mechanisms that decide when the hidden state should be updated and when it should be reset.

Don’t worry if the previous paragraph went over your head! LSTMs and GRUs are a bit difficult to grasp on the first go. To summarize, GRUs are quite similar to LSTMs; the main difference is that a GRU has no cell state and uses the hidden state alone to pass information. A GRU has two gates: an Update Gate and a Reset Gate. The Update Gate acts similarly to the forget and input gates of an LSTM: it decides what information to throw away and what new information to add. The Reset Gate decides how much past information to forget. For a detailed explanation, you can watch this video.
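To make the gating concrete, here is a minimal NumPy sketch of a single GRU step (our addition, not from the article; biases are omitted and the weight matrices are placeholders you would normally learn during training):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    # Update gate: how much of the new candidate state to let in
    z = sigmoid(x_t @ W_z + h_prev @ U_z)
    # Reset gate: how much of the previous hidden state to forget
    r = sigmoid(x_t @ W_r + h_prev @ U_r)
    # Candidate hidden state, computed from the reset-gated history
    h_tilde = np.tanh(x_t @ W_h + (r * h_prev) @ U_h)
    # Blend the old state and the candidate using the update gate
    return (1 - z) * h_prev + z * h_tilde

Keras hides all of this behind a single GRU layer, which is what we use below.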

Now which one is right for us: a simple RNN, an LSTM, or a GRU? Like most things in life, nothing is clear-cut. Everything depends on the use case, the amount of data, and the performance required, so the decision is up to each individual!

Generating Shakespeare Plays using GRUs

We will now use the text of the play Romeo and Juliet to generate some “fake passages” that mimic 16th-century literature. The text was extracted from Project Gutenberg (https://www.gutenberg.org/).

Dataset link — https://www.gutenberg.org/ebooks/1112

You can delete the book’s front matter (the table of contents and acknowledgments) from the .txt file; this will result in a better model.

We will develop a model that uses a prior sequence of characters to predict the next most probable character. We must be careful about how many characters we use. On one hand, a very long sequence requires a lot of training time and will most likely overfit to characters that are irrelevant to those much farther along. On the other hand, too short a sequence will underfit the model. We therefore build an intuition from the text itself: based on the length of typical phrases, we will use a single word to seed the generation of the next 180 characters.

Time to jump into action! Follow the steps below.

Note: the Google Colab link is provided below.

Step 1: Data Import & Essential Functionalities

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

Import the dataset (Google Colab example):
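The uploaded dictionary used in the next line comes from Colab’s file-upload widget; the article does not show this step, but a typical way to get it is:

from google.colab import files
uploaded = files.upload()  # opens a file picker and returns a dict of filename -> bytes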

input_text = uploaded['romeo_juliet.txt'].decode("utf-8")

Step 2: Data pre-processing

Create a unique set of characters:

letter_corpus = sorted(set(input_text))
letter_corpus[:10]

Encode the characters as numbers:

char_to_ind = {u:i for i, u in enumerate(letter_corpus)}  # character -> index
ind_to_char = np.array(letter_corpus)                     # index -> character
encoded_text = np.array([char_to_ind[c] for c in input_text])
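As a quick sanity check (our addition, not in the original article), you can round-trip a short string through the two mappings:

sample = "Romeo"
encoded = [char_to_ind[c] for c in sample]
decoded = "".join(ind_to_char[i] for i in encoded)
print(encoded, decoded)  # the decoded string should read "Romeo" again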

Step 3: Check sequence

Each line of the play is roughly 43 characters long; if we capture about four lines (the partial stanza below is 181 characters), the model should be able to pick up the patterns and learn them.

part_stanza = """Chor. Two households, both alike in dignity,
In fair Verona, where we lay our scene,
From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean"""

len(part_stanza)
# Output - 181

Step 4: Create training sequences

seq_len = 180
total_num_seq = len(input_text)//(seq_len+1)
# We obtain 972 sequences

char_dataset = tf.data.Dataset.from_tensor_slices(encoded_text)
sequences = char_dataset.batch(seq_len+1, drop_remainder=True)
# drop_remainder=True drops the last batch if it has fewer than seq_len+1 characters

Map each sequence to an (input, target) pair and build the dataset.
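The call below uses create_seq_targets, which the article never defines; a minimal version, assuming the standard one-character shift between input and target, would be:

def create_seq_targets(seq):
    input_txt = seq[:-1]   # all characters except the last
    target_txt = seq[1:]   # the same characters shifted one step ahead
    return input_txt, target_txt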

dataset = sequences.map(create_seq_targets)

Step 5: Create batches

batch_size = 1
buffer_size = 10000

dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)
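If you want to confirm the shapes (a quick check of our own, not part of the original walkthrough), you can peek at a single batch:

for input_example, target_example in dataset.take(1):
    print(input_example.shape, target_example.shape)  # expect (1, 180) and (1, 180)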

Time to create our network.

vocab_size = len(letter_corpus)
embed_dim = 64
rnn_neurons = 1026

Import the TensorFlow model, layers, and loss function:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM,Dense,Embedding,Dropout,GRU
from tensorflow.keras.losses import sparse_categorical_crossentropy

We define the loss function

def sparse_cat_loss(y_true, y_pred):
    return sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)

Create the model

def create_model(vocab_size, embed_dim, rnn_neurons, batch_size):
    model = Sequential()
    model.add(Embedding(vocab_size, embed_dim, batch_input_shape=[batch_size, None]))
    model.add(GRU(rnn_neurons, return_sequences=True, stateful=True,
                  recurrent_initializer='glorot_uniform'))
    # Final Dense layer to predict
    model.add(Dense(vocab_size))
    model.compile(optimizer='adam', loss=sparse_cat_loss, metrics=['accuracy'])
    return model

model = create_model(vocab_size=vocab_size,
                     embed_dim=embed_dim,
                     rnn_neurons=rnn_neurons,
                     batch_size=batch_size)
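Optionally, run a single batch through the untrained model to verify the output shape (this check is our addition):

for input_example, target_example in dataset.take(1):
    example_preds = model(input_example)
    print(example_preds.shape)  # expect (batch_size, sequence_length, vocab_size)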

Our model looks like this now:

model.summary()
Model architecture

Time to Train

Muhammad Ali in action, just like our model!

Set the epochs to 30

epochs = 30

Train the model. Note that this will take some time; it took us ~40 minutes.

model.fit(dataset,epochs=epochs)

Model Evaluation

We use the code below to save the model history and plot the reported metrics:

losses = pd.DataFrame(model.history.history)

losses[['loss','accuracy']].plot()
GRU model training results

Notice how the loss decreases until about the 20th epoch and then shoots up. The highest accuracy obtained is 86.03%, at the 18th epoch, so we train our final model for 18 epochs.
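Rather than re-training by hand, one option is to checkpoint the best epoch during the original run; this is a sketch of our own (the filename is hypothetical), not what the article did:

from tensorflow.keras.callbacks import ModelCheckpoint

# Keep only the weights from the epoch with the lowest training loss
checkpoint = ModelCheckpoint('best_gru_weights.h5', monitor='loss',
                             save_best_only=True, save_weights_only=True)
model.fit(dataset, epochs=epochs, callbacks=[checkpoint])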

Generating the Text

We define a function (without fixing a random seed) to generate text from a starting seed of a single word. Had we trained the model on two-word sequences, it would likely be more powerful, but the training time would increase.

def generate_text(model, start_seed, gen_size=100, temp=1.0):
    num_generate = gen_size
    input_eval = [char_to_ind[s] for s in start_seed]
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []
    temperature = temp
    model.reset_states()
    for i in range(num_generate):
        # Generate predictions
        predictions = model(input_eval)
        # Remove the batch shape dimension
        predictions = tf.squeeze(predictions, 0)
        # Use a categorical distribution to select the next character
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        # Pass the predicted character as the next input
        input_eval = tf.expand_dims([predicted_id], 0)
        # Transform back to a character
        text_generated.append(ind_to_char[predicted_id])
    return (start_seed + ''.join(text_generated))

For generation, use the code below; you only need to specify the starting word and the number of characters to generate.

print(generate_text(model, "But", gen_size=1000))
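The temp argument is also worth experimenting with (our note, not the article’s): lower temperatures produce safer, more repetitive text, while higher temperatures produce more surprising but more garbled output.

print(generate_text(model, "But", gen_size=500, temp=0.5))  # more conservative
print(generate_text(model, "But", gen_size=500, temp=1.2))  # more adventurous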

Output

We obtained the below output for the word “flower”

flowers, with power,
thitiest ince to speak.
Enter Lady, packed turnare didomaid,
O hands the fair creastrim! bystalt me this sare!
Out you? O, pies, peach, ar, and outsides.
Enter Julie.Cep.’ Hath you, Caup with such scater, ose must reports! colal, with so smally,
‘Year ‘ads-noods withal.
Cap. Ay thou hast thou been shopy sender hase.Ey’ WAtch make close curtain, with the humour times.
O, good night, s oppriwite up, in displayd-by so night raught
Shall back that hous shalt confurers to away in this?
Jul. He case us borny, my chall, is fould wish permission.
Give me thy shrew, so bir, sighs all,
Apphrel thee but but my lips?
Jul. Ay, noinot makes, I cave me not doth she country.Man. The sorisim! O O, Capulet,
Mush fairence the murte-baggage of Montaghous.
Where viewild you but ny yo,
Her simps to the-
Ben.

Notice how our model produces the character names Jul. and Ben. It also picks up structural patterns, ending sentences with punctuation, and mimics 16th-century prose with words such as “Ey”, “thee”, and “thou”.

Output for the word “But”

But,
Who say wethith is the day purg’d which your bight!
Are providunity nurse, that mark much soul-
D.ASCup on phforight or verfain, is doth compirications comes and curnais,
How?
Allotions back,
Where I sear and kindroud.
A plaguage of gracksten, creptain!
Show her plamangled us, now, sir?
Wife. Spaker, you, sir.Cap. What [and] Buy Halth will’dinging, non, and pular our soul
And lovely dreamerly eress murkdeds
Whose she beshes eyes will be in thy brace!
Enter Appraide his banished.Ben. I can you like whose’s keaus.Speak. ’Tis answer ‘I’ shall up, corpudin!
She [and by] Well, sight as as a know this may be the hight comes Refuchis works corns.
Par. So am I conduct and Montague, Commend.Extut may cell till me comes she wret?
Marry, the maid shrifid- grovimeo,
Whoce arm Louren lover’d right.
Th

The model is so good that it even has the Wife address the man as “sir”!

Conclusion

We see that our model mimics the way Romeo and Juliet was written. Notice how lines begin with references to the characters, and the entire vernacular is copied.

In addition to training the model on Romeo and Juliet, we took a similar approach with other texts, such as Pride and Prejudice and car reviews from Edmunds. While training on the former showed promise, the latter did not meet expectations. For the reviews, the model could not find patterns, most likely because of the way reviews are written: different people have different writing styles, which makes it hard for the model to mimic the prose.

In the future, it would be interesting to explore such an approach on tweets and see how a model like this could be used to generate fake tweets. But why stop at tweets? We could also look at fake online articles or even fake WhatsApp news (especially during elections).

You can find the links to the code below:

Connect with us on LinkedIn for more stories.
