RNN: Training An English Major

Derek Zhang
10 min read · Apr 17, 2019


So at this point in FIRE, you should know you can’t just feed sentences into a neural network.

You’re missing the keyword:

TENSORS!!!

That sounds scary, but it's really just a fancy word for multidimensional arrays; trust me.
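For instance, every batch we feed the network later in this article is literally just a NumPy array. A minimal sketch (the sizes 32, 32, and 62 match the batch size, context length, and vocabulary size we end up using later):

import numpy as np

# one batch: 32 sequences, each 32 characters long,
# each character one-hot encoded over 62 possible symbols
batch = np.zeros((32, 32, 62))
print(batch.ndim)   # 3 axes -> a "rank-3 tensor"
print(batch.shape)  # (32, 32, 62)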

I will be demonstrating how to train an RNN to write like Shakespeare.

First, we will deal with the data.
Then we will define our model architecture and, finally, do the training.

For simplicity, the network will predict at the character level. So given a sequence of characters, predict the next characte. (See what I did there?)

Keep in mind that while the code in this article seems polished, I had to struggle a lot. Yes, even peer mentors aren't neural net gurus. We're not there yet, but we're getting there…

Review

First, let’s review what an RNN’s architecture is:

Derived image. Original by François Deloche from Wikimedia Commons.

So you essentially have a sequence of n inputs denoted by x. And then you want to generate another sequence of outputs denoted by o.

When we unroll this, we can see more clearly the role the hidden state plays:

You can think of the hidden layer as the “memory” of the network or, if you are doing natural language processing, the “context”.
This is a “simplified” version of what we will actually do. (Depending on your perspective.)

RNNs are extremely helpful for time series data.
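To make the recurrence concrete, here is a minimal NumPy sketch of a single step of a vanilla RNN. (This is not the LSTM we will actually train; the names W_x, W_h, and b are just for illustration.)

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # the new hidden state mixes the current input with the previous "memory"
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

# toy sizes: 62-dimensional one-hot input, 62-dimensional hidden state
W_x = np.random.randn(62, 62) * 0.01
W_h = np.random.randn(62, 62) * 0.01
b = np.zeros(62)

h = np.zeros(62)
for x_t in np.eye(62)[:5]:              # pretend these are the first 5 characters
    h = rnn_step(x_t, h, W_x, W_h, b)   # the same weights are reused at every step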

Data

Where is this data?

I am getting data from here:

http://cs.stanford.edu/people/karpathy/char-rnn/shakespear.txt

Backup link:
https://drive.google.com/file/d/1zkVhSAO5xddlzg59nGOQHrTZzNXTQjbd/view?usp=sharing

Karpathy’s blog covers training a char-RNN too, and his code is in Torch (Lua). (I will be using Keras, of course.)

Since this data is small, we can just load everything in:

with open("shakespear.txt") as f:
    text = f.read()

uniques = list(set(text))
# sort for consistent id generation across runs
uniques.sort()
char2id = {k: i for i, k in enumerate(uniques)}
id2char = {i: k for i, k in enumerate(uniques)}

You will notice this is highly naive, as will be the rest of the article.

Why?

The model has to learn lowercase and uppercase letters. That is fine, but there are also newline characters and various punctuation marks.

I decided to just go with this for simplicity, and because I found the data is not very noisy.
(It’s Shakespeare! You will fall asleep… no noise. Just kidding, I actually really like the “to be or not to be”.)
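If you want to see exactly what the model has to deal with, just print the vocabulary we built above:

# quick look at the vocabulary the model has to learn
print(len(uniques))            # 62 characters for this file (hence the 62s in the model code later)
print(repr(''.join(uniques)))  # newlines, punctuation, upper- and lowercase letters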

Pre-processing

Now onto converting characters to tensors. The simplest way, as you might guess, is to one hot encode them.

import numpy as np

def to_tensor(char2id, text):
    # one-hot encode each character: shape (len(text), vocab_size)
    tensor = np.zeros((len(text), len(char2id)))
    for i, e in enumerate(text):
        tensor[i, char2id[e]] = 1
    return tensor

How do we test or perform a sanity check?

def to_text(id2char, tensor):
    # decode a one-hot tensor back into a string (sanity check for to_tensor)
    char_list = []
    assert len(id2char) == tensor.shape[1]
    # np.where gives the (row, column) indices of every 1 in the tensor
    first, second = np.where(tensor == 1)
    assert first.shape[0] == tensor.shape[0]
    for i in range(second.shape[0]):
        char_list.append(id2char[second[i]])
    return ''.join(char_list)

Make a decoder, of course! To test, run this and verify that the two printed strings are the same:

# test with first 8 characters
print(text[:8])
test = to_tensor(char2id, text[:8])
print(to_text(id2char, test))

Now we can write a generator!
Note that I have chosen the “context” length to be 32 characters for faster training. This is how many characters the RNN is given to predict the next character.

ARBITRARY_LENGTH = 32

def the_generator(char2id, text, batch_size):
    # minus one so there is always a target character after the context window
    l = len(text) - ARBITRARY_LENGTH - 1
    picker = np.arange(0, l)
    np.random.shuffle(picker)

    up_to = l - batch_size
    i = 0
    while True:
        if i >= up_to:
            i = 0
            np.random.shuffle(picker)
        input_list = []
        target_list = []
        for j in range(batch_size):
            index = picker[i + j]
            input_list.append(to_tensor(char2id,
                              text[index : index + ARBITRARY_LENGTH]))
            target_list.append(to_tensor(char2id,
                               text[index + ARBITRARY_LENGTH]))
        yield (np.array(input_list), np.array(target_list))
        i += batch_size

Great, right? We implemented things modularly, and we just need to call to_tensor in the generator function. Now all we have to do is define and train the model.

WRONG!

This implementation (at least for 100 context characters) would take TEN HOURS per epoch to train on Google Colab.

That is without multiprocessing, even with a Tesla K80 GPU.

So I had to go back and modify the data generator.

What did I do? Well, multiprocessing in Keras requires a Sequence-style class rather than a plain generator. (Otherwise you can get duplicated data across workers.) Think of it as implementing a list.

Luckily, the next batch doesn’t depend on the previous batch here.
Note that I forgot to turn off shuffling for the validation generator.
In this case, I set the validation size just right so it is a multiple of the batch size.
You should be more careful than I am and add a shuffle boolean parameter to the constructor (see the sketch just after the generator test below).

import tensorflow

class DataGenerator(tensorflow.keras.utils.Sequence):
    'Generates data for Keras'

    ARBITRARY_LENGTH = 32

    def __init__(self, char2id, text, batch_size):
        'Initialization'
        self.char2id = char2id
        self.text = text
        self.batch_size = batch_size

        self.l = len(self.text) - DataGenerator.ARBITRARY_LENGTH - 1
        self.picker = np.arange(0, self.l)
        self.on_epoch_end()
        self.lenchar2id = len(char2id)

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(self.l / self.batch_size))

    def __getitem__(self, batch_index):
        'Generate one batch of data'
        start_index = batch_index * self.batch_size
        input_arr = np.zeros((self.batch_size,
                              DataGenerator.ARBITRARY_LENGTH,
                              self.lenchar2id))
        target_arr = np.zeros((self.batch_size, self.lenchar2id))
        for j in range(self.batch_size):
            index = self.picker[start_index + j]
            for k, e in enumerate(self.text[index : index + DataGenerator.ARBITRARY_LENGTH]):
                input_arr[j, k, self.char2id[e]] = 1
            # end for
            target_arr[j, self.char2id[self.text[index + DataGenerator.ARBITRARY_LENGTH]]] = 1
        # end for
        return input_arr, target_arr

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        np.random.shuffle(self.picker)

# create objects
BATCH_SIZE = 32
# Just so happens (1025 - 1 - ARBITRARY_LENGTH) % BATCH_SIZE = 0
VAL_SIZE = 1025
train_gen = DataGenerator(char2id, text[:-VAL_SIZE], batch_size=BATCH_SIZE)
val_gen = DataGenerator(char2id, text[-VAL_SIZE:], batch_size=BATCH_SIZE)

Since these are list-like:

# to test you can
print(train_gen[0])
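And if you do want unshuffled validation batches, one minimal sketch is a small subclass that simply skips the reshuffle (hypothetical, not in my repo):

class OrderedDataGenerator(DataGenerator):
    'Same sliding windows, but never reshuffled -- handy for validation'

    def on_epoch_end(self):
        # keep self.picker in its original 0..l-1 order
        pass

# val_gen = OrderedDataGenerator(char2id, text[-VAL_SIZE:], batch_size=BATCH_SIZE)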

So it turns out Google Colab kept dying on me, because it disconnects when I idle for too long.

So I used my AWS credits, which had been sitting and waiting to be used for over one and a half years!

Solution?

Rent a 16-vCPU compute-optimized “Elastic Compute Cloud” (EC2) instance on AWS!
It is called ‘c5n.4xlarge’.
(Unfortunately, I didn’t know until after the fact that I could use my credits for GPUs. If I had, of course I would’ve used a GPU.)

Previous train configuration:

use_multiprocessing=False
workers = 1
max_queue_size = 10

Current configuration:

use_multiprocessing=True
workers = 32
max_queue_size = 4096

Effective?

Oh right, how could I forget, I didn’t even show you the model architecture!

Model

So the actual way I am implementing this is as such:

Here we have the classic “TO BE OR NOT TO BE … THE QU” and we want to predict the next character, which is E.
Keep in mind that the weights are reused across all inputs. Recall that first picture with one strip.
(Otherwise we would basically have a feedforward network with a lot more weights!)
Also note that I have abstracted the output of the “intermediate” layer away.

This is what I implemented at first:

from tensorflow.keras import layers as KL   # KL = Keras layers
from tensorflow.keras.models import Model

input_layer = KL.Input((ARBITRARY_LENGTH, len(char2id)), name="the_input")
x = KL.LSTM(62, return_sequences=True, name="intermediate")(input_layer)
x = KL.LSTM(62, activation="softmax", name="the_output")(x)
model = Model(input_layer, x)
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

There are 62,000 parameters.
By default, a Keras LSTM does not return the output at each timestep; it only returns the last output.

I have set “return_sequences=True” for the intermediate layer to implement this kind of architecture.
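As a sanity check on the 62,000 figure: an LSTM layer with n units and m-dimensional inputs has 4 * (m*n + n*n + n) weights, since each of its four gates has input weights, recurrent weights, and a bias. Here m = n = 62 for both layers:

m, n = 62, 62                       # input dimension and LSTM units
per_layer = 4 * (m * n + n * n + n)
print(per_layer)                    # 31000
print(2 * per_layer)                # 62000 total for the two LSTM layers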

Now, does this code seem correct?
You guessed it!

WRONG!

After training for 12 epochs or so, the loss became NaN!
My gut and some research told me it might have had something to do with ‘categorical_crossentropy’ using logarithms.

It would’ve taken me 100 years to debug had I not had experience with TensorFlow.

I’m being serious!

Instead we will use relu activation:

input_layer = KL.Input((ARBITRARY_LENGTH, len(char2id)), name="the_input")
x = KL.LSTM(62, return_sequences=True, name="intermediate")(input_layer)
# used to do softmax but experimenting here
x = KL.LSTM(62, activation="relu", name="the_output")(x)
model = Model(input_layer, x)
model.summary()

But Derek, that’s wrong! You can’t train a categorical cross-entropy model with relu outputs. And you’d be right… unless TensorFlow comes to the rescue!

import tensorflow as tf

x = KL.LSTM(62, return_sequences=True, name="intermediate")(input_layer)
# used to do softmax but experimenting here
x = KL.LSTM(62, activation="relu", name="the_output")(x)
model = Model(input_layer, x)
model.summary()

def customLoss(yTrue, yPred):
    # softmax + categorical cross entropy in a single numerically stable op
    return tf.nn.softmax_cross_entropy_with_logits_v2(yTrue, yPred)

model.compile(loss=customLoss, optimizer='adam', metrics=['accuracy'])

Wow, what a very sophisticated one line custom loss function!

What we are doing is moving the softmax and the categorical cross entropy into the loss function. TensorFlow computes the two together in a numerically stable way, and this gets rid of the NaN problem. (I have read the documentation, so trust me.)

I believe that since softmax just rescales everything to sum to one (it never changes which entry is the largest), we can still do inference the same way by using argmax.
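If you want to convince yourself of that, here is a quick NumPy check with made-up logits; softmax rescales the values but never changes which one is largest:

import numpy as np

logits = np.array([2.0, -1.0, 0.5, 3.0])        # made-up raw model outputs
softmax = np.exp(logits) / np.exp(logits).sum()
assert np.argmax(logits) == np.argmax(softmax)  # both point at index 3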

When training, it is very helpful to save the weights periodically in case something goes wrong:

from tensorflow.keras import callbacks as KC   # KC = Keras callbacks

# make sure the logs directory ALREADY EXISTS
checkpt = KC.ModelCheckpoint('./logs/weights.{epoch:02d}-{val_loss:.2f}.hdf5',
                             monitor='val_loss', verbose=0, save_best_only=False,
                             save_weights_only=True, mode='auto', period=1)

Now we can train :)

VERBOSE = 2
# minus 1
EPOCH_START = 0
EPOCH_END = 256
model.fit_generator(train_gen, epochs=EPOCH_END, verbose=VERBOSE,
                    validation_data=val_gen,
                    use_multiprocessing=True,
                    initial_epoch=EPOCH_START,
                    callbacks=[checkpt], workers=32,
                    max_queue_size=4096)

The complete code can be found here:

https://github.com/chromestone/Random_ML/blob/master/rnn_character/ultimate.py

Results

It’s not the best, but I will just use the latest epoch for fun and see what happens.

For prediction you can use this:

ARBITRARY_LENGTH = 32

def predict(how_many, put_into_this, char2id, id2char, model):
    # put_into_this is a list of seed characters; generated characters are appended to it
    assert len(put_into_this) >= ARBITRARY_LENGTH

    for i in range(how_many):
        # one-hot encode the last 32 characters and predict the next one
        input_this = to_tensor(char2id, put_into_this[-ARBITRARY_LENGTH:])
        max_this = model.predict(input_this[np.newaxis])
        put_into_this.append(id2char[np.argmax(max_this)])
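For example, this is roughly how you would seed it with the Hamlet line below (the character count of 80 is my own illustration; predict appends the generated characters to the list in place):

seed = list("Get thee to a nunnery, go. Farew")  # exactly 32 characters
predict(80, seed, char2id, id2char, model)       # generate 80 more characters
print(''.join(seed))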

This is one of the most memorable lines I read from Hamlet:

Get thee to a nunnery, go. Farew…

Feeding it into the model, the model outputs this:

Get thee to a nunnery, go. Farewell, and bear to the great of the prentnged the manion, and break not the sirrant

Seems like it’s making up words. Cool!

I have a notebook for inference (with a link to the weights) in the Github repo.

https://github.com/chromestone/Random_ML/tree/master/rnn_character

Shortcomings

A pretty evident shortcoming is that the model only predicts based on the previous 32 characters. Essentially, this architecture forces the model to forget any characters before that.

Another shortcoming is all the different characters the model has to predict. We trained on every unique character in the document. We might not have enough data (98 KB) for the model to learn different capitalizations, punctuation, etc.

One of the most important shortcomings is SPARSITY. Only 1 out of the 62 inputs is a 1 and the rest are all zeros! This is not the best way to train. Although optimizers like Adam can handle this, it is still not ideal.

We will see in future articles how embeddings decrease sparsity in order to get a compact and meaningful representation.
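As a rough preview of what that might look like (a hypothetical sketch reusing the KL and Model imports from above, not the model trained in this article): you feed integer character ids instead of one-hot vectors, and an Embedding layer learns a dense representation for each character.

# hypothetical sketch: character ids -> learned 16-dimensional vectors -> LSTM
input_layer = KL.Input((ARBITRARY_LENGTH,), name="char_ids")           # integer ids, not one-hot
x = KL.Embedding(input_dim=len(char2id), output_dim=16)(input_layer)   # dense, non-sparse inputs
x = KL.LSTM(62, name="the_output")(x)
model = Model(input_layer, x)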

More predictions

Hamlet

To be, or not to be, that is the title the heavens and the stranger and conceeding, and make her mine
To countles, I cannot speak the chaice.

Sepord count:
I have not besore to the pocked her bearse
We best be pors and are the straight and strange of the pors,
Whish sold we are too hand me the provon
And to the carron to must show an his break of the porth to the great of the prentnger hath not all the prince, I would not to the heart, he do be my to bear her have to the heart, herrell,
Whose since the worldest to the true bears the char

Romeo and Juliet

Parting is such sweet sorrow.
Thy hear son the partienst and heart that the stranger have been and man we are took
The shall of the posses and strange the contram more the marrage the postens of me not with a brains In have been that I ware the world,
Whose shall be the not and courtest case of me to the heart, have wearen the more and back with a wrounder if you would not be more mine
As the with thee the doth the surved than make a mouth,
To make the prither, and be man and hope and means this of the conce of the pritheer from thy serva

Tim

Timothy Lin is the greatest peers of the strange of the poor to the great
And take the bearth herer, if contest a speak with the world,
Whose make her man and thee all the courtier.

SILAN: I would be to the most be me been and break of my changes.

TRINA:
I have at the conceing me beave

Jessica

Jessica Qin is the greatest peers of the strange of the poor to the great
… [same thing!]

Joshua

Joshua Lo is the greatest peer me, though beseech you, I would be to the chain
To not hear the bears leman the dryath of my breathing in the great of the prithee, and be more the more my bear to speak will be man of the prithee, and be meane the great of death,
That with a brother, and b

Dr. Tu

Dr. Tu is a great machine learnion
Withon mittith these the prither, and bear to the gainas,
And let there to she base it shall be and bray do my bears of the straight and say the prect if the prove the contracted honoutlors,
And well me me, my lord, I would not be point of a speak you,
