Data Generators for RNN Language Modeling

James Moody
2 min read · Jun 8, 2020


This article is a continuation of a previous article.

In the previous article, we discussed how to vectorize text data and store minibatches of this vectorized data as sparse tensors, for efficiency. In this article, we take a look at how to create a generator for feeding this data into an RNN.

The first step is to create a function to load a particular minibatch from storage into memory. In this case we are reading from a file path, but this could easily be modified to read from a database:
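A minimal sketch of such a function, assuming the three arrays for each minibatch were saved together with np.savez (the storage format and the exact signature here are assumptions), might look like this:

```python
import numpy as np
import tensorflow as tf

def load_minibatch(path, var):
    # Load the "shape", "indices", and "values" arrays for one minibatch
    # (assumed here to have been saved together with np.savez).
    with np.load(path) as data:
        sparse = tf.sparse.SparseTensor(
            indices=data["indices"].astype("int64"),
            values=data["values"],
            dense_shape=data["shape"].astype("int64"),
        )
    # Rebuild the dense tensor and assign it to the pre-allocated
    # variable "var", which has the matching shape.
    var.assign(tf.cast(tf.sparse.to_dense(sparse), var.dtype))
    return var
```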

Recall that our sparse tensors were stored via their “shape” together with arrays for “indices” and “values”. We use tf.sparse.SparseTensor to reconstruct a TensorFlow SparseTensor from these three pieces of data, and then tf.sparse.to_dense to convert it to a dense tensor. The result is assigned to a variable “var” of the appropriate shape, which is passed in as an argument.

Now we can use this “load_minibatch” to create a batch generator:
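A sketch of what this generator might look like, assuming the minibatches are stored on disk as pairs of files for X and Y and that the LSTM has a hidden size of 512 (the path bookkeeping and the hidden size are assumptions), is below:

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 500    # sequences per minibatch
SEQ_LENGTH = 30     # time steps per sequence
VOCAB_SIZE = 10002  # dimension of the one-hot word vectors
LSTM_UNITS = 512    # hypothetical hidden size of the LSTM

def batch_generator(minibatch_paths):
    # Pre-allocated variables that load_minibatch fills in on each step.
    X = tf.Variable(tf.zeros((BATCH_SIZE, SEQ_LENGTH, VOCAB_SIZE)))
    Y = tf.Variable(tf.zeros((BATCH_SIZE, SEQ_LENGTH, VOCAB_SIZE)))
    # Initial hidden and cell states for the LSTM are all zeros.
    h0 = np.zeros((BATCH_SIZE, LSTM_UNITS), dtype="float32")
    c0 = np.zeros((BATCH_SIZE, LSTM_UNITS), dtype="float32")
    while True:  # Keras expects the generator to loop indefinitely
        for x_path, y_path in minibatch_paths:
            load_minibatch(x_path, X)
            load_minibatch(y_path, Y)
            # The model emits one output per time step, so the targets
            # are a list of 30 tensors of shape (500, 10002).
            y_list = [Y[:, t, :].numpy() for t in range(SEQ_LENGTH)]
            yield [X.numpy(), h0, c0], y_list
```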

The details of the generator will vary depending on the architecture of your RNN, but in this case we are expecting our data to be fed into an LSTM network, so in addition to the input sequences X and expected output sequences Y, we also have the initial states of the cell and hidden activations.

One important thing to point out here is that our input sequences X, which are sequences of vectorized words, have all had the zero vector prepended to them. The reason is that our language model is trying to predict the next word of the sentence, given the previous words. If we want the model to be able to guess the first word of a sentence, we can’t feed that first word in as the first sequential input to the RNN. Instead we input the zero vector (which doesn’t represent any word) as the first sequential input, and ask the RNN to guess the first word as the first sequential output. An example input sequence in X might look like [0, w1, w2, w3, …, w{n-1}], and the corresponding output sequence in Y would look like [w1, w2, w3, …, wn].
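As a concrete toy illustration of that shift (the array names here are hypothetical):

```python
import numpy as np

VOCAB_SIZE = 10002
# One-hot vectors for a single toy sentence [w1, w2, ..., wn].
words = np.eye(VOCAB_SIZE, dtype="float32")[[17, 4, 256, 9]]

x_seq = np.vstack([np.zeros((1, VOCAB_SIZE), dtype="float32"),
                   words[:-1]])  # [0, w1, w2, ..., w{n-1}]
y_seq = words                    # [w1, w2, ..., wn]
```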

One quirk you might have noticed is that while X is a tensor of size (500, 30, 10002), Y is a list of length 30 of tensors of size (500, 10002). The reason is that our RNN model returns a sequence of outputs, for technical reasons to do with ease of sampling from the resulting language model. If your model returns a single tensor instead, you would modify this as necessary.

Finally, if you are using Keras, you can use this generator to train your model as follows:
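Assuming model is an already-compiled Keras model whose inputs and outputs line up with what the generator yields (the epoch count and the minibatch_paths list below are placeholders), that could look like:

```python
# "minibatch_paths" is the list of (x_path, y_path) pairs on disk.
gen = batch_generator(minibatch_paths)

model.fit(
    gen,
    steps_per_epoch=len(minibatch_paths),  # one pass over all stored minibatches
    epochs=10,
)
```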

The advantage of using a generator is that you can work with a large dataset that cannot fit into memory all at once.

Hope this gives you some good ideas!
