A nostalgic trip down Autoregressive / Timeseries RNN lane — Predictive Text Generation

Kaustav Mandal · Published in exemplifyML.ai
5 min read · Sep 18, 2022

Text generation using autoregressive / timeseries RNN variants trained on IMDB movie reviews

Autoencoders:

An auto-encoder is a type of neural network used for learning from unlabeled data, i.e., a network used for unsupervised learning.
In very general terms, an auto-encoder takes an input (x) and attempts to replicate it under some constraints, which forces it to learn the essential characteristics of the input.
These models can then be used for downstream tasks such as image sorting.
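As a minimal sketch of the idea (the layer sizes below are illustrative assumptions, not anything used later in this article), an auto-encoder in PyTorch is simply an encoder that compresses the input and a decoder that tries to reconstruct it:

import torch
import torch.nn as nn

# Minimal auto-encoder sketch: compress the input to a small latent vector,
# then try to reconstruct the original input from that vector.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training minimizes the reconstruction error between the output and the input itself.
model = AutoEncoder()
x = torch.rand(16, 784)              # e.g. a batch of flattened images
loss = nn.MSELoss()(model(x), x)     # the input doubles as the target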

Figure 1 — High level view of Encoder / Decoder functions (Image by Author)

A good analogy for auto-encoders is the well-known PCA technique, which reduces the dimensionality of the input.
In order to reduce the dimensions while minimizing the loss of information, it has to pick the directions/features that contribute the most to the characteristics of the input.
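For comparison, a quick scikit-learn sketch of PCA as a lossy "encoder" (the data and component counts are placeholders, not values used elsewhere in this article):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 50)                    # 200 samples with 50 features
pca = PCA(n_components=5)                      # keep the 5 directions with the most variance
X_reduced = pca.fit_transform(X)               # compressed representation, shape (200, 5)
X_restored = pca.inverse_transform(X_reduced)  # lossy reconstruction, shape (200, 50)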

Autoregressive RNN Models:

An auto-regressive RNN is a model that is trained by comparing the output at each timestep (t) with the input at timestep (t + 1).
In other words, the input at timestep (t + 1) becomes the label for timestep (t) when computing the loss at timestep (t).
At inference time, the prediction generated at timestep (t) becomes the input for timestep (t + 1).

These models are well suited to sequential problems such as chatbots or language translation (Seq2Seq models). The model is called auto-regressive because the output at each timestep depends on the previous inputs fed into the RNN.
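In code, the training pairs are simply the same token sequence shifted by one position (a toy example with made-up token ids):

import torch

tokens = torch.tensor([11, 42, 7, 95, 3, 18])   # one encoded review chunk
inputs = tokens[:-1]    # timesteps t      -> [11, 42, 7, 95, 3]
labels = tokens[1:]     # timesteps t + 1  -> [42, 7, 95, 3, 18]
# the loss at timestep t compares the prediction for inputs[t] with labels[t]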

Figure 2 — Basic architecture of a many to many autoregressive RNN in training mode (Image by Author)
  • In our example, we will train an auto-regressive model to generate text based on an input seed.
  • The model could be either a character-based or a word-based model, each with its own benefits and trade-offs.
    I chose a word-embedding-based model for faster training at the expense of memory, as vocabulary sizes can get huge.
  • For a small memory footprint, character-based models are better, though they take longer to train. In addition, they generalize better across languages.
Figure 3 — Autoregressive RNN architecture — prediction at each timestep based on previous inputs plus stochastic variables (Image by Author)

Autoregressive RNN in PyTorch:

PyTorch has built-in RNN cells (e.g. nn.LSTMCell) which we can leverage to build our own layered autoregressive RNNs.

For this tutorial, we will use the IMDB reviews dataset from Kaggle and generate a movie review from the trained model.

  • Autoregressive LSTM model — Based on the examples in the book ‘Inside Deep Learning’

Steps:

  • 1. Encode each review into word indices and concatenate all of the encoded reviews into a single list inside the dataset.
self.encoded_reviews_concat.extend(encoded_input_review)
  • 2. For the dataset’s length (‘__len__’), return the length of the concatenated encoded list divided by the chunk size. The chunk size is commonly known as the ‘sequence length’.
def __len__(self):
    return (len(self.encoded_reviews_concat) - 1) // self.seq_length
  • 3. For each ‘__getitem__’ call on the dataset, retrieve and return a chunk of the concatenated list of input vectors.
    Note that the labels returned from ‘__getitem__’ are the input chunk offset by one position.
import torch

def __getitem__(self, idx):
    start = idx * self.seq_length
    # labels are the same chunk shifted forward by one timestep
    return (torch.tensor(self.encoded_reviews_concat[start:start + self.seq_length]),
            torch.tensor(self.encoded_reviews_concat[start + 1: start + 1 + self.seq_length]))
  • 4. After training, the prediction at each timestep is generated by feeding back the predicted output from the previous timestep, as sketched in the decoding loop below Figure 4.
Figure 4 — Inference timesteps for an autoregressive RNN (Image by Author)
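A minimal decoding sketch for this step (it assumes a trained `model` that maps a batch of word ids of shape (batch, seq_len) to logits of shape (batch, seq_len, vocab_size); the function name and interface are illustrative, not the exact code used for the outputs below):

import torch

@torch.no_grad()
def generate(model, seed_ids, max_new_tokens=50):
    generated = list(seed_ids)                       # e.g. the encoded seed ‘Movie is great.’
    for _ in range(max_new_tokens):
        inp = torch.tensor(generated).unsqueeze(0)   # (1, current_length)
        next_logits = model(inp)[0, -1]              # logits for the next word
        probs = torch.softmax(next_logits, dim=-1)
        next_id = torch.multinomial(probs, 1).item() # sample the next word id
        generated.append(next_id)                    # fed back in at the next timestep
    return generated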
  • 5. Hyperparameters and other arguments used:
    - embedding size: 512
    - LSTM hidden size: 1024
    - Number of LSTMCell layers: 4
    - Batch size: 128
    - Sequence/Chunk length: 32
    - Dataset — Kaggle sourced IMDB reviews — 8K from each sentiment class (positive, negative)
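Putting those hyperparameters together, a minimal sketch of the model (an embedding layer feeding a stack of nn.LSTMCell layers and a projection to vocabulary logits; the class and attribute names are my own simplification, not the exact code from ‘Inside Deep Learning’):

import torch
import torch.nn as nn

class AutoRegressiveLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=1024, num_layers=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # the first cell consumes embeddings, the rest consume the layer below
        self.cells = nn.ModuleList(
            [nn.LSTMCell(embed_dim if i == 0 else hidden_dim, hidden_dim)
             for i in range(num_layers)])
        self.hidden_dim = hidden_dim
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):                            # x: (batch, seq_len) of word ids
        batch, seq_len = x.shape
        h = [torch.zeros(batch, self.hidden_dim, device=x.device) for _ in self.cells]
        c = [torch.zeros_like(t) for t in h]
        embeds = self.embedding(x)                   # (batch, seq_len, embed_dim)
        outputs = []
        for t in range(seq_len):
            inp = embeds[:, t]
            for i, cell in enumerate(self.cells):
                h[i], c[i] = cell(inp, (h[i], c[i]))
                inp = h[i]
            outputs.append(self.fc(inp))             # logits for the word at timestep t + 1
        return torch.stack(outputs, dim=1)           # (batch, seq_len, vocab_size)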

Text Generation Output (Inference):

Text generation (predictions) based on the seed — ‘Movie is great.’

Note: As the sequence length increases, the predictions also become less coherent, since the prediction at the nᵗʰ time-step is based on all of the previous predictions.
As this model was trained on positive and negative reviews, it displays characteristics from both the sentiment classes.

A fictional sample output from the trained autoregressive LSTM is illustrated below.

movie is great . if you want to see an example of what great acting is all about , and be hugely entertained all the while , then i encourage you to see the dresser . i won ' t comment on the story , simply because i don ' t want to ruin it for a very long time . i do remember it for the one minutes of the cinema here . the first 40 minutes were very funny and funny . i would have been inclined to give this a zero if i could , because they didn ' t even have the guts to call it by it ' s full name ' . i have seen better singaporean movies than this . chicken rice war was good .

the fact that i did not like the music is a very personal opinion , the historical innacuracies are not . i do realize that it is an opera and not a documentary .
...
...

Visualization of sentence chunk clusters:

As I had selected a small subset of 1000 samples from the input chunks, and the number of clusters was not known in advance, I decided to go with the Affinity Propagation clustering algorithm.

Scikit-learn has a good reference on the different types of clustering algorithms — here. It also has a map for picking the right algorithm for classification, clustering, and dimensionality reduction — here.
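A rough sketch of that clustering step (the chunk vectors and the 2-D projection here are stand-ins; the exact embedding and scaling used for the figure below are not detailed in this article):

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.decomposition import PCA

# one fixed-length vector per sampled sentence chunk, e.g. the mean of its
# word embeddings; random placeholder data stands in for them here
chunk_vectors = np.random.rand(1000, 512)

# Affinity Propagation picks the number of clusters on its own, which is handy
# when the cluster count is not known up front
cluster_ids = AffinityPropagation(random_state=0).fit_predict(chunk_vectors)

# scale down to 2 dimensions for plotting, as in Figure 5
points_2d = PCA(n_components=2).fit_transform(chunk_vectors)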

Illustrated below is the Affinity Propagation clustering of the 1000 sampled chunks from the dataset.

Figure 5 — Clustering of the review chunks scaled down to 2 dimensions (Image by Author)

Looking at cluster 30 in the figure above, some of the sentences look similar, at least at the context level; there is also an entity name present in those chunks.

Figure 6 — Some sample sentences from the chunks assigned to cluster 30 (center red cluster) (Image by Author)
