Generating Dinosaur Names With Pytorch

(The Deep-Learning Way)

Banjoko Judah
Analytics Vidhya
9 min read · Feb 8, 2021


Introduction

I have been into Deep Learning for some time now, and the majority of my experience has been with Computer Vision tasks, where the data I've worked with is mostly images. Some months ago, I began trying my hand at sequential data (after a very long pause), and so far, it has been a great experience.

I then decided to write about the various tasks I've worked on that use this kind of data. And what better way to start than with one of the most basic and fun tasks to do: generating Dinosaur names using a sequence-to-sequence model.

In this article, we will be training an RNN-based model to generate Dinosaur names. This article expects you to have a basic understanding of Python, RNN (Recurrent Neural Network) layers, and Pytorch, or any similar deep learning framework.

The complete code used in this article resides in this Github repository. With that, let’s get started.

Explanation

So, what exactly are sequence-to-sequence models? They are models that are commonly built upon RNN layers, with the basic workflow of taking in a sequence of data and generating a different sequence that is somehow connected to the input. For our task, the input is a sequence of characters, and the generated output is the next character in the input sequence.

Multiple diagrams of a next-character prediction model

In the image above, the series of squares (altogether) makes up our model. The arrows below the model (representing our sequence of data), together with the leftmost arrow (the initial state of our model), form the input to the model. The arrows above the model (the predicted next values), together with the rightmost one (the current state of the model), make up the output of the model.

So, what do I mean when I say “the state of the model”? Well, you could think of it as an encoding of the information the model has been able to extract from the input. It is common to set the initial state of the model to zeros since it is yet to process any of the input data. And as the model processes more of the input (moving from left to right), its state is continually being updated (represented by the intermediate right-pointing arrows) until the end of the input data.

A diagram combining the smaller next-character prediction models

The above image is a combination of the diagrams from the previous image. Instead of starting the entire process from the beginning for each output, we continue from where the previous one stopped, since the hidden state is the same, and process everything in one go. Note that each output value is still predicted based only on the history of the sequence before that point, just as before.

Before we start, let’s import the packages we will be using.
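A minimal sketch of the imports this walkthrough relies on (the repository may import a few more):

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader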

Here is an overview of what we are going to be doing:

  1. Preprocessing data
  2. Converting and loading data
  3. Defining model
  4. Training model
  5. Generating samples
  6. Conclusion

Preprocessing

The first thing we’ll be doing is to get and preprocess our data. You can download the data we’ll be using here. It is a .txt file that contains each Dinosaur name on a new line. It is relatively clean, so preprocessing it is quite straightforward.

We start by loading the file, then we read and convert it to lowercase characters. Next, we split it by each line, which gives us a list of Dinosaur names. Finally, we split up each name into a list of its characters and append an EOS (End Of Sequence) token ("<EOS>") to the list.
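As a rough sketch of those steps (the file name dinos.txt is an assumption; point it at wherever you saved the data):

# Load the file, lowercase it, split it into lines, then split each
# name into characters and append the EOS token.
with open("dinos.txt") as f:
    text = f.read().lower()

names = [list(name) + ["<EOS>"] for name in text.splitlines() if name]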

The first name in the file “Aachenosaurus” will then be:

["a", "a", "c", "h", "e", "n", "o", "s", "a", "u", "r", "u", "s", "<EOS>"]

We add the EOS token for when we want to generate new names. The idea is that our model learns to end a name with this token, and when we generate new ones, this token signals the end of that name. Without it, we won't know when our model is done generating a name, and we might end up stopping it too early or too late. If the last two sentences didn't make much sense to you, hold on; they will later on.

After all of that, we create our vocabulary, which is a list of all the unique characters (and tokens) our model will be trained to handle. Then we create two dictionaries: the first maps each item in our vocabulary to a unique integer, while the other does the reverse. We will be using them soon.
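A sketch of the vocabulary and the two dictionaries (the names char2int and int2char are my own):

# Every unique character/token our model will handle.
vocab = sorted(set(ch for name in names for ch in name))

# Map each item in the vocabulary to a unique integer, and back again.
char2int = {ch: i for i, ch in enumerate(vocab)}
int2char = {i: ch for ch, i in char2int.items()}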

Data Loading

Next on our TODO list is converting and loading our data. This happens in two parts: first, we convert our data from strings (characters) to integers, since our model understands only numbers; second, we set up the logic for how we will load the data to train our model. We will use Pytorch's Dataset and DataLoader classes to handle both.

First, we define a Python class that inherits from the Dataset class. This class will be responsible for fetching samples from our data. Next, we define the __init__ method, which takes in our already preprocessed data and converts each character to an integer using one of the dictionaries we created previously. Now our data is of a type our model can understand.

We are not done yet though. Pytorch’s Dataset class requires that we define two other methods. These are the __len__ method (which returns the length of our data), and the __getitem__ method (which returns a sample at a particular index in our data). Implementing the __len__ method is as simple as calling Python’s built-in function, len, on our data and returning the value. So that leaves us with the __getitem__ method.

The __getitem__ method automatically receives the index we are interested in as an argument. We then get the sample at that index in our data, slice out X (from the first value up to the second-to-last value) and Y (from the second value up to the last value), convert them to tensors, and return them.
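Putting the three methods together, the Dataset subclass might look roughly like this (the class name NamesDataset is an assumption):

class NamesDataset(Dataset):
    def __init__(self, names, char2int):
        # Convert each character of each name to its integer id.
        self.data = [[char2int[ch] for ch in name] for name in names]

    def __len__(self):
        # The number of names in our data.
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        x = torch.tensor(sample[:-1])  # everything except the last value
        y = torch.tensor(sample[1:])   # everything except the first value
        return x, y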

Lastly, we create a DataLoader object. This object provides a nice wrapper over our dataset that lets us do things like shuffling much more easily, as well as more complex things like combining multiple datasets or using multiple workers to load our data. Luckily, we will not be dealing with any of the complex cases here. We finish by setting batch_size to 1 and shuffle to True.
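For example (again, the variable names are my own):

dataset = NamesDataset(names, char2int)
loader = DataLoader(dataset, batch_size=1, shuffle=True)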

With that, we are ready to work on our model.

Model

Our model is pretty standard. It starts with an embedding layer, followed by an LSTM layer (a variant of RNN layers), then a dropout layer, and finally, a Linear (fully connected) layer. I will describe briefly what each layer does.

Currently, the way characters are represented in our dataset just does not cut it. For example, a and b are represented as 1 and 2 respectively. This suggests to our model that b is twice a and that a is smaller than b. We know that for our current task this is not true and does not even make sense, but that is what our model will make of it. A 1-dimensional (single-number) representation is just not enough.

This brings us to the first layer of our model, the embedding layer. This layer is responsible for learning a multi-dimensional representation of each item in our vocabulary that best describes them to our model in the context of our current task. You can think of each dimension as being a feature.

The next layers, the LSTM layer together with the Linear layer, are responsible for predicting the next character given a sequence of characters, although the LSTM layer does most of the heavy lifting. You could also use other variants of RNN layers, like GRU or vanilla RNN, in place of the LSTM layer if you wish; it does not matter much here.

After defining the layers, we move on to the forward method, which puts everything we've just talked about together. It takes as input X and the previous state of our model, passes them through each of the layers, and finally returns its prediction and the current state of the model. Each time we call our model (model()), this method is called implicitly. Now, remember that I mentioned at the beginning that we set the initial state of our model to zeros? Well, that is what the init_state method is for.
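Here is a sketch of such a model; the layer sizes (embedding_dim, hidden_size) and the dropout probability are illustrative choices, not necessarily the values used in the repository:

class NameGenerator(nn.Module):
    def __init__(self, vocab_size, embedding_dim=32, hidden_size=128, dropout=0.3):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, state):
        # x: (batch, seq_len) tensor of integer character ids.
        out = self.embedding(x)
        out, state = self.lstm(out, state)
        out = self.dropout(out)
        return self.fc(out), state

    def init_state(self, batch_size=1):
        # The initial hidden and cell states are all zeros.
        return (torch.zeros(1, batch_size, self.hidden_size),
                torch.zeros(1, batch_size, self.hidden_size))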

Training

Using our model as it is now would, of course, generate nonsense, as it currently has no idea what we want it to do. We have to train it before it can become useful to us. The training process in Pytorch usually follows a similar pattern regardless of the task, and I will run through it now.

  • First, we zero out the gradient of our model’s parameters
  • Then we initialize the state of our model and perform a forward pass through it
  • Next, we calculate the loss and make a backward pass through our model
  • Finally, we clip the gradients derived from the backward pass and update our model’s parameters
  • This cycle is then repeated many times.

We train our model with the Cross-Entropy loss, a gradient clipping value of 0.25, and the Adam optimizer for updating our model's parameters. We also keep track of the loss after each iteration and store it in a list. If you're confused about the difference between an iteration and an epoch, you can read about it here.
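A sketch of the training loop following those steps; the learning rate and the number of epochs here are assumptions, while the clip value of 0.25, the Cross-Entropy loss, and the Adam optimizer come from the text above:

model = NameGenerator(vocab_size=len(vocab))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

losses = []
for epoch in range(40):  # epoch count is an assumption
    for x, y in loader:
        optimizer.zero_grad()                  # zero out the gradients
        state = model.init_state(x.size(0))    # fresh initial state
        preds, state = model(x, state)         # forward pass
        loss = criterion(preds.view(-1, len(vocab)), y.view(-1))
        loss.backward()                        # backward pass
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)
        optimizer.step()                       # update the parameters
        losses.append(loss.item())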

Sampling

After training for 60,000 iterations, our loss is now reduced to about 1.1671.

We have finally arrived at the point we’ve been training for. It’s time to generate Dinosaur names. The way we sample our model is similar to the description of a sequence-to-sequence model I gave at the beginning of this article, except with a few modifications.

We start by initializing our model with a seed; this could be a single character or a list of characters. Then we take the output of the last timestep from our model and randomly choose a character from the top-k most probable characters. The chosen character, together with the current state of the model, is then used to predict the next value in the sequence. This process is repeated until the randomly chosen character is the EOS token or until the name reaches a length that we specify.

The reason we choose characters at random is to introduce some diversity and randomness into the samples we generate. If we selected the most probable character each time, we might get stuck in a loop where the model keeps generating the same sequence of characters repeatedly.
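A rough sketch of that sampling loop; the function name, the top_k value, and max_len are assumptions:

def generate(model, seed, char2int, int2char, top_k=5, max_len=30):
    model.eval()
    chars = list(seed)
    state = model.init_state()
    with torch.no_grad():
        # Feed the seed through the model to build up its state.
        x = torch.tensor([[char2int[ch] for ch in chars]])
        preds, state = model(x, state)
        while len(chars) < max_len:
            # Randomly choose among the top-k most probable next characters.
            probs = torch.softmax(preds[0, -1], dim=0)
            top_p, top_i = probs.topk(top_k)
            idx = top_i[torch.multinomial(top_p, 1)].item()
            chars.append(int2char[idx])
            if chars[-1] == "<EOS>":
                break
            # Use the chosen character and the current state to predict the next one.
            preds, state = model(torch.tensor([[idx]]), state)
    model.train()
    return "".join(chars)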

Let’s see examples of this.
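With the sketch above, the calls could look roughly like this (the exact seeds and helper signature in the repository may differ):

import random

# Seed with a single, randomly chosen character.
for _ in range(10):
    seed = random.choice([ch for ch in vocab if ch != "<EOS>"])
    print(seed, "=>", generate(model, seed, char2int, int2char))

# Seed with a list of characters.
for _ in range(3):
    print("python =>", generate(model, "python", char2int, int2char))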

This outputs:

>>> Samples where the seed is a randomly chosen character.
n => nicronyx<EOS>       r => reptitan<EOS>
b => bitcodon<EOS>       t => tariusaurus<EOS>
g => goblinodon<EOS>     v => viceratops<EOS>
a => anteletops<EOS>     w => wagnoraptor<EOS>
o => optimimus<EOS>      w => walkiesaurus<EOS>

And, according to our model, if there is a Python-like Dinosaur, it would probably be called any one of these:

>>> Samples where the seed is a list of characters.
python => pythonyx<EOS>
python => pythonykus<EOS>
python => pythongovenator<EOS>

Conclusion

In this article, we talked about sequence-to-sequence models and the process of using them to generate Dinosaur names. The methods we discussed here are quite general and can be used to generate almost anything: human names, song lyrics, cryptocurrency names, and even music.

One exciting idea you could try out is conditioned generation of Dinosaur names. It is similar to what we just did, except that instead of randomly coming up with a name, we consider features such as whether the Dinosaur is a carnivore, whether it flies, or whether it is aquatic, and based on this information, we generate a much more appropriate name.

Another idea you should consider is batching your data, which I talked about in this article. It has the effect of reducing your training time and increasing the chance of your model converging early.

Thanks for reading!

