Word vs. Character Text Generation

Jake Tauscher
4 min read · Jul 27, 2020


Writing about GPT-3 last week (OpenAI's new NLP text-generation model) got me thinking about text-generating models. Specifically, I was interested in OpenAI's choice to use a word-based model. I had never written a word-based generative model before, only a character-based one.

So, I thought I would write a (much much more basic) model of my own, to compare word vs. character text generation models!

Note that these are very simplified examples — this is not intended to show precisely how GPT-3 works.

Word vs. Character Generation

When generating original text, or original anything, through code, one of the main choices you face is what “unit” you will use. In language, the two obvious choices are characters and words. Basically, your model will take in a list of characters or words, and try to predict what comes next. So, let’s say you had a sentence:

“My dog runs fast. He is a good boy.”

In a word generative model, your training data would look something like:

my dog runs -> fast | he is a -> good

In a character generative model, your training data would look like:

my d -> o | g ru -> n | s fa -> s.
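To make that concrete, here is a rough sketch (plain Python, not the actual preprocessing I used) of how you might slice that sentence into (context, next-unit) pairs for each approach. The window sizes are arbitrary, just to illustrate the idea.

```python
# A rough sketch of turning a sentence into (context, next-unit) training pairs.
sentence = "My dog runs fast. He is a good boy."

# Word-level pairs: the model sees a few words and predicts the next word.
words = sentence.lower().replace(".", "").split()
word_pairs = [(words[i:i + 3], words[i + 3]) for i in range(len(words) - 3)]
# e.g. (['my', 'dog', 'runs'], 'fast'), (['dog', 'runs', 'fast'], 'he'), ...

# Character-level pairs: the model sees a few characters and predicts the next character.
chars = list(sentence.lower())
char_pairs = [(chars[i:i + 4], chars[i + 4]) for i in range(len(chars) - 4)]
# e.g. (['m', 'y', ' ', 'd'], 'o'), (['y', ' ', 'd', 'o'], 'g'), ...

print(len(word_pairs), "word-level examples vs.", len(char_pairs), "character-level examples")
```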

Intuitively, having read both of those, word generation seems like the obvious choice! It is way more similar to how our minds work. The character generation training data looks, frankly, like nonsense.

However, word generation has two downsides that I can see, compared to character generation.

The first downside to word generation models is that there is less training data. When you consider each character a separate unit, you get roughly 4–5x more training examples (basically, the length of your average word) out of the same text. Depending on your dataset, this can be very important.

The second downside to word data is the number of potential inputs to the model. When training an AI model, we need to translate the words or characters to numbers. We do this simply by mapping each character or word to a number (A: 1, B: 2, etc.). So, for a character generative model, you have ~40 potential inputs (the letters plus some punctuation, spaces, and numbers). However, for a word generative model, you have thousands of inputs (every word!). In my word-based model below, I had 12,700 inputs. This makes the model much larger, since we need to learn parameters for every one of those inputs, which in turn makes it harder to train.
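The mapping itself is nothing fancy. Here is a sketch of what it might look like, with a toy string standing in for the real corpus:

```python
# Build character-level and word-level vocabularies from some training text.
# `text` here is a tiny stand-in for whatever corpus you actually train on.
text = "my dog runs fast. he is a good boy."

# Character vocabulary: one integer per unique character (letters, spaces, punctuation, digits).
# On a real corpus this stays around ~40 entries.
char_to_idx = {ch: i for i, ch in enumerate(sorted(set(text)))}

# Word vocabulary: one integer per unique word. On a real corpus this is where
# the thousands of inputs come from (12,700 in my word-based model).
word_to_idx = {w: i for i, w in enumerate(sorted(set(text.split())))}

print("character inputs:", len(char_to_idx))
print("word inputs:", len(word_to_idx))
```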

Finally, a minor limitation of my word model is that you will lose all formatting (punctuation, line breaks, etc.). This is because it is a simple model that does not include punctuation in its tokens. You could include punctuation as tokens if you wanted (and GPT-3 clearly captures punctuation), but my approach is simpler.

However, as you have probably picked up, the limits of word generation are not limits for GPT-3. Limited training data? They trained on the whole internet. Model gets too big? They trained with enormous processing power. So, it makes sense that they made that choice, but for me and my limited processing power, character generation might work better. I will let you be the judge!

The Model

A brief aside on the model. I used a basic LSTM structure, written in PyTorch, with 3 LSTM layers, a dropout layer, and a fully connected layer. Really quickly on LSTMs: they are a form of RNN, or recurrent neural network. An RNN is a neural network in which the neurons also have a memory state, so they can make predictions based not just on the current input, but on the memory of prior inputs as well.

Basically, an RNN could receive the same input data at different times and return different predictions; a typical feed-forward neural network would not do this.

Anyway, there is lots of great reading on the internet if you are interested in learning more about RNNs (I am certainly not an expert!). But what matters for us is that they are useful for making predictions on time-series data, as well as other sequential data, like our writing.
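For the curious, here is a minimal PyTorch sketch along the lines of the architecture described above (3 LSTM layers, dropout, and a fully connected layer). The layer sizes are placeholders rather than my actual hyperparameters, and I have used an embedding layer for the inputs, which you could swap for one-hot vectors.

```python
import torch
import torch.nn as nn

class TextLSTM(nn.Module):
    """A minimal word- or character-level LSTM: 3 LSTM layers, dropout, and a fully connected output layer."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=3, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, vocab_size)  # a score for every token in the vocabulary

    def forward(self, x, hidden=None):
        # x: (batch, seq_len) of token indices; `hidden` carries the LSTM's memory between calls
        out, hidden = self.lstm(self.embed(x), hidden)
        out = self.dropout(out)
        return self.fc(out), hidden
```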

The Data

For this experiment, I thought it would be fun to train on the complete King James Bible. It is very long (lots of good training data!) and probably familiar to a lot of people.

Then, after training the model, I gave it a "seed" to get it started (in both the word and character models, the seed was "The"), and the model generated original text from there. See below for the results!
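For completeness, the generation step looks roughly like this. It is a sketch that assumes the TextLSTM class and token-to-index mapping from the earlier snippets, plus a reverse idx_to_token lookup.

```python
import torch

def generate(model, seed_tokens, token_to_idx, idx_to_token, length=100):
    """Feed a seed into the trained model, then repeatedly sample the next token."""
    model.eval()  # turn off dropout for generation
    tokens = list(seed_tokens)
    hidden = None
    # Warm up the hidden state on the seed.
    x = torch.tensor([[token_to_idx[t] for t in tokens]])
    with torch.no_grad():
        logits, hidden = model(x, hidden)
        for _ in range(length):
            # Sample the next token from the model's predicted distribution.
            probs = torch.softmax(logits[0, -1], dim=-1)
            next_idx = torch.multinomial(probs, 1).item()
            tokens.append(idx_to_token[next_idx])
            logits, hidden = model(torch.tensor([[next_idx]]), hidden)
    return " ".join(tokens)  # for the character model, join with "" instead

# e.g. print(generate(model, ["the"], word_to_idx, idx_to_word, length=50))
```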

[Image: Output of the Word-Based Model]
[Image: Output of the Character-Based Model]

Pretty interesting! First, and most obviously, these are both nonsense. Nothing remotely divine came out of my computer. However, I think both have a cadence and word choice that is clearly reminiscent of the Bible.

And, as you can see, the character-based model did a great job of picking up the format of the training text, as it is split into numbered verses (even though those verses are out of order).

The word-based model, while not as interesting visually, has the advantage that everything in it is a real word (I don't know what "destrongeth the choldren of Israel" is trying to say in the character model).

Anyway, just a fun experiment, but interesting as a baseline for understanding text generation!
