Text Generation using Bidirectional LSTM and Doc2Vec models 1/3

The purpose of this article is to discuss text generation using machine learning approaches, especially Recurrent Neural Networks (RNNs).

It is not the first article on the subject, and probably not the last. There is a lot of literature about text generation using “AI” techniques, and code is available to generate text from existing novels, trying to create new chapters for great successes like “Game of Thrones” or “Harry Potter”, or a completely new theater scene in the style of Shakespeare. Sometimes with interesting results.

These approaches mainly use standard RNNs such as LSTM (Long Short-Term Memory) networks, and they are pretty fun to experiment with.

However, the generated texts feel unfinished.

Generated sentences seem to be quite right, with correct grammar and syntax, as if the neural network understood the structure of a sentence correctly. But the new text as a whole does not make much sense. And sometimes it is complete nonsense.

This result could come from the approach itself, which uses only an LSTM to generate text, word by word. But how can we improve on it?

In this article, I will investigate a slightly different way to generate sentences in a text generator. It does not mean that I will use something completely different from LSTM: I will not. I will use LSTM networks to generate sequences of words. However, I will try to go further than a classic LSTM neural network and use an additional neural network (an LSTM again) to select the best phrases during text generation.

Seems confusing? This article is also a tutorial (in addition to being an experiment), so I will provide step-by-step insights into what I am trying to do.

It will describe:
 1. how to train a neural network to generate sentences (i.e. sequences of words), based on existing novels. I will use a bidirectional LSTM architecture for that.
 2. how to train a neural network to select the best next sentence for a given paragraph (i.e. a sequence of sentences). I will also use a bidirectional LSTM architecture, in addition to a Doc2Vec model trained on the same target novels.

Note about data inputs:
Regarding training data, I will not use texts that are not free in terms of intellectual property. So I will not train a solution to create a new chapter of “Game of Thrones” or “Harry Potter”.
Sorry about that. There are plenty of “free” texts for such exercises, and you can dive into Project Gutenberg, which provides a huge number of texts (from William Shakespeare to H.P. Lovecraft, or other great authors).
However, I am also a French author of fantasy and science fiction. So I will use my own material, hoping it can help me in my next work!
So, I will base this exercise on “Artistes et Phalanges”, a French fantasy novel I wrote over the past 10 years, which I hope will be sufficient in terms of data inputs. It contains more than 830,000 characters, including spaces.
By the way, if you’re a French reader and fond of fantasy literature, you can find it on the Apple iBooks store and Amazon Kindle for free… Please note I also provide the data for free on my GitHub repository. Enjoy it!

1. Neural Network for Generating Sentences

Let's start! For the time being, we'll build a "standard" text generator. The objective of this first step is to generate sentences in the style of a given author.

As LSTM networks work well for this job, we will use them.

Note: the purpose of this article is not to dive deep into the description of the LSTM model. You can find very good articles about it, and I suggest you read this article from Andrej Karpathy.

You can also find existing code to perform text generation using LSTM. For example, on my GitHub, you can find two small tutorials, one using TensorFlow, and another one using Keras, which is easier to understand.

For this first part of the tutorial, I will reuse these materials, but with a few improvements:
 — Instead of a simple LSTM, I will use a bidirectional LSTM architecture. In my empirical tests, this network configuration converged faster than a single LSTM and seemed better in terms of accuracy. You can have a look at this article from Jason Brownlee for a good tutorial about bidirectional LSTM.
 — I will use Keras, which requires less effort to create the network and is more readable than conventional TensorFlow code.

What is the neural network task in our case?

LSTM (Long Short-Term Memory) networks are very good at analyzing sequences of values and predicting the next one. For example, an LSTM could be a good choice if you want to predict the very next point of a given time series.

As for sentences in texts: phrases (sentences) are basically sequences of words. So, it is natural to assume an LSTM could be useful to generate the next word of a given sentence.

In summary, the objective of our LSTM neural network will be to guess the next word of a given sentence (i.e. a sequence of words).

Here is a short example:
What is the next word of the following sentence: “the man is walking down”?
Given a dictionary containing all potential words, our neural network will take the sequence of words as seed input: 1: “the”, 2: “man”, 3: “is”, …
Its output will be a vector providing, for each word from the dictionary, the probability of being the next one in the given sequence.
Based on the training data, it might guess that the next word will be "the"…
Then, how will we generate the whole text? Simply by iterating the process. Once the next word is drawn from the dictionary, we add it at the end of the sequence. Then, we guess a new word for this new sequence… ad vitam aeternam!
[Figure: steps to generate sentences]

In order to do that, we will:

  1. read the data (the novels we want to use as inputs),
  2. create the dictionary of words,
  3. create the list of sentences, which are the inputs of our neural network,
  4. create the neural network,
  5. train the neural network,
  6. generate new sentences!

1.0 Import Libraries and Define a Few Parameters

First, we import the libraries that will be used in the tutorial. There are a few of them, notably:

  • Keras: to design the network architecture and train the model. It runs on top of TensorFlow,
  • spaCy: we will have to deal with our raw input texts: tokenize them, modify them, etc. Today, each time I have to process text in Python, I use the spaCy library, which is incredible. There are so many things you can do with it (train a named entity extraction solution, for example). However, for this tutorial, I will only use very few features of spaCy: mainly its tokenizer,
  • numpy,
  • and other useful libraries (sys, os, time, etc.)

I also define some parameters, like the directory where I store the raw texts of the novels, the directory to save my trained Neural Network models, etc.

Please note that, for this tutorial, I split my novel into txt files, one per chapter. They are all stored in the same folder.

1.1 Read Data

First, I need to create a specific function to read a list of words from my texts. I use the spaCy library to retrieve the words with its tokenizer, convert them to lowercase, and remove all newlines (\n).

I am doing that because I want to reduce the number of potential words in my dictionary, and I assume we do not have to keep capital letters. Indeed, they are only part of the syntax of the text, its "shape", and do not affect its meaning.

Then, I build a list (wordlist), containing all the words of my texts:
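
A minimal sketch of this reading step could look like the following. It assumes one UTF-8 .txt file per chapter in a single folder, and it swaps spaCy's tokenizer for a simple regular expression so that it runs with the standard library only. The names create_wordlist, read_chapters and TOKEN_PATTERN are illustrative and not taken from the original code.

```python
import re
import tempfile
from pathlib import Path

# Stand-in for the spaCy tokenizer: a token is either a word (possibly with
# inner apostrophes or hyphens) or a single punctuation mark.
# This pattern is an assumption and does not reproduce spaCy's rules.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def create_wordlist(text):
    """Return the lowercased tokens of `text`, with newlines removed."""
    return TOKEN_PATTERN.findall(text.replace("\n", " ").lower())


def read_chapters(data_dir):
    """Concatenate the tokens of every .txt file in `data_dir`, in name order."""
    wordlist = []
    for path in sorted(Path(data_dir).glob("*.txt")):
        wordlist.extend(create_wordlist(path.read_text(encoding="utf-8")))
    return wordlist


# Tiny demonstration with two throwaway "chapters".
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "chapter_01.txt").write_text("Nolan avance.\nIl grimpe !", encoding="utf-8")
    Path(tmp, "chapter_02.txt").write_text("Lothar se retourne.", encoding="utf-8")
    wordlist = read_chapters(tmp)

print(wordlist)
```

Punctuation marks are kept as tokens on purpose: the model needs them to learn where sentences end.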

Now that the whole novel and its several chapters have been transformed into a single list of words, wordlist, we can create the dictionary.

1.2 Create the dictionary

The dictionary is the list of all words contained in the texts, without duplicates. In addition, we assign an index to each word.
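
As a sketch, the dictionary can be built with the standard library alone. The toy word list and the names vocab and vocabulary_inv are assumptions made for the illustration. Ordering words by frequency is one possible convention, not a requirement.

```python
from collections import Counter

# Toy word list standing in for the tokenized novel.
wordlist = ["the", "man", "is", "walking", "down", "the", "street", "."]

# Count occurrences, then list the distinct words from most to least frequent.
word_counts = Counter(wordlist)
vocabulary_inv = [word for word, _ in word_counts.most_common()]

# Map each word to its index; vocabulary_inv maps an index back to its word.
vocab = {word: index for index, word in enumerate(vocabulary_inv)}
vocab_size = len(vocab)

print(vocab_size)
print(vocab)
```

Keeping both directions of the mapping is convenient: vocab encodes words for training, and vocabulary_inv decodes the predicted indices back into words at generation time.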

1.3 Create Sentences List

Now, we have to create the training data for our LSTM. We create two lists:

  • sequences: this list contains the sequences of words (i.e. a list of words) used to train the model,
  • next_words: this list contains the next word for each sequence in the sequences list.
In this exercise, we assume we will train the network with sequences of 30 words (seq_length = 30).

How it works: to create the first sequence of words, we take the first 30 words in the wordlist list. Word number 31 is the next word of this first sequence, and is added to the next_words list.

Then we jump by a step of 1 (sequences_step = 1 in our example) in the list of words, to create the second sequence of words and retrieve the second “next word”.

We repeat this process until the end of the list of words.

Iterating over the whole list of words, we create 175,570 sequences of words and retrieve, for each of them, the next word to be predicted.
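
The sliding window described above can be sketched as follows. The function name build_sequences and the toy sentence are assumptions. The parameter names seq_length and sequences_step follow the text.

```python
def build_sequences(wordlist, seq_length=30, sequences_step=1):
    """Slide a window over `wordlist` and collect (sequence, next word) pairs."""
    sequences, next_words = [], []
    # Stop early enough that a "next word" always exists after the window.
    for start in range(0, len(wordlist) - seq_length, sequences_step):
        sequences.append(wordlist[start:start + seq_length])
        next_words.append(wordlist[start + seq_length])
    return sequences, next_words


# Toy example with a window of 3 words instead of 30.
wordlist = "the man is walking down the street .".split()
sequences, next_words = build_sequences(wordlist, seq_length=3)

for sequence, next_word in zip(sequences, next_words):
    print(sequence, "->", next_word)
```

With a step of 1, a list of N words yields N - seq_length training pairs. A larger step produces fewer, less overlapping sequences.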

However, these lists cannot be used “as is”. If we want them to be ingested by the LSTM, we have to transform them. Indeed, "text" will not be understood by a neural net: we have to use numbers.
However, we cannot simply map a word to its index in the vocabulary, because this value does not "represent" the word: an index is an arbitrary integer, and it would suggest an ordering between words that does not exist. We have to encode each sequence of words as a matrix of booleans (1 or 0).

Create training data for our LSTM

So, we create the matrices X and y to be the data inputs of our model:

  1. X : the matrix of the following dimensions:
     • number of sequences,
     • number of words in sequences,
     • number of words in the vocabulary.

  2. y : the matrix of the following dimensions:
     • number of sequences,
     • number of words in the vocabulary.

For each word, we retrieve its index in the vocabulary, and we set its position in the matrix to 1. X and y are our training data.
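
This one-hot encoding can be sketched with numpy as follows. The function name vectorize and the toy vocabulary are assumptions made for the illustration.

```python
import numpy as np


def vectorize(sequences, next_words, vocab):
    """One-hot encode word sequences (X) and their next words (y)."""
    seq_length = len(sequences[0])
    X = np.zeros((len(sequences), seq_length, len(vocab)), dtype=bool)
    y = np.zeros((len(sequences), len(vocab)), dtype=bool)
    for i, sequence in enumerate(sequences):
        for t, word in enumerate(sequence):
            # Mark the vocabulary index of the word found at position t.
            X[i, t, vocab[word]] = True
        y[i, vocab[next_words[i]]] = True
    return X, y


# Toy data: 2 sequences of 3 words over a vocabulary of 4 words.
vocab = {"the": 0, "man": 1, "is": 2, "walking": 3}
sequences = [["the", "man", "is"], ["man", "is", "walking"]]
next_words = ["walking", "the"]

X, y = vectorize(sequences, next_words, vocab)
print(X.shape, y.shape)
```

Note that X grows as number of sequences × sequence length × vocabulary size, so a boolean dtype helps keep memory usage reasonable.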

1.4 Build the Bidirectional LSTM Model

Now, here comes the fun part: the creation of the neural network.
As you will see, I am using Keras, which provides a very good abstraction for designing a neural network architecture.

In this example, I create the following neural network:

  • bidirectional LSTM,
  • with a size of 256 and ReLU as activation. 256 may be small, but for the tutorial it is good enough,
  • then a dropout layer of 0.6 (it’s high, but necessary to avoid quick divergence).

The network should give me, for each word in the vocabulary, the probability of being the next one after a given sentence. So I end it with:

  • a simple dense layer of the size of the vocabulary,
  • a softmax activation.

I use Adam as the optimizer, and the loss is the categorical cross-entropy.
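
For reference, here is what the last layer and the loss compute, written in standard notation. The symbols are mine, not the article's: z is the output of the dense layer, V the vocabulary size, y the one-hot target and k the index of the true next word.

```latex
\hat{y}_i = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}
\qquad\text{and}\qquad
\mathcal{L}(y, \hat{y}) = -\sum_{i=1}^{V} y_i \log \hat{y}_i = -\log \hat{y}_k
```

Because y is one-hot, only the probability assigned to the true next word contributes to the loss. A model that spreads its probability uniformly over the V words has a loss of log V.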

Here is the function to build the network:

Then, I define the few parameters required by the above function and create the neural network:

1.5 Train the Model

We still have to train the model. We shuffle the training set and extract 10% of it as a validation sample. We also set a callback to save the model automatically every 2 epochs, and implement early stopping when the validation loss has not improved for 4 epochs.
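
To make the early stopping rule concrete, here is a plain-Python illustration of the logic only. It is not the Keras callback, and the validation losses below are made-up values, not the ones from this training run. The name stopping_epoch is hypothetical.

```python
def stopping_epoch(val_losses, patience=4):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is reached at epoch 3.
val_losses = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 3.4, 3.3]
print(stopping_epoch(val_losses, patience=4))
```

The patience gives the network a few epochs to recover from a temporary increase of the validation loss before training is interrupted.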

We simply run:

1.6 Generate Sentences

Great!
After a few hours of training, we now have a model trained to predict the next word of a given sequence of words.

A few remarks regarding the results (training stopped after 9 epochs):

  • the loss drops from 7.3 to 3.1, and the accuracy is around 40%,
  • the val_loss is around 5.9, with an accuracy around 13.5%.

Not “amazing”, but it should be good enough for our test. Indeed, for a given sequence of words, there is no clear determinism in the word to be chosen. So we have to be careful not to always pick the word with the highest probability, but to be able to choose another one with a high probability too.

So, in order to generate text, the task is now pretty simple:

  • we define a “seed” sequence of 30 words (30 is the number of words required by the neural net for the sequences),
  • we ask the neural net to predict the probability of each word from the dictionary to be the word number 31,
  • we choose the word number 31,
  • then we update the sequence by shifting words with a step of 1, adding word number 31 at its end,
  • we ask the neural net to predict the probability of each word from the dictionary to be the word number 32,
  • etc… For as long as we want.
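
The loop above can be sketched independently of the trained model. In this illustration, predict_next is a hypothetical callable that stands in for "ask the network for the probabilities, then choose a word". The seed is assumed to contain at least seq_length words.

```python
def generate(seed, predict_next, n_words, seq_length=30):
    """Generate `n_words` words, one at a time, from a seed sequence."""
    # Only the last `seq_length` words are shown to the predictor.
    window = list(seed)[-seq_length:]
    generated = []
    for _ in range(n_words):
        word = predict_next(window)
        generated.append(word)
        # Shift the window by one word and append the new word at its end.
        window = window[1:] + [word]
    return generated


def echo_first(window):
    """Toy predictor used only for the demonstration: repeat the oldest word."""
    return window[0]


seed = ["a", "b", "c"]
print(generate(seed, echo_first, n_words=5, seq_length=3))
```

In the real script, predict_next would encode the window as a one-hot matrix, call the model to get the probabilities, and pick a word from them.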

Doing this, we generate phrases, word by word. Here is how we can script it:

First we load the vocabulary and the model previously trained:

To improve the text generation and tune the word prediction a bit, we introduce a specific function to pick words from our vocabulary.

We will not always take the word with the highest predicted probability (or the generated text will be boring); instead, we would like to introduce some uncertainty and let the solution sometimes pick words with a lower predicted probability.

That is the purpose of the function sample(), which randomly draws a word from our vocabulary.

However, the probability for a word to be drawn will depend directly on its probability of being the next word, as given by our first bidirectional LSTM model.

In order to tune this probability, we introduce a “temperature” to smooth or sharpen its value.

  • if temperature = 1.0, the probability for a word to be drawn matches its probability of being the next one in the sequence (the output of the word prediction model), relative to the other words in the dictionary,
  • if temperature is large (much larger than 1), the range of probabilities narrows: the distribution flattens, so less likely words gain probability at the expense of the most likely ones. A greater variety of words will be picked from the vocabulary.
  • if temperature is small (close to 0), small probabilities are pushed towards 0 and the most likely words dominate. Fewer different words will be picked from the vocabulary.
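
A common way to implement such a temperature is to raise each probability to the power 1 / temperature and renormalize. The sketch below assumes that formulation and uses only the standard library. It is not necessarily identical to the sample() function used in this tutorial, and apply_temperature is a helper name introduced for the illustration.

```python
import math
import random


def apply_temperature(probabilities, temperature=1.0):
    """Rescale a distribution as p_i ** (1 / temperature), then renormalize."""
    # Work in log space; the small epsilon avoids log(0).
    scaled = [math.log(p + 1e-12) / temperature for p in probabilities]
    highest = max(scaled)
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp(value - highest) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


def sample(probabilities, temperature=1.0, rng=random):
    """Draw a word index at random from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probabilities, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]


# Toy distribution over a vocabulary of 3 words.
probs = [0.6, 0.3, 0.1]
for temperature in (1.0, 5.0, 0.33):
    print(temperature, [round(p, 3) for p in apply_temperature(probs, temperature)])

print(sample(probs, temperature=0.33))
```

Printing the adjusted distributions for a few temperatures is a quick way to see the flattening and sharpening effects described above before plugging the function into the generation loop.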

Now, we just have to let the solution choose each new word by itself and print it:

After tuning the temperature (0.33 seems to work well in our case): voilà! For the seed "nolan avance sur le chemin de pierre et grimpe les marches .", the script prints the following paragraph (30 new words):

nolan avance sur le chemin de pierre et grimpe les marches.
— oui, je vais vous expliquer.
lothar se retourne et jette un œil au jeune homme. le jeune homme se redresse, et s’ assoit sur le [...]

Even with the early stopping and low accuracy, the results look great and readable, with understandable syntax and grammar.

Conclusion

Great! We now have a neural network and some scripts capable of generating text on demand!

As you probably noticed, the raw result of the neural network trained during the tutorial is not amazing... We can probably get better results by increasing the size of the RNN, tuning the model to limit variance, etc.

Applying these modifications could increase the capacity of the neural network to generate good, less fuzzy phrases.

So, yes, an LSTM can be used to generate not-so-bad text without much instruction, but we cannot say that it is a real text with a global meaning. Regarding the sense of a paragraph (even with a network trained longer than in this tutorial), we are far from something readable. After a few words, the sentences do not mean anything as a whole.

Why?

Because an author does not work like that when writing a novel. A novel is not just an endless sequence of words (it would be too easy otherwise!).
If we can guess the next word of a phrase, it does not mean we are able to guess the complete purpose of a paragraph. Or a chapter.

In many cases, an author writes with a purpose: he or she has a target to which he or she wants to bring the reader. And this target cannot be explained only from the sequence of previous words.

How can we tackle this?

For the time being, we will try only to improve the generation of sentences, by detecting patterns in the sequences of sentences in the novel, not only in the sequences of words.

It could be an improvement because, by doing that, the context of a paragraph (is it a description of a countryside? a dialog between characters? which people are involved? what are the previous actions? etc.) could emerge and be used to wisely select the next sentence of the text.

However, there is a difficulty. As the number of words in a sentence can vary, we are not able to use the same technique as the previous one to generate text. At least, not only. We will still use the previous model to generate complete sentences (i.e. lists of words ending with a punctuation mark: “.”, “?”, etc.).

But we will vectorize all sentences of the text, and try to find patterns in sequences of these vectors. In order to do that, we will use an unsupervised approach with Doc2Vec.

That's what we'll try to do in the next part of the tutorial…

Thanks for reading!