Text Generation using Bidirectional LSTM and Doc2Vec models 3/3

Image Designed by ikatod / Freepik

If you landed directly on this page, I suggest reading the first two parts of this series first.

Note: the notebook of this article is available on github.

It is now time to generate some text from the models we trained!

As a recap:

  • we trained a first bidirectional LSTM model to predict the next word of a given sequence of 30 words,
  • we trained a sentence-based doc2vec model on the whole input text,
  • we trained a second bidirectional LSTM model to predict the best vectorized sentence following a sequence of 15 vectorized sentences.

So, what will the text generation process look like? We first have to provide a seed of 15 sentences. Then:

  1. using the last 30 words of the seed, we generate 10 candidates for the next sentence.
  2. we infer their vectors using the doc2vec model,
  3. we calculate the “best vector” for the sentence following the 15 phrases of the seed,
  4. we compare the inferred vectors with the “best vector” and pick the closest one,
  5. we add the generated sentence corresponding to this vector at the end of the seed, as the next sentence of the text.
  6. then, we loop over the process.

0. Import libraries and parameters

To start, we have to load the models we generated in the previous parts.

So we load:

  • the vocabulary, containing all words of my text,
  • the doc2Vec model, used to generate vectors of a given sentence,
  • two Keras models: one to predict words, the other to select sentences.

The next step is to create some functions. I divide them into three categories:

  • functions to generate candidate sentences for a given sequence of words,
  • functions to select the best candidate in a set of sentences,
  • a function to perform the whole pipeline: generate candidate sentences, select the best one, and loop.

1. Functions to generate Candidates of Sentences

I create four different functions:

  • sample() : used to select the next word of a given sequence of words,
  • create_seed() : useful to prepare the seed sequence of words at the beginning (tokenization, etc.)
  • generate_phrase() : used to create a sentence (sequence of words) based on previous words.
  • define_phrases_candidates() : used to generate a list of potential "next sentences" for a given sentence.

The sample() function is the same as the one used in the first part of this article: it is used to pick words from our vocabulary.

We do not simply take the word with the highest prediction. Instead, we randomly draw a word from our vocabulary according to its probability of being the next word (as given by our first bidirectional LSTM model).

As a recap: to tune this probability, we introduce a “temperature” that smooths or sharpens the distribution.

  • if temperature = 1.0, each word is drawn with the probability given by the word prediction model (its probability of being the next word in the sequence, compared to the other words in the dictionary),
  • if temperature is large (much bigger than 1), the distribution is flattened: the gap between likely and unlikely words shrinks, so unlikely words get a larger share of the probability. A wider variety of words will be picked from the vocabulary.
  • if temperature is small (close to 0), the distribution is sharpened: small probabilities are pushed to values close to 0, so fewer distinct words will be picked from the vocabulary.
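
To make the effect of the temperature concrete, here is a minimal sketch using only the standard library. It is not necessarily the exact sample() of the notebook: the helper apply_temperature() and the rng parameter are assumptions added for illustration.

```python
import math
import random


def apply_temperature(preds, temperature=1.0):
    """Rescale a probability distribution with a temperature (illustrative helper)."""
    # Work in log space; the small floor avoids log(0).
    logits = [math.log(max(p, 1e-12)) / temperature for p in preds]
    # Softmax, shifted by the maximum for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(preds, temperature=1.0, rng=random):
    """Draw the index of the next word according to the rescaled distribution."""
    probs = apply_temperature(preds, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With preds = [0.7, 0.2, 0.1], a temperature of 10 brings the three probabilities close to each other, while a temperature of 0.1 gives almost all the mass to the first word.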

The create_seed() function prepares seed sequences, which is especially useful when the number of words in the seed phrase is lower than the number of words expected for a sequence (our first model requires 30 words as input).
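
A possible sketch of this preparation step is shown here. It assumes a plain whitespace tokenization and a left padding with a filler token; both the filler parameter and its "<pad>" default are hypothetical, and in practice the filler should be a word that exists in the vocabulary.

```python
def create_seed(seed_text, nb_words_in_seq=30, filler="<pad>"):
    """Return exactly nb_words_in_seq tokens ending with the last words of the seed."""
    # Assumption: lowercase + whitespace split stands in for the real tokenization.
    words = seed_text.lower().split()
    # Keep only the most recent words if the seed is too long.
    words = words[-nb_words_in_seq:]
    # Left-pad with the filler token if the seed is too short.
    padding = [filler] * (nb_words_in_seq - len(words))
    return padding + words
```

Padding on the left keeps the real words at the end of the sequence, closest to the word the model has to predict.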

The generate_phrase() function is used to create the next phrase following a given sentence.

It requires as inputs:

  • the previous sentence,
  • the maximum number of words in the sentence we want to generate,
  • the temperature of the sample function.

If a punctuation "word" (".", "?", "!", ":", "…") is picked before the maximum number of words is reached, the function ends.

The define_phrases_candidates() function provides a list of potential sentences for a given previous sentence and a specific temperature.
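
The two functions can be sketched as follows. This is a simplified illustration, not the notebook code: predict_next_word is a hypothetical callable that would wrap the word prediction model and sample(), and the parameter names and defaults are assumptions.

```python
END_OF_SENTENCE = {".", "?", "!", ":", "…"}


def generate_phrase(previous_words, predict_next_word, max_words=50, temperature=1.0):
    """Generate words until a punctuation token is drawn or max_words is reached."""
    context = list(previous_words)
    phrase = []
    for _ in range(max_words):
        # predict_next_word(context, temperature) is assumed to return one word.
        word = predict_next_word(context, temperature)
        phrase.append(word)
        if word in END_OF_SENTENCE:
            break
        # Slide the window: drop the oldest word, append the new one.
        context = context[1:] + [word]
    return phrase


def define_phrases_candidates(previous_words, predict_next_word,
                              nb_candidates=10, max_words=50, temperature=1.0):
    """Generate several independent candidate sentences from the same context."""
    return [
        generate_phrase(previous_words, predict_next_word, max_words, temperature)
        for _ in range(nb_candidates)
    ]
```

Because each word is drawn at random, calling generate_phrase() several times on the same context yields different candidates, which is what gives the second model something to choose from.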

2. Functions to select the best sentence

I create three different functions:

  • create_sentences() : used to create a sequence of words (list) from a single phrase,
  • generate_training_vector() : used to predict the best next vectorized sentence for a given sequence of vectorized sentences,
  • select_next_phrase() : used to pick the best candidate for the next phrase.

The create_sentences() function generates a sequence of words (a list) for a given spaCy doc item.

It will be used to create a sequence of words from a single phrase.

The generate_training_vector() function is used to predict the next vectorized sentence for a given sequence of vectorized sentences.

The select_next_phrase() function allows us to pick the best candidate for the next phrase.

  • First, it calculates the vector of each candidate.
  • Then, it computes the cosine similarity between each of these vectors and the vector produced by generate_training_vector(), and picks the candidate with the highest similarity.
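
The selection step can be sketched like this. To keep the example self-contained, the candidate vectors are passed in already computed (in the real pipeline they would be inferred with the doc2vec model), so the signature below is an assumption rather than the one used in the notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if one of them is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_next_phrase(candidates, candidate_vectors, target_vector):
    """Return the candidate closest to target_vector, with its similarity score."""
    scores = [cosine_similarity(vec, target_vector) for vec in candidate_vectors]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]
```

Cosine similarity only compares the direction of the vectors, not their length, so a candidate is selected because it "points" the same way as the predicted vector.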

3. Text generation — workflow

Now, I create a specific function, generate_paragraph(): it combines all the previous functions to generate text.

In more detail, this function takes the following parameters:

  • phrase_seed : the sentence seed for the first word prediction. It is a list of words.
  • sentences_seed : the seed sequence of sentences. It is a list of sentences.
  • max_words: the maximum number of words for a new generated sentence.
  • nb_words_in_seq: the number of words to keep as seed for the next word prediction.
  • temperature: the temperature for the word prediction.
  • nb_phrases: the number of sentences to generate in the process.
  • nb_candidates_sents: the number of candidates of sentences to generate for each new sentence.
  • verbose: verbosity of the script.
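
The overall loop could look like the sketch below. It is a simplified illustration under several assumptions: the models are replaced by hypothetical callables (make_candidates, infer_vector, predict_best_vector, similarity) passed as arguments, and the max_words, temperature and verbose parameters are left out for brevity.

```python
def generate_paragraph(phrase_seed, sentences_seed, make_candidates, infer_vector,
                       predict_best_vector, similarity,
                       nb_words_in_seq=30, nb_phrases=5, nb_candidates_sents=10):
    """Generate nb_phrases sentences, choosing the best candidate at each step."""
    words = list(phrase_seed)                              # rolling word context
    vectors = [infer_vector(s) for s in sentences_seed]    # rolling sentence vectors
    generated = []
    for _ in range(nb_phrases):
        # 1. Generate candidate sentences from the last words of the text.
        candidates = make_candidates(words[-nb_words_in_seq:], nb_candidates_sents)
        # 2. Infer a vector for each candidate.
        cand_vectors = [infer_vector(c) for c in candidates]
        # 3. Predict the "best vector" from the previous sentence vectors.
        target = predict_best_vector(vectors)
        # 4. Pick the candidate whose vector is closest to the best vector.
        best = max(range(len(candidates)),
                   key=lambda i: similarity(cand_vectors[i], target))
        # 5. Append the chosen sentence and slide both windows.
        generated.append(candidates[best])
        words.extend(candidates[best])
        vectors = vectors[1:] + [cand_vectors[best]]
    return generated
```

Passing the models as callables keeps the loop independent from Keras and doc2vec, which makes it easy to try with stub functions before plugging in the real models.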

Now, we can perform the complete text generation workflow!

First, we have to define the sentences in the seed (15 sentences). In order to do that, I write 15 sentences that make sense. I create a small dialog between two of the main characters of my book. I expect the whole process to continue the dialog, the way I do it in my books, and maybe switch to a more descriptive paragraph.

We concatenate them in a single phrase and create the seed sentence using the create_seed() function:

Now, I run the script to generate the text:

It takes some time, depending on the number of sentences that must be generated, the number of candidates that must be created for each new sentence, etc. If you want to see the whole process, you can switch the verbose parameter to "True".

After a while, the text is generated, and it's easy to display it:

— oui , c’ est que ce que vous êtes à l’ attaque de ces monstres …
nolan se tourne vers mara qui se racle la gorge .
— c’ est un peu de temps !
panicaut se tourne vers silvi .
— c’ est vrai , renchérit lothar , c’ est une chose que vous êtes tous les trois porteurs …

And the result is… well, mixed feelings :

  • It seems the context of the text is correctly understood (it's still a dialog), but not in the way I expected: the result is a bit too fuzzy for me. It’s just an intuition; I need to perform more tests, and probably tune my models further, to get something really good.
  • The selected sentences are definitely more accurate than if I had used only the first Keras model to generate text word by word. At least, the second model seems to avoid sentences that do not mean anything. It could be better, but it's clearly an improvement.
  • The first model (word-by-word text generation) can definitely be improved, probably by using more data as input (from a bigger book!).

Conclusion

Well, we now have a test solution using different neural networks to generate text:

  • First, one neural network (bidirectional LSTM) is used to generate new sentences, word by word. The new sentences are produced using the previous words in the text.
  • Using this neural network, we generate several sentences, which are candidates to be the next phrase of the existing text.
  • Then, we use a second neural network (bidirectional LSTM) to select the best candidate among all these sentences. The selection is based on several previous sentences in the text.

What can we do to go further?

As you have probably noticed, the results of the neural networks trained during this exercise are not the best, for many reasons. We can probably get better results from each neural net:

  1. by increasing the size of the RNNs: bigger layers should help, at the cost of a much longer training time,
  2. by increasing the number of epochs,
  3. by tuning the doc2Vec model. I did not spend a lot of time on it, so some improvements could be made here,
  4. and of course, by having more data. I only used the material I had from the book I wrote. Actually, I used more or less 450k words and about 50k sentences: it is probably not enough... The main issues come from the first model, so I should test the process on a bigger book, or a set of books, at least three times bigger in terms of words and sentences.

That’s all for the tests performed over this article. There are plenty of improvements to investigate, and I hope you enjoyed reading this…

However, please tell me:

  • if you find this article interesting,
  • if you have ideas to improve the process (do not hesitate to use the data I provide on my GitHub repository, for French readers),
  • if you want to try this approach on another text (and share your results!).

Last but not least, if by any chance you want to read my fantasy book, you can find it on Apple iBook store and Amazon Kindle for free…

Thank you!

Interested in AI, machine learning and data analytics. French writer ; fantasy and science fiction enthusiast.