Text Generation using Bidirectional LSTM and Doc2Vec models 2/3

Image Designed by ikatod / Freepik

If you landed directly on this page, I suggest reading the first part of this article first. It describes how to create an RNN model that generates text word by word.

I ended the previous part by explaining that I would try to improve sentence generation by detecting patterns in sequences of sentences, not only in sequences of words.

This could be an improvement because the context of a paragraph (is it a description of the countryside? a dialogue between characters? which people are involved? what happened before?) could emerge and be used to choose the next sentence of the text wisely.

The process is similar to the previous one. However, I have to vectorize every sentence in the text and try to find patterns in sequences of these vectors.

In order to do that, we will use Doc2Vec.

Note: the notebook for this article is available on GitHub.

1. Doc2Vec

In short, we will transform each sentence of our text into a vector in a specific space. The great thing about this approach is that we will be able to compare them; for example, to retrieve the sentence most similar to a given one.

Last but not least, all vectors will have the same dimension, whatever the number of words in the corresponding sentence.

This is exactly what we are looking for: I will be able to train a new LSTM that tries to capture patterns in sequences of vectors of the same dimension.

I have to be honest: I am not sure we can perform such a task with enough accuracy, but let’s run some tests. It is an experiment; at worst, it will be a good exercise.

So, once all sentences have been converted to vectors, we will train a new bidirectional LSTM. Its purpose will be to predict the vector that best follows a given sequence of vectors.

Then how will we generate text?

Pretty easy: thanks to our previous LSTM model, we will generate several sentences as candidates for the next sentence. We will infer their vectors using the trained Doc2Vec model, then pick the one closest to the prediction of our new LSTM model.

1.1 Create the Doc2Vec Model

Doc2Vec expects its input to be a list of words plus a label for each sentence:

Example: ['tobus', 'ouvre', 'la', 'porte', '.'] LABEL1

So we have to extract each sentence from the text and split it into words.

By convention, I assume a sentence ends with “.”, “?”, “!”, “:” or “…”. The script reads each text and creates a new sentence each time it reaches one of these characters.
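As a minimal sketch of this convention, the snippet below groups a flat list of tokens into sentences and builds one label per sentence. It is plain Python and not the spaCy-based code of the notebook; the function name split_sentences, the lowercasing step, the "ID" label prefix and the sample tokens are assumptions made for illustration.

```python
# Sentence terminators, taken from the convention stated above.
TERMINATORS = {".", "?", "!", ":", "…"}


def split_sentences(tokens):
    """Group a flat list of tokens into sentences, closing one at each terminator."""
    sentences, current = [], []
    for token in tokens:
        current.append(token.lower())  # assumption: words are lowercased
        if token in TERMINATORS:
            sentences.append(current)
            current = []
    if current:  # keep trailing words that never reached a terminator
        sentences.append(current)
    return sentences


# Hypothetical token stream; in the article the tokens come from the text files.
tokens = ["Tobus", "ouvre", "la", "porte", ".", "Qui", "est", "là", "?"]
sentences = split_sentences(tokens)

# One label per sentence (the "ID" prefix is an arbitrary choice).
labels = ["ID" + str(i) for i in range(len(sentences))]
```

Each sentence keeps its final punctuation mark as a token, which matches the example input shown earlier.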

First, we load the Doc2Vec library, load our data and set some parameters:

  • all texts are stored in the data_dir directory,
  • the file_list list contains the names of all text files in the data_dir directory,
  • the save_dir will be used to save models.

I create the list of sentences for the Doc2Vec model: to split sentences easily, I use the spaCy library. Then I create a list of labels for these sentences.

1.2 Train the Doc2Vec Model

I also create a dedicated function to train the Doc2Vec model. Its purpose is to make the training parameters easy to update:

A note regarding the parameters of the function: the default values were chosen empirically.

Now it's time to train the Doc2Vec model. Simply run the command:

Here are some insights into the parameters used:

  • dimensions: 300 dimensions seem to work well for common subjects. In my case, after a few tests, I chose 500 dimensions,
  • epochs: below 10 epochs, results are not good enough (similarity does not work well), and a larger number of epochs produces vectors that differ too little from each other. So I chose 20 epochs for the training,
  • min_count: I want to include all words in the training, even those with very few occurrences. Indeed, I assume that, for my test, specific words could be important. I set the value to 0, but 3 to 5 should be OK,
  • sample: 0.0. I do not want to randomly downsample higher-frequency words, so I disabled it,
  • hs and dm: each time I infer a new vector from the trained model for a given sentence, I want to get the same output vector. To achieve that (strangely, it’s not so intuitive), I need to use distributed bag of words as the training algorithm (dm=0) and hierarchical softmax (hs=1). Indeed, for my purpose, distributed memory and negative sampling seem to give worse results.

2. Create the Input Dataset

Note: I do not use the vectors generated during training, because I want to compare them to vectors inferred for sentences the model has not seen. It’s better to generate all of them in the same way.

Now, to create the Keras input dataset (X_train, y_train), we have to follow these guidelines:

  • 15 consecutive vectors from Doc2Vec as input,
  • the next vector (16th) as output.

So the shape of X_train must be (number of sequences, 15, 500) and the shape of y_train must be (number of sequences, 500).
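A minimal sketch of this windowing with NumPy is shown below. It assumes the inferred sentence vectors are already stacked in reading order; the function name build_dataset and the random stand-in vectors are illustrative only.

```python
import numpy as np


def build_dataset(sentence_vectors, seq_len=15):
    """Slide a window of seq_len vectors over the text; the next vector is the target."""
    vectors = np.asarray(sentence_vectors, dtype=np.float32)
    n_sequences = len(vectors) - seq_len
    if n_sequences <= 0:
        raise ValueError("need more than seq_len sentence vectors")
    X = np.stack([vectors[i:i + seq_len] for i in range(n_sequences)])
    y = vectors[seq_len:]  # y[i] is the vector that follows window X[i]
    return X, y


# Random stand-ins for 100 inferred sentence vectors of dimension 500.
rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(100, 500))

X_train, y_train = build_dataset(fake_vectors)
print(X_train.shape, y_train.shape)
```

With 100 sentences and windows of 15, this yields 85 sequences.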

3. Create the Keras Model

First, we load the library and create a function that defines a simple Keras model:

  • a bidirectional LSTM,
  • with a size of 512 and ReLU as activation,
  • then a dropout layer of 0.5.

The network does not output a probability but directly the next vector for a given sequence. So I finish it with:

  • a simple dense layer of the size of the vector dimension.

I use Adam as the optimizer, and the loss is computed using logcosh.
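The architecture described above could look like the following sketch, written against tensorflow.keras. The function name build_model, the learning rate and the tiny demo sizes are assumptions; only the layer types, the 512 units, the 0.5 dropout, Adam and the log-cosh loss come from the description.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout
from tensorflow.keras.losses import LogCosh
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam


def build_model(seq_len=15, vector_dim=500, rnn_size=512, learning_rate=0.001):
    """Bidirectional LSTM that regresses the next sentence vector."""
    model = Sequential([
        Input(shape=(seq_len, vector_dim)),
        Bidirectional(LSTM(rnn_size, activation="relu")),
        Dropout(0.5),
        Dense(vector_dim),  # linear output: a vector, not a probability
    ])
    model.compile(
        loss=LogCosh(),
        optimizer=Adam(learning_rate=learning_rate),  # assumed learning rate
        metrics=["accuracy"],
    )
    return model


# Small sizes so the demo runs quickly; the article uses 500 and 512.
model = build_model(seq_len=15, vector_dim=8, rnn_size=16)
prediction = model.predict(np.zeros((2, 15, 8), dtype="float32"), verbose=0)
```

The output layer has no activation because the target is a raw Doc2Vec vector rather than a class.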

Then we create the model:

And we train it:

Great! After a few minutes, we have a model that predicts the best next sentence vector for a given sequence of sentences.

A few remarks regarding the results:

  • the loss drops to 0.1073 and the accuracy is around 12.5%,
  • the val_loss is around 0.1116, with a validation accuracy around 9%.

4. Conclusion

However, let's check whether this is good enough to select the best next sentence of a text. I hope it will be sufficient for my test: for a given sequence of sentences, there is no clear determinism in which sentence comes next. Indeed, the next sentence is not obvious and could vary slightly. My objective is only to help select the best next sentence from a list of candidate sentences.

To test that, for a given sequence of sentences, we have to:

  • generate different candidate sentences using our first LSTM model,
  • infer their vectors using our Doc2Vec model,
  • generate the best next vector using our second LSTM model,
  • then select the candidate whose vector is most similar to it.
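The last step of this list can be sketched with NumPy as follows. Cosine similarity is assumed here as the similarity measure, and the function name pick_closest and the two-dimensional toy vectors are illustrative; in practice the candidates would be Doc2Vec vectors and the target would be the output of the second LSTM.

```python
import numpy as np


def pick_closest(candidate_vectors, predicted_vector):
    """Return the index of the candidate most similar to the predicted vector."""
    candidates = np.asarray(candidate_vectors, dtype=np.float64)
    target = np.asarray(predicted_vector, dtype=np.float64)
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(target)
    # Cosine similarity of each candidate with the target (guarded against zero norms).
    similarities = candidates @ target / np.maximum(norms, 1e-12)
    return int(np.argmax(similarities)), similarities


# Toy vectors standing in for the inferred candidates and the predicted vector.
candidates = [[1.0, 0.0], [0.6, 0.8], [-1.0, 0.0]]
predicted = [0.5, 0.9]

best_index, scores = pick_closest(candidates, predicted)
print(best_index)
```

Here the second candidate points in almost the same direction as the prediction, so it is the one selected.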

That’s what I’ll try to do in the next and last part of this experiment, which is available here.

Thanks for reading!

Interested in AI, machine learning and data analytics. French writer ; fantasy and science fiction enthusiast.