Automatically Generate Hotel Descriptions with LSTM
How to create a generative model for text using LSTM recurrent neural networks in Python with Keras
In order to build a content-based recommender system, I collected hotel descriptions for 152 hotels in Seattle. I was thinking of some other ways to torture this good-quality, clean data set.
Hey! Why not train my own text-generating neural network on hotel descriptions? That is, create a language model for generating natural-language text (i.e. hotel descriptions) by implementing and training a word-based recurrent neural network.
The objective of this project is to generate new hotel descriptions given some input text. I do not expect the results to be accurate; as long as the predicted text is coherent, I will be happy.
Thanks to Shivam Bansal for the tutorial that helped me get through the exercise.
The Data
We have 152 descriptions (i.e. hotels) in total in our data set.
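The data collection itself isn't shown here; below is a minimal loading sketch, assuming the descriptions sit in a CSV file with one row per hotel (the file name and the desc column are assumptions, not necessarily the actual source).
import pandas as pd
# Load the raw hotel descriptions (file name and column name are assumptions).
df = pd.read_csv('Seattle_Hotels.csv')
all_descriptions = list(df['desc'].astype(str).values)
print(len(all_descriptions))  # expect 152 descriptions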
Have a peek at the first description:
corpus = [x for x in all_descriptions]
corpus[:1]
Text Pre-processing
Tokenization
We use Keras’ Tokenizer to vectorize the text descriptions:
- We remove all punctuation.
- We turn the texts into space-separated sequences of words in lowercase.
- These sequences are then split into lists of tokens.
- We set char_level=False, so every word is treated as a token rather than each character.
- The lists of tokens are then indexed and/or vectorized.
from keras.preprocessing.text import Tokenizer
t = Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, document_count=0)
t.fit_on_texts(corpus)
After tokenization, we can then:
- Explore a dictionary of words and their counts.
- Explore a dictionary of words and how many documents each appeared in.
- Explore an integer count of the total number of documents that were used to fit the Tokenizer.
- Explore a dictionary of words and their uniquely assigned integers.
print(t.word_counts)
print(t.word_docs)
print(t.document_count)
print(t.word_index)
print('Found %s unique tokens.' % len(t.word_index))
We then convert the corpus into sequences of tokens, as shown in the sketch below.
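Below is a sketch of that conversion using the fitted tokenizer t: each description is mapped to its word indices, and every prefix of at least two tokens becomes one training sequence (the helper name get_sequence_of_tokens is an assumption).
def get_sequence_of_tokens(corpus):
    # Map each description to its list of word indices, then emit every
    # n-gram prefix of that list as one training sequence.
    input_sequences = []
    for line in corpus:
        token_list = t.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            input_sequences.append(token_list[:i + 1])
    return input_sequences

inp_sequences = get_sequence_of_tokens(corpus)
total_words = len(t.word_index) + 1  # vocabulary size plus one for the padding index
print(inp_sequences[:5])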
The above lists of integers represent the n-gram phrases generated from the corpus. For example, a sentence such as “located on the southern tip of lake Union” is represented by the indices of its words, with one prefix sequence per added word, as illustrated below.
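As an illustration, with made-up integer indices (not the values the tokenizer actually assigns):
# hypothetical indices: located=6, on=2, the=1, southern=185, tip=186, of=3, lake=43, union=44
# the sentence then yields prefix sequences such as:
# [6, 2], [6, 2, 1], [6, 2, 1, 185], [6, 2, 1, 185, 186], ..., [6, 2, 1, 185, 186, 3, 43, 44]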
Pad sequences and create predictors and label
- Pad the sequences to the same length.
- pad_sequences transforms the lists of integers into a 2D NumPy array of shape (num_samples, maxlen).
- The predictors and label are created from these padded sequences, as in the sketch below: each label is the last word of a sequence, and the predictors are all the words before it.
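Here is a sketch of the padding and predictor/label split, assuming standalone-Keras import paths (they may differ across Keras/TensorFlow versions) and an assumed helper name:
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

def generate_padded_sequences(input_sequences, total_words):
    # Pre-pad every n-gram sequence with zeros up to the length of the longest one.
    max_sequence_len = max(len(seq) for seq in input_sequences)
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    # The last token of each sequence is the label; everything before it is the predictor.
    predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
    # One-hot encode the label over the whole vocabulary.
    label = to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences, total_words)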
As you can see, if we want accuracy, it is going to be very, very hard.
Modeling
We can now define our single LSTM model; a sketch of the definition follows the list below.
- A single hidden LSTM layer with 100 memory units.
- The network uses dropout with a probability of 0.1 (10%).
- The output layer is a Dense layer with the softmax activation function, outputting a probability between 0 and 1 for each of the 3420 words in the vocabulary.
- Our problem is a single-word classification problem with 3420 classes, so we optimize the log loss (cross entropy) and use the Adam optimization algorithm for speed.
- There is no test data set. We are modeling the entire training data to learn the probability of each word in a sequence.
- According to the Keras documentation, at least 20 epochs are required before the generated text starts sounding coherent, so we will train for 100 epochs.
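Here is a sketch of a model matching the bullets above; the embedding dimension of 10 is an assumption, the rest follows the description:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense

def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1  # predictors are one token shorter than the padded sequences
    model = Sequential()
    model.add(Embedding(total_words, 10, input_length=input_len))  # embedding size 10 is an assumption
    model.add(LSTM(100))                                           # single hidden LSTM layer, 100 memory units
    model.add(Dropout(0.1))                                        # dropout with probability 0.1
    model.add(Dense(total_words, activation='softmax'))            # one probability per word in the vocabulary
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

model = create_model(max_sequence_len, total_words)
model.fit(predictors, label, epochs=100, verbose=2)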
Generating text using the trained LSTM network
- At this point, we can write a function that takes a seed text as input and predicts the next words.
- We tokenize the seed text, pad the sequence and pass it to the trained model, as sketched below.
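Here is a sketch of such a function, matching the call signature used below. It reuses the fitted tokenizer t, and uses model.predict plus argmax since predict_classes is no longer available in recent Keras versions.
import numpy as np
from keras.preprocessing.sequence import pad_sequences

def generate_text(seed_text, next_words, model, max_sequence_len):
    # Repeatedly predict the next word and append it to the running text.
    for _ in range(next_words):
        token_list = t.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        # Look up the word with the predicted index.
        next_word = ''
        for word, index in t.word_index.items():
            if index == predicted:
                next_word = word
                break
        seed_text += ' ' + next_word
    return seed_text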
Try it out!
- I randomly choose “hilton seattle downtown” as the seed text, and I want the model to return the next 100 words.
print(generate_text("hilton seattle downtown", 100, model, max_sequence_len))
- I choose “best western seattle airport hotel” as the seed text, and I want the model to predict the next 200 words.
print(generate_text("best western seattle airport hotel", 200, model, max_sequence_len))
- I choose “located in the heart of downtown seattle” as the seed text, and I want the model to predict the next 300 words.
print(generate_text('located in the heart of downtown seattle', 300, model, max_sequence_len))
Conclusion
- There were no misspellings.
- The sentences look realistic.
- Some phrases get repeated again and again, in particular when predicting a larger number of words as output for a given seed.
A few thoughts on improvements: more training data, more training epochs, more layers, more memory units per layer, and predicting fewer words as output for a given seed.
The Jupyter notebook can be found on GitHub. Enjoy the rest of the weekend!
References: