Sentence Prediction Using a Word-level LSTM Text Generator — Language Modeling Using RNN
This project was originally built for one of my clients on Upwork. You can find the code on my GitHub repo. Unfortunately, it does not contain the dataset (corpus) on which I trained the model, for privacy reasons. But you can train it on any text corpus you want.
Let's begin with the problem statement. There is some XYZ company that deals with all sorts of household repair work: electricity, plumbing, and everything else that comes up in a home. They wanted a smart solution for their complaint section (where customers register their complaints about repairs). When a user types two or three words, the system should come up with several suggested sentences, not just words. It is the same idea as the keyboard on a cell phone suggesting two or three words as we type, except that here we have to generate two or three sentences instead. So let's get started:
The data I got was very noisy: there were many repeated sentences and thousands of typos, misspellings, slang terms, and punctuation errors. As in any other machine learning project, it was necessary to analyze, clean, and pre-process this data.
Pre-processing here covers removing redundant data, correcting misspellings, removing incorrect punctuation, and removing words that do not appear very often (fewer times than a minimum threshold we set). After that, we remove the sentences that are too short or too long.
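The cleaning steps above can be sketched roughly as follows. This is only an illustration, not the code from the repo: the function name `clean_corpus`, the threshold `MIN_WORD_FREQ`, and the length bounds are assumptions, and spelling correction is left out because it depends heavily on the corpus.

```python
import re
from collections import Counter

MIN_WORD_FREQ = 2         # assumed threshold: drop words seen fewer times than this
MIN_LEN, MAX_LEN = 3, 20  # assumed sentence-length bounds, in words


def clean_corpus(sentences):
    """Deduplicate, strip punctuation, drop rare words and odd-length sentences."""
    # Lowercase and replace everything except letters, digits and whitespace.
    normalized = [re.sub(r"[^a-z0-9\s]", " ", s.lower()).split() for s in sentences]

    # Remove exact duplicates while preserving the original order.
    seen, unique = set(), []
    for words in normalized:
        key = tuple(words)
        if key not in seen:
            seen.add(key)
            unique.append(words)

    # Count word frequencies over the deduplicated corpus.
    freq = Counter(w for words in unique for w in words)

    cleaned = []
    for words in unique:
        # Drop rare words, then keep the sentence only if its length is in range.
        kept = [w for w in words if freq[w] >= MIN_WORD_FREQ]
        if MIN_LEN <= len(kept) <= MAX_LEN:
            cleaned.append(kept)
    return cleaned


raw = ["The tap is leaking!!", "the tap is leaking", "The tap is broken.", "Tap leaking"]
cleaned = clean_corpus(raw)
```

On this toy input the duplicate sentence is removed, the rare word "broken" is dropped, and the two-word sentence is discarded as too short.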
The idea is to train the RNN on many sequences of words, each paired with a target next_word. As a simplified example, if each input is a list of four words, then the target is a list of only one element: the word that follows them in the original text.
First, we read our processed corpus and split all the strings into tokens:
After this, we convert the whole text into sequences of four words each. We then count all the unique words and also count how many times each word appears in the corpus.
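A minimal sketch of this step, assuming simple whitespace tokenization is enough once the corpus has been cleaned. The helper name `read_tokens` and the sample text are made up for illustration; in practice the text would come from reading the corpus file.

```python
def read_tokens(text):
    """Lowercase the corpus and split it on whitespace into word tokens."""
    return text.lower().split()


# Stand-in for the contents of the processed corpus file.
corpus = "hydrant requires repair plumber\nmachine is leaking"
tokens = read_tokens(corpus)
print(tokens[:4])  # ['hydrant', 'requires', 'repair', 'plumber']
```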
Out: ['hydrant', 'requires', 'repair', 'plumber']
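The windowing and counting can be sketched like this. It assumes the windows slide one word at a time over the whole token stream, which may differ from how the original code handled sentence boundaries; `make_sequences` and the sample tokens are illustrative names and data.

```python
from collections import Counter

SEQ_LEN = 4  # three input words plus one target word


def make_sequences(tokens, seq_len=SEQ_LEN):
    """Slide a window of seq_len words over the token list, one word at a time."""
    return [tokens[i:i + seq_len] for i in range(len(tokens) - seq_len + 1)]


tokens = "hydrant requires repair plumber to attend hydrant".split()
sequences = make_sequences(tokens)

# Vocabulary statistics: how often each unique word occurs.
word_counts = Counter(tokens)
unique_words = len(word_counts)
```

Seven tokens give four overlapping sequences, and `word_counts` tells us both the vocabulary size and which words are rare.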
Then we map each and every unique word in our corpus to a unique integer thanks to the Tokenizer class from keras.preprocessing.text.
Out: [1776, 98, 58, 909]
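To make the mapping concrete, here is a pure-Python stand-in for what the Tokenizer does. It is not the Keras implementation; it only mimics the convention that more frequent words get smaller indices, starting at 1, with 0 left free as a reserved index. The helper name `build_word_index` is hypothetical.

```python
from collections import Counter


def build_word_index(sequences):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(w for seq in sequences for w in seq)
    # most_common() sorts by descending count; ties keep first-seen order.
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}


sequences = [["tap", "is", "leaking", "badly"], ["is", "leaking", "badly", "again"]]
word_index = build_word_index(sequences)
encoded = [[word_index[w] for w in seq] for seq in sequences]
```

With the real Tokenizer, the equivalent steps are fitting it on the text and then converting the text to integer sequences.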
After this, we convert this list (which contains all the sequences in the numeric form) into a numpy array.
Next, we split the sequences into inputs and output labels for our model. As the sequence length is 4, we use the first three words as input, and the model has to predict the word that follows them. The fourth word is used as the label. After that, we convert our output labels into one-hot vectors, i.e. vectors that are 0 everywhere except for a single 1 at the index of the target word.
Side Note: Sometimes an out-of-memory error occurs at the to_categorical() call if you have very large data or a very large vocabulary. You can reduce the memory needed by passing an extra argument, to_categorical(dtype='float16'). The one-hot entries are only 0 and 1, so the smaller data type loses nothing. Note that a 16-bit limit (2¹⁶ = 65,536 distinct values) only matters if you store the integer word indices themselves in a 16-bit type; in that case, make sure your vocabulary size is not greater than 65,535. If the one-hot matrix still does not fit in memory, you will have to use batching. You can learn about Keras batching here.
Out: array([1776, 98, 58])
Out: array([0., 0., 0., …, 0., 0., 0.], dtype=float32)
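The split and the one-hot encoding can be sketched in plain Python as below, together with a rough estimate of how much memory the one-hot matrix needs. This is a stand-in for the NumPy slicing and to_categorical() call, not the original code; the helper names and the sample sizes (one million sequences, a 20,000-word vocabulary) are assumptions picked for illustration.

```python
def split_xy(encoded_sequences):
    """First three ids of each sequence are the input, the last id is the label."""
    X = [seq[:-1] for seq in encoded_sequences]
    y = [seq[-1] for seq in encoded_sequences]
    return X, y


def one_hot(label, vocab_size):
    """Vector of zeros with a single 1.0 at the label's index."""
    vec = [0.0] * vocab_size
    vec[label] = 1.0
    return vec


def one_hot_bytes(n_samples, vocab_size, bytes_per_value):
    """Memory needed for a dense one-hot matrix of shape (n_samples, vocab_size)."""
    return n_samples * vocab_size * bytes_per_value


encoded = [[4, 1, 2, 3], [1, 2, 3, 5]]
vocab_size = 6  # five words plus the reserved index 0
X, y = split_xy(encoded)
Y = [one_hot(label, vocab_size) for label in y]

# Hypothetical corpus: 1,000,000 sequences and a 20,000-word vocabulary.
float32_gb = one_hot_bytes(1_000_000, 20_000, 4) / 1e9  # 80.0
float16_gb = one_hot_bytes(1_000_000, 20_000, 2) / 1e9  # 40.0
```

Halving the width of the data type halves the memory, but the matrix still grows with both the number of sequences and the vocabulary size, which is why batching becomes necessary for large corpora.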
Now it's time to define our LSTM model. In this example, I used an embedding layer followed by two stacked LSTM layers with 50 units each.
Side Note About Embedding: Keras offers an Embedding layer that can be used for neural networks on text data.
It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.
The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.
It is a flexible layer that can be used in a variety of ways, such as:
- It can be used alone to learn a word embedding that can be saved and used in another model later.
- It can be used as part of a deep learning model where the embedding is learned along with the model itself.
- It can be used to load a pre-trained word embedding model, a type of transfer learning.
Then come the two fully connected (dense) layers. The first has 50 units, or neurons, and the second, which is our output layer, has as many units as the vocabulary size, because for every input our model predicts the probability of every word in our vocabulary. I experimented with a couple of optimizers (RMSprop, SGD, and Adam) and three different learning rates: 0.01, 0.001, and 0.0001. In my case, the Adam optimizer with a learning rate of 0.001 gave the best results. As for the loss function, I used categorical cross-entropy.
We are now ready to train our model. I added a ModelCheckpoint callback from keras.callbacks to save the weights after every epoch.
Then we fit our model with a batch size of 128 for 500 epochs. We also save the state of our tokenizer object, so that when we use this model to make predictions we don't have to repeat the mapping and all the other steps we have done previously.
After training, it's time to make predictions. We will build two models. One gives us only one sentence, and the other gives us three, four, or up to 10 sentences as suggestions. The second model is implemented using beam search decoding.
Let's start with the simple model, which gives us only one sentence. There is an important parameter that we need to define: the number of words to generate for the output sentence. My client only wanted three words in the output, but I also experimented with five and the results were good.
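To see what the output layer and the loss are doing, here is a small worked example in plain Python. It assumes the output layer uses a softmax activation, which is the usual pairing with categorical cross-entropy but is an assumption here; the scores and the four-word vocabulary are made up.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, probs):
    """Negative log-probability assigned to the true word."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)


logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for a four-word vocabulary
probs = softmax(logits)
target = [1.0, 0.0, 0.0, 0.0]   # the true next word is word 0
loss = categorical_cross_entropy(target, probs)
```

The loss is small when the model puts most of its probability on the true next word and grows as that probability shrinks, which is what the optimizer pushes against during training.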
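One common way to save the tokenizer state is to pickle it. The sketch below shows the round trip using a plain dictionary as a stand-in for the tokenizer object and an in-memory buffer as a stand-in for a file; both are assumptions made to keep the example self-contained.

```python
import io
import pickle

# Stand-in for the fitted tokenizer: only its word-to-index mapping.
word_index = {"is": 1, "leaking": 2, "badly": 3}

# An in-memory buffer stands in for a file opened in binary write mode.
buffer = io.BytesIO()
pickle.dump(word_index, buffer)

# Later, at prediction time: rewind and load the saved state back.
buffer.seek(0)
restored = pickle.load(buffer)
```

Loading the same mapping at prediction time guarantees that each word gets the same integer it had during training.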
The first thing we have to do is load our saved model and the tokenizer object. After that, I defined a function gen_text() to generate the output. First, we encode our input string into integers with the help of the tokenizer object we loaded, and then we check the length of the sequence. Our trained model only accepts sequences of exactly three words. So if the input sequence contains more than three words, we remove the extra words from the start, and if it contains fewer than three, we pad it at the start with the required number of entries. The padding value is the reserved index 0. All of this is done by the pad_sequences() function from the keras.preprocessing.sequence module.
Then we use model.predict_classes() to predict the index of the word with the highest probability (in recent TensorFlow versions, where predict_classes() has been removed, take the argmax of model.predict() instead). That index is looked up in the index_word dictionary of the tokenizer object, which gives us the predicted word. The predicted word is then appended to the input sequence, and the process repeats until the desired number of words has been generated, at which point we have our predicted sentence.
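The truncate-or-pad behavior described above can be sketched in plain Python. This mimics the default "pre" padding and "pre" truncating of pad_sequences() for a single sequence; the helper name `pad_pre` is made up for illustration.

```python
def pad_pre(sequence, maxlen=3, value=0):
    """Keep the last maxlen ids; left-pad with the reserved index if too short."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


too_long = pad_pre([7, 8, 9, 10, 11])  # extra words are dropped from the start
too_short = pad_pre([5])               # zeros are added at the start
exact = pad_pre([1, 2, 3])             # already the right length
```

Keeping the last three words rather than the first three matters: the most recent words are the ones that should drive the next-word prediction.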
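The generation loop can be sketched as follows. The signature shown here is hypothetical and differs from the gen_text() in the repo: the trained model is replaced by a `predict_probs` callable that returns one probability per vocabulary index, and `toy_predict` is a made-up deterministic stand-in so the example runs without Keras.

```python
def gen_text(seed_ids, predict_probs, index_word, n_words=3, seq_len=3):
    """Greedy decoding: repeatedly append the most probable next word."""
    ids = list(seed_ids)
    generated = []
    for _ in range(n_words):
        # Keep the last seq_len ids and left-pad with the reserved index 0.
        window = ids[-seq_len:]
        window = [0] * (seq_len - len(window)) + window
        probs = predict_probs(window)
        # Index of the highest probability, i.e. an argmax over the vocabulary.
        next_id = max(range(len(probs)), key=probs.__getitem__)
        generated.append(index_word[next_id])
        ids.append(next_id)
    return " ".join(generated)


def toy_predict(window):
    """Made-up stand-in for the model: always predicts (last id % 4) + 1."""
    probs = [0.0] * 5
    probs[window[-1] % 4 + 1] = 1.0
    return probs


index_word = {1: "is", 2: "not", 3: "working", 4: "machine"}
suffix = gen_text([4], toy_predict, index_word)
print("machine " + suffix)  # machine is not working
```

Because each step keeps only the single best word, this loop can only ever produce one sentence per input, which is the limitation beam search addresses.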
Here are some of the results:
Input: hydrant requires repair
Output: hydrant requires repair is not working
Input: describe the problem
Output: describe the problem please attend to
Input: door and window
Output: door and window in the kitchen is not working in the
Input: machine is leaking
Output: machine is leaking and needs to be replaced
Input: modus to install
Output: modus to install and integrate not wifi
The model which predicts multiple sentences using beam search decoding will be discussed in the next story.
Here are some links that you may also find useful:
- This project is heavily based on this blog post
- Additional Readings:
- The Unreasonable Effectiveness of Recurrent Neural Networks
- A Brief Summary of Maths Behind RNN
- How many LSTM cells should I use?
- What’s the difference between a bidirectional LSTM and an LSTM?
- An Introduction to Dropout for Regularizing Deep Neural Networks
See you in the next story!