Language Modelling and Text Generation using LSTMs — Deep Learning for NLP

Shivam Bansal
Mar 26, 2018 · 5 min read

With the latest developments and improvements in the field of deep learning and artificial intelligence, many exacting tasks of Natural Language Processing are becoming facile to implement and execute. Text Generation is one such task which can be be architectured using deep learning models, particularly Recurrent Neural Networks. An example of text generation is the recently released Harry Potter chapter which was generated by artificial intelligence.

Image for post
Image for post

Text Generation is a type of Language Modelling problem. Language Modelling is the core problem for a number of of natural language processing tasks such as speech to text, conversational system, and text summarization. A trained language model learns the likelihood of occurrence of a word based on the previous sequence of words used in the text. Language models can be operated at character level, n-gram level, sentence level or even paragraph level.

In this post, I will explain how to create a language model for generating natural language text by implement and training state-of-the-art Recurrent Neural Network. I will use python programming language for this purpose. The objective of this model is to generate new text, given that some input text is present. Lets start building the architecture.

First, import the required libraries

from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku
import numpy as np

Lets use a popular nursery rhyme — “Cat and Her Kittens” as our corpus. A corpus is defined as the collection of text documents.

data = """The cat and her kittens
They put on their mittens,
To eat a Christmas pie.
The poor little kittens
They lost their mittens,
And then they began to cry.
O mother dear, we sadly fear
We cannot go to-day,
For we have lost our mittens."
"If it be so, ye shall not go,
For ye are naughty kittens."""

There will be three main parts of the code: dataset preparation, model training, and generating prediction. The boiler plate code of this architecture is following:

def dataset_preparation():
def create_model():
def generate_text():

In dataset preparation step, we will first perform Tokenization. Tokenization is a process of extracting tokens (terms / words) from a corpus. Python’s library Keras has inbuilt model for tokenization which can be used to obtain the tokens and their index in the corpus.

tokenizer = Tokenizer()def dataset_preparation(data):    corpus = data.lower().split("\n")    
total_words = len(tokenizer.word_index) + 1

Next, we need to convert the corpus into a flat dataset of sentence sequences.

    input_sequences = []
for line in corpus:
token_list = tokenizer.texts_to_sequences([line])[0]
for i in range(1, len(token_list)):
n_gram_sequence = token_list[:i+1]

Now that we have generated a data-set which contains sequence of tokens, it is possible that different sequences have different lengths. Before starting training the model, we need to pad the sequences and make their lengths equal. We can use pad_sequence function of Kears for this purpose.

    max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences,
maxlen=max_sequence_len, padding='pre'))

To input this data into a learning model, we need to create predictors and label. We will create N-grams sequence as predictors and the next word of the N-gram as label. For example:

Sentence: "they are learning data science"
they | are
they are | learning
they are learning | data
they are learning data | science

Lets write the code for the same

    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
label = ku.to_categorical(label, num_classes=total_words)

Perfect, now we can obtain the input vector X and the label vector Y which can be used for the training purposes. Recent research experiments have shown that recurrent neural networks have shown a good performance in sequence to sequence learning and text data applications. Lets look at them in brief.

Recurrent Neural Networks

Unlike Feed-forward neural networks in which activation outputs are propagated only in one direction, the activation outputs from neurons propagate in both directions (from inputs to outputs and from outputs to inputs) in Recurrent Neural Networks. This creates loops in the neural network architecture which acts as a ‘memory state’ of the neurons. This state allows the neurons an ability to remember what have been learned so far.

The memory state in RNNs gives an advantage over traditional neural networks but a problem called Vanishing Gradient is associated with them. In this problem, while learning with a large number of layers, it becomes really hard for the network to learn and tune the parameters of the earlier layers. To address this problem, A new type of RNNs called LSTMs (Long Short Term Memory) Models have been developed.

Image for post
Image for post

LSTMs have an additional state called ‘cell state’ through which the network makes adjustments in the information flow. The advantage of this state is that the model can remember or forget the leanings more selectively. To learn more about LSTMs, here is a great post. Lets architecture a LSTM model in our code. I have added total three layers in the model.

  1. Input Layer : Takes the sequence of words as input
def create_model(predictors, label, max_sequence_len, total_words):
input_len = max_sequence_len - 1
model = Sequential()
model.add(Embedding(total_words, 10, input_length=input_len))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam'), label, epochs=100, verbose=1)

Great, our model architecture is now ready and we can train it using our data. Next lets write the function to predict the next word based on the input words (or seed text). We will first tokenize the seed text, pad the sequences and pass into the trained model to get predicted word. The multiple predicted words can be appended together to get predicted sequence.

def generate_text(seed_text, next_words, max_sequence_len, model):
for j in range(next_words):
token_list = tokenizer.texts_to_sequences([seed_text])[0]
token_list = pad_sequences([token_list], maxlen=
max_sequence_len-1, padding='pre')
predicted = model.predict_classes(token_list, verbose=0)

output_word = ""
for word, index in tokenizer.word_index.items():
if index == predicted:
output_word = word
seed_text += " " + output_word
return seed_text

Lets train our model using the Cat and Her Kitten rhyme.

X, Y, max_len, total_words = dataset_preparation(data)
model = create_model(X, Y, max_len, total_words)

Model’s Output when the the above model was trained on 100 epochs.

text = generate_text("cat and", 3, msl, model)
print text
>>> "cat and her lost kittens"
text = generate_text("we naughty", 3, msl, model)
print text
>>> "we naughty lost to day"

As we can see, the model has produced the output which looks fairly fine. The results can be improved further with following points:

  1. Adding more data

You can find the complete code of this article at this link. More Reading Links: Link1, Link2. I hope you like the article, please share your thoughts in the comments section.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch

Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore

Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store