NLP Word Prediction Using a Bidirectional LSTM

Mikdat Yücel
Published in Analytics Vidhya
Feb 1, 2021 · 10 min read

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The result is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

In this article, we will talk about how to predict the next word in a poem or story, and we will demonstrate it with a Python implementation.

# First of all, we import the libraries we need
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import numpy as np

Preprocessing

Before beginning to process the data, there are a few concerns to be addressed in the dataset we gathered. These are simple cleaning procedures that make it easier to use the data in subsequent steps, and TensorFlow usually handles them for us (a minimal sketch follows the list). The following are a few preprocessing steps that are usually performed:

  • stripping white spaces
  • lower-case conversions
  • removing numbers
  • removing punctuation
  • removing unwanted words
  • removing non-English words
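
As a minimal sketch, assuming the default Keras Tokenizer settings, lower-casing and punctuation removal are already covered by its lower and filters arguments:

from tensorflow.keras.preprocessing.text import Tokenizer

# By default the Tokenizer lower-cases the text and strips common punctuation,
# so "Hello, World!!" and "hello world" map to the same tokens.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(["Hello, World!!", "hello world"])
print(tokenizer.word_index)  # e.g. {'hello': 1, 'world': 2}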

Tokenization

One of the important normalization methods is called tokenization. It is simply segmenting the continuous running text into individual segments of words. One very simple approach would be to split inputs over every space and assign an identifier to each word. For example,

Tokenizing “It not cool that ping pong is not included in Rio 2016” would produce something like the sketch below.
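
A minimal sketch of that tokenization, assuming the default Keras Tokenizer; the exact indices depend on word frequency and order of appearance:

from tensorflow.keras.preprocessing.text import Tokenizer

sentence = "It not cool that ping pong is not included in Rio 2016"

tokenizer = Tokenizer()
tokenizer.fit_on_texts([sentence])

print(tokenizer.word_index)
# e.g. {'not': 1, 'it': 2, 'cool': 3, 'that': 4, 'ping': 5, 'pong': 6, ...}
print(tokenizer.texts_to_sequences([sentence])[0])
# e.g. [2, 1, 3, 4, 5, 6, 7, 1, 8, 9, 10, 11]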

# We read the data and create a Tokenizer
tokenizer = Tokenizer()
data = open('/tmp/irish-lyrics-eof.txt').read()
# We split the data on line breaks and create the list of sentences (the corpus)
corpus = data.lower().split("\n")
# The "fit_on_texts" method tokenizes each sentence in the corpus
tokenizer.fit_on_texts(corpus)
# Afterwards a word index is created: every unique word in the corpus is assigned an
# index number, like this:
# {'car': 1, 'prison': 2, 'him': 3, 'welcome': 4}
# This is a key-value pair with the key being the word and the value being the token
# for that word
total_words = len(tokenizer.word_index) + 1  # add one because token indices start at 1 (index 0 is reserved for padding)
print(tokenizer.word_index)
print(total_words)

Pad Sequences

Even after converting sentences to numerical values, there’s still an issue of providing equal length inputs to our neural networks — not every sentence will be the same length! There are two main ways you can process the input sentences to achieve this — padding the shorter sentences with zeroes, and truncating some of the longer sequences to be shorter. In fact, you’ll likely use some combination of these.

With TensorFlow, the pad_sequences function from tf.keras.preprocessing.sequence can be used for both of these tasks. Given a list of sequences, you can specify a maxlen (where any sequences longer than that will be cut shorter), as well as whether to pad and truncate from either the beginning or ending, depending on pre or post settings for the padding and truncating arguments. By default, padding and truncation will happen from the beginning of the sequence, so set these to post if you want it to occur at the end of the sequence.
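
For a quick feel for these arguments, here is a minimal sketch (the token values are made up purely for illustration):

from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]

# Default behaviour: pad and truncate at the beginning ('pre')
print(pad_sequences(sequences, maxlen=4))
# [[ 0  1  2  3]
#  [ 0  0  4  5]
#  [ 7  8  9 10]]

# Pad and truncate at the end instead
print(pad_sequences(sequences, maxlen=4, padding='post', truncating='post'))
# [[ 1  2  3  0]
#  [ 4  5  0  0]
#  [ 6  7  8  9]]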

So let's look at the code to take this corpus and turn it into training data. Here's the beginning; we will unpack it line by line. First of all, our training x's will be called input_sequences, and this will be a Python list.

For each line of the corpus, we'll generate a token list using the tokenizer's texts_to_sequences method. This will convert a line of text, such as the sentence below, into a list of tokens representing its words.

Example: "In the town of Athy one Jeremy Lanigan"

[4, 2, 66, 8, 67, 68, 69, 70]

input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]

Then we will iterate over this list of tokens and create a number of n-gram sequences, namely the first two words of the sentence as one sequence, then the first three as another sequence, and so on.

    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

The result of this, for the sentence tokenized above, is that the following input sequences will be generated. The same process happens for each line, but as you can see, the input sequences are simply the sentences being broken down into phrases: the first two words, then the first three words, and so on.
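
Based on the tokenized example above, those sequences are:

[4, 2]
[4, 2, 66]
[4, 2, 66, 8]
[4, 2, 66, 8, 67]
[4, 2, 66, 8, 67, 68]
[4, 2, 66, 8, 67, 68, 69]
[4, 2, 66, 8, 67, 68, 69, 70]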

We next need to find the length of the longest sentence in the corpus. To do this, we’ll iterate over all of the sequences and find the longest one with code like this.

max_sequence_len = max([len(x) for x in input_sequences])

Once we have our longest sequence length, the next thing to do is pad all of the sequences so that they are the same length. We will pre-pad with zeros to make it easier to extract the label.
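
In code, this is the padding step (it also appears in the full listing at the end of the article):

input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))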

Line:
[4 2 66 8 67 68 69 70]

Padded Input Sequences:
[0 0 0 0 0 0 0 0 0 0 4 2]
[0 0 0 0 0 0 0 0 0 4 2 66]
[0 0 0 0 0 0 0 0 4 2 66 8]
[0 0 0 0 0 0 0 4 2 66 8 67]
[0 0 0 0 0 0 4 2 66 8 67 68]
[0 0 0 0 0 4 2 66 8 67 68 69]
[0 0 0 0 4 2 66 8 67 68 69 70]

So now, our line is represented by a set of padded input sequences that look like the example above. Now that we have our sequences, the next thing we need to do is turn them into x's and y's, our input values and their labels. When you think about it, now that the sentences are represented this way, all we have to do is take all but the last token as the x and use the last token as the y, our label.

It works like this: in the first padded sequence above, everything up to and including the 4 is our input, and the 2 is our label.

And similarly, for the second sequence, the input is the first two words and the label is the third word, tokenized to 66.

By this point, it should be clear why we did pre-padding, because it makes it much easier for us to get the label simply by grabbing the last token.

So now, we have to split our sequences into our x's and our y's. To do this, we grab all but the last token of each sequence and make that our x, and then take the last token as our label:

xs = input_sequences[:,:-1]
labels = input_sequences[:,-1]

Before the labels become our y's, there is one more step: one-hot encoding. This really is a classification problem, where we classify which word from the corpus is most likely to come next, so we convert the labels to a categorical representation with Keras, using the total number of words as the number of classes:

ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)

If we consider this list of tokens as a sentence, then the x is the list up to the last value, and the label is the last value, which in this case is 70. The y is a one-hot encoded array whose length is the size of the corpus vocabulary, and the value that is set to one is the one at the index of the label, which in this case is the 70th element.
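
To make the one-hot encoding concrete, here is a tiny sketch of what to_categorical does to a single label (the numbers are only for illustration):

import tensorflow as tf

# With, say, 5 classes, the label 3 becomes a vector with a 1 at index 3
print(tf.keras.utils.to_categorical([3], num_classes=5))
# [[0. 0. 0. 1. 0.]]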

So now we can build a neural network that can, given a sentence, predict the next word.

# Now that we have our data as xs and ys, it's relatively simple for us to create a
# neural network to classify what the next word should be, given a set of words.
model = Sequential()

# Embedding layer:
# 1st parameter: we want it to handle all of our words, so we set that to total_words.
# 2nd parameter: the number of dimensions to use for the word vectors.
# Finally, the size of the input dimension is the length of the longest sequence minus 1.
# We subtract one because we cropped off the last token of each sequence to get the label.
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))

# Bidirectional LSTM:
# A unidirectional LSTM only preserves information from the past, because the only inputs
# it has seen are from the past.
# A bidirectional LSTM runs the inputs in two ways, one from past to future and one from
# future to past. In the LSTM that runs backwards, you preserve information from the
# future, and by combining the two hidden states you are able, at any point in time, to
# preserve information from both past and future.
# We specify 150 units here.
model.add(Bidirectional(LSTM(150)))

# Finally, there is a dense layer sized as the total words, because we have as many
# outputs as the total word count.
model.add(Dense(total_words, activation='softmax'))

adam = Adam(lr=0.01)

# We're doing a categorical classification, so we set the loss to categorical cross entropy,
# and we use the Adam optimizer to minimize it. Optimizers are algorithms or methods used to
# change the attributes of your neural network, such as weights and learning rate, in order
# to reduce the loss.
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])

# earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')

# We also specify the number of epochs. One epoch is when the entire dataset is passed
# forward and backward through the neural network once.
history = model.fit(xs, ys, epochs=500, verbose=1)

# print(model.summary())
print(model)

We can inspect the training loss through the history object returned by model.fit. As you can see, our model's loss decreases to the 0.1 level quickly, within the first 10 epochs, and then continues with small ups and downs.
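
A minimal sketch of how that curve could be plotted, assuming matplotlib is installed (this plotting code is not part of the original listing):

import matplotlib.pyplot as plt

# history is the object returned by model.fit above
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['accuracy'], label='accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()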

To learn more about LSTM you can go to this link

If we want to predict the next 10 words to follow a sentence, we do the following.

# We need to turn our text into a sequence to do the prediction, because the input
# shape must match the one the model was trained on.
token_list = tokenizer.texts_to_sequences([text])[0]
# We also need to pre-pad it, so that the sequence has the same length as the
# training sequences (the longest sentence in the corpus minus one).
token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding="pre")

# For example:
# text = "Laurence went to dublin"
# token_list -> [0, 0, 0, 0, 0, 0, 0, 134, 13, 59]

# By passing our token list into the prediction function we can do the prediction.
# This will give us the token of the word most likely to be the next one in the sequence.
predicted = model.predict_classes(token_list)

And if we run the prediction loop below, we can see that the first predicted next word is ‘old’.

# We can do a reverse lookup on the word index items to turn the token back into a
# word, and add that word to our text.
seed_text = "Laurence went to dublin"
next_words = 10

# With the for loop below, we predict the next words one at a time. The number of
# words to predict is set by next_words above; each iteration adds the newly
# predicted word to our text.
for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    predicted = model.predict_classes(token_list, verbose=0)
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
    seed_text += " " + output_word
print(seed_text)

If we want to predict the next 10 words after the seed_text, we set next_words to 10. Looking at the output, you can see the sentence with the 10 added words: ‘Laurence went to dublin old welcome stranger james some mine she stand your ground’. It doesn’t make much sense, but if we train our model on a much larger corpus we can obtain better results.

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import numpy as np
tokenizer = Tokenizer()
data = open('/tmp/irish-lyrics-eof.txt').read()
corpus = data.lower().split("\n")
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
print(tokenizer.word_index)
print(total_words)
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
# pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
# create predictors and label
xs, labels = input_sequences[:,:-1], input_sequences[:,-1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
# Model build
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))
adam = Adam(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
#earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')
history = model.fit(xs, ys, epochs=100, verbose=1)
#print model.summary()
print(model)
seed_text = "Laurence went to dublin "next_words = 10#Predictionfor _ in range(next_words): token_list = tokenizer.texts_to_sequences([seed_text])[0] token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre') predicted = model.predict_classes(token_list, verbose=0) output_word = "" for word, index in tokenizer.word_index.items(): if index == predicted:
output_word = word
break
seed_text += " " + output_word
print(seed_text)
