Language Modelling and Text Generation using LSTMs — Deep Learning for NLP

from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku
import numpy as np
data = """The cat and her kittens
They put on their mittens,
To eat a Christmas pie.
The poor little kittens
They lost their mittens,
And then they began to cry.
O mother dear, we sadly fear
We cannot go to-day,
For we have lost our mittens."
"If it be so, ye shall not go,
For ye are naughty kittens."""
def dataset_preparation():
    pass

def create_model():
    pass

def generate_text():
    pass
tokenizer = Tokenizer()

def dataset_preparation(data):
    # Split the raw text into lines and build the word index
    corpus = data.lower().split("\n")
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1

    # Convert each line into its n-gram sequences
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)

    # Pre-pad every sequence to the same length
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences,
                               maxlen=max_sequence_len, padding='pre'))

    """
    Sentence: "they are learning data science"
    PREDICTORS              | LABEL
    they                    | are
    they are                | learning
    they are learning       | data
    they are learning data  | science
    """
    # The last token of each sequence is the label; the rest are predictors
    predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len, total_words
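To make the pre-padding step concrete, here is a tiny stand-alone illustration (the word indices below are made up, not taken from this corpus): pad_sequences with padding='pre' left-pads shorter n-gram sequences with zeros so they all reach the same length.

# Illustration only, with hypothetical word indices
demo = pad_sequences([[1, 2], [1, 2, 3], [1, 2, 3, 4]], maxlen=5, padding='pre')
# demo:
# [[0 0 0 1 2]
#  [0 0 1 2 3]
#  [0 1 2 3 4]]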
The LSTM architecture uses the following layers:

  1. Input Layer : Takes the sequence of words as input
  2. LSTM Layer : Computes the output using LSTM units. I have added 150 units in the layer, but this number can be fine-tuned later.
  3. Dropout Layer : A regularisation layer which randomly turns off the activations of some neurons in the LSTM layer. It helps in preventing overfitting.
  4. Output Layer : Computes the probability of the best possible next word as output
def create_model(predictors, label, max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    # Embedding layer maps each word index to a 10-dimensional vector
    model.add(Embedding(total_words, 10, input_length=input_len))
    model.add(LSTM(150))
    model.add(Dropout(0.1))
    # Softmax over the vocabulary gives the probability of the next word
    model.add(Dense(total_words, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    model.fit(predictors, label, epochs=100, verbose=1)
    return model
def generate_text(seed_text, next_words, max_sequence_len, model):
    for _ in range(next_words):
        # Encode the current seed text and pad it like the training data
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1,
                                   padding='pre')
        predicted = model.predict_classes(token_list, verbose=0)

        # Map the predicted index back to its word and append it to the seed
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text
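Note: predict_classes was removed from Sequential models in newer TensorFlow/Keras releases. If the call above fails on your version, an equivalent replacement (an assumption about your setup, not part of the original code) is to take the argmax of the softmax output:

# On newer TensorFlow/Keras, replace predict_classes with:
predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)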
X, Y, max_len, total_words = dataset_preparation(data)
model = create_model(X, Y, max_len, total_words)

text = generate_text("cat and", 3, max_len, model)
print(text)
>>> "cat and her lost kittens"

text = generate_text("we naughty", 3, max_len, model)
print(text)
>>> "we naughty lost to day"
The results can be improved further by:

  1. Adding more data
  2. Adding more LSTM layers (see the sketch after this list)
  3. Fine-tuning the network
  4. Running it for more epochs
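As a rough sketch of points 2 and 3 (this is not the model trained above; the extra layer size and the patience value are arbitrary choices to tune), a deeper variant could stack LSTM layers and use the already-imported EarlyStopping callback to stop training once the loss plateaus:

# A sketch only: stacked LSTMs need return_sequences=True on all but the last
def create_deeper_model(predictors, label, max_sequence_len, total_words):
    model = Sequential()
    model.add(Embedding(total_words, 10, input_length=max_sequence_len - 1))
    model.add(LSTM(150, return_sequences=True))
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    model.add(Dense(total_words, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # Stop early if the training loss stops improving for 10 epochs
    model.fit(predictors, label, epochs=500, verbose=1,
              callbacks=[EarlyStopping(monitor='loss', patience=10)])
    return model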

Shivam Bansal, Data Scientist and Kaggle Kernels GrandMaster. Blog: www.shivambansal.com | Kaggle: https://www.kaggle.com/shivamb