Text Generation Using Long Short Term Memory Network
We will train an LSTM network on textual data so that it learns to generate new text in the same style as the training material. Trained on a text corpus, the LSTM learns to produce words similar to those it was trained on, and it typically picks up the grammar of the source data. Similar technology can also power sentence completion as a user types, for example in chatbots.
Importing our dependencies — using tensorflow 2.x
import string
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,LSTM,Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
Reading the data
file = open('t8.shakespeare.txt', 'r')
data = file.read()
file.close()
Text Cleaning
After getting your text data, the first step in cleaning it is to have a clear idea of what you are trying to achieve, and in that context to review your text to see what exactly might help.
The data contains many punctuation marks and numeric characters that we need to get rid of. We also drop the file's opening lines (its preamble) before rejoining the text.
data=data.split('\n')
data=data[253:]
data=' '.join(data)
The cleaner function removes punctuation and numbers from the data and converts all characters to lowercase.
def cleaner(data):
    token = data.split()
    table = str.maketrans('', '', string.punctuation)
    token = [w.translate(table) for w in token]
    token = [word for word in token if word.isalpha()]
    token = [word.lower() for word in token]
    return token

words = cleaner(data=data)
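For instance, running cleaner on a short sample string (this sentence is just an illustration) shows each step's effect: punctuation is stripped, purely numeric tokens are dropped, and everything is lowercased.

```python
import string

# cleaner as defined above
def cleaner(data):
    token = data.split()
    table = str.maketrans('', '', string.punctuation)
    token = [w.translate(table) for w in token]
    token = [word for word in token if word.isalpha()]
    token = [word.lower() for word in token]
    return token

sample = "To be, or not to be: that is the question! Act 3."
print(cleaner(sample))
# → ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question', 'act']
```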
Creating a sequence of words
seed_length is 50 + 1, which means the first 50 words of each window will be my input and the 51st word will be my output. Processing all the data takes a lot of computation power and memory, so I am using only the first 100,000 words to train my neural network.
seed_length = 50 + 1
sentence = list()
for i in range(seed_length, len(words)):
    sequence = words[i-seed_length:i]
    line = ' '.join(sequence)
    sentence.append(line)
    if i > 100000:
        break
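On a toy word list you can see how the sliding window moves one word at a time (the list and the shorter window length here are just for illustration):

```python
words = ['shall', 'i', 'compare', 'thee', 'to', 'a', 'summers', 'day']
seed_length = 3 + 1  # 3 input words plus 1 output word, instead of 50 + 1

# slide a window of seed_length words over the list, one step at a time
sentence = []
for i in range(seed_length, len(words)):
    sentence.append(' '.join(words[i-seed_length:i]))

print(sentence)
# → ['shall i compare thee', 'i compare thee to', 'compare thee to a', 'thee to a summers']
```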
A neural network requires that the input data be integer encoded, so that each word is represented by a unique integer. After fitting the tokenizer, we convert each line of text into a sequence of integers.
tokenizer=Tokenizer()
tokenizer.fit_on_texts(sentence)
sequence=tokenizer.texts_to_sequences(sentence)
sequence=np.array(sequence)
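Conceptually, the Tokenizer builds a 1-based word-to-index map, with the most frequent words getting the smallest indices, and then replaces each word with its index. A minimal pure-Python sketch of that idea (the toy lines are illustrative, not from the corpus):

```python
from collections import Counter

lines = ['to be or not to be', 'be or not']

# build a 1-based word index, most frequent words first
# (this mirrors what Keras' Tokenizer does internally)
counts = Counter(w for line in lines for w in line.split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

# map each line of words to a sequence of integers
sequences = [[word_index[w] for w in line.split()] for line in lines]
print(word_index)  # 'be' occurs most often, so it gets index 1
print(sequences)
```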
Splitting the independent and the target variables
X,y=sequence[:,:-1],sequence[:,-1]
vocab_size=len(tokenizer.word_index)+1
y=to_categorical(y,num_classes=vocab_size)
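With a tiny toy array you can see how the split and the one-hot encoding work (the numbers here are illustrative; np.eye stands in for to_categorical):

```python
import numpy as np

# each row: 3 input word indices followed by the target word index
sequence = np.array([[2, 1, 3, 4],
                     [1, 3, 4, 2]])

X, y = sequence[:, :-1], sequence[:, -1]
vocab_size = 5  # 4 words plus index 0, which the Tokenizer reserves for padding

# one-hot encode the targets, as to_categorical does
y_onehot = np.eye(vocab_size)[y]
print(X.shape, y_onehot.shape)  # → (2, 3) (2, 5)
print(y_onehot[0])              # → [0. 0. 0. 0. 1.]
```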
Creating an LSTM Network
The Embedding layer is defined as the first hidden layer of the network. It requires 3 arguments:
- vocab_size — the size of the vocabulary in the text data.
- output_dim — size of the vector in which words will be embedded.
- input_length — length of the input sequence.
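Under the hood, an embedding layer is just a trainable lookup table: a (vocab_size, output_dim) matrix in which row i holds the vector for word i. A small numpy sketch, with random values standing in for learned weights:

```python
import numpy as np

vocab_size, output_dim, input_length = 10, 4, 3
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, output_dim))

# a batch of two integer-encoded sequences of length input_length
batch = np.array([[2, 1, 3],
                  [1, 3, 4]])

# embedding lookup: each integer becomes its row of the matrix
embedded = embedding_matrix[batch]
print(embedded.shape)  # → (2, 3, 4), i.e. (batch, input_length, output_dim)
```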
model=Sequential()
model.add(Embedding(vocab_size,50,input_length=50))
model.add(LSTM(100,return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100,activation='relu'))
model.add(Dense(vocab_size,activation='softmax'))
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
model.summary()
Training our Model
Train your model for a large number of epochs so that the network learns how to generate words.
model.fit(X,y,batch_size=256,epochs=1000)
The generate function takes a seed of 50 words as input to our model and generates the words that follow.
def generate(text, n_words):
    text_q = []
    for _ in range(n_words):
        encoded = tokenizer.texts_to_sequences([text])[0]
        encoded = pad_sequences([encoded], maxlen=seed_length-1, truncating='pre')
        # model.predict_classes was removed in TF 2.x; take the argmax of the
        # softmax output instead
        prediction = np.argmax(model.predict(encoded), axis=-1)[0]
        predicted_word = ''
        for word, index in tokenizer.word_index.items():
            if index == prediction:
                predicted_word = word
                break
        text = text + ' ' + predicted_word
        text_q.append(predicted_word)
    return ' '.join(text_q)
Using the function to generate the next 100 words:
seed_text = sentence[0]
generate(seed_text, 100)
Thanks for reading! I hope this article was helpful.
Your comments, and claps keep me motivated to create more material. I appreciate you! 😊