Text Generation Using Long Short Term Memory Network
We will train an LSTM network on textual data so that it learns to generate new text in the same style as the training material. Trained on a text corpus, the LSTM learns to produce words similar to those it was trained on, and it typically picks up the grammar of the source data. Similar technology can also power sentence completion as a user types, for example in chatbots.
Importing our dependencies — using tensorflow 2.x
import string
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,LSTM,Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
Reading the data
file = open('t8.shakespeare.txt', 'r')
data = file.read()
file.close()
Text Cleaning
After getting your text data, the first step in cleaning it is to have a clear idea of what you are trying to achieve, and in that context to review your text to see what exactly might help.
The data contains many punctuation marks and numeric characters that we need to get rid of. We also drop the file's opening lines (its preamble) before rejoining the text.
data=data.split('\n')
data=data[253:]
data=' '.join(data)
The cleaner function removes punctuation and numbers from the data and converts all characters to lowercase.
def cleaner(data):
    token = data.split()
    table = str.maketrans('', '', string.punctuation)
    token = [w.translate(table) for w in token]
    token = [word for word in token if word.isalpha()]
    token = [word.lower() for word in token]
    return token

words = cleaner(data=data)
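For instance, running cleaner on a short sample string (this sentence is just an illustration) shows each step's effect: punctuation is stripped, purely numeric tokens are dropped, and everything is lowercased.

```python
import string

# cleaner as defined above
def cleaner(data):
    token = data.split()
    table = str.maketrans('', '', string.punctuation)
    token = [w.translate(table) for w in token]
    token = [word for word in token if word.isalpha()]
    token = [word.lower() for word in token]
    return token

sample = "To be, or not to be: that is the question! Act 3."
print(cleaner(sample))
# → ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question', 'act']
```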
Creating a sequence of words
seed_length is 50 + 1, which means the first 50 words of each window will be my input and the 51st word will be my output. Processing all the data takes a lot of computation power and memory, so I am using only the first 100,000 words to train my neural network.
seed_length = 50 + 1
sentence = list()
for i in range(seed_length, len(words)):
    sequence = words[i-seed_length:i]
    line = ' '.join(sequence)
    sentence.append(line)
    if i > 100000:
        break
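On a toy word list you can see how the sliding window moves one word at a time (the list and the shorter window length here are just for illustration):

```python
words = ['shall', 'i', 'compare', 'thee', 'to', 'a', 'summers', 'day']
seed_length = 3 + 1  # 3 input words plus 1 output word, instead of 50 + 1

# slide a window of seed_length words over the list, one step at a time
sentence = []
for i in range(seed_length, len(words)):
    sentence.append(' '.join(words[i-seed_length:i]))

print(sentence)
# → ['shall i compare thee', 'i compare thee to', 'compare thee to a', 'thee to a summers']
```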
A neural network requires that the input data be integer encoded, so that each word is represented by a unique integer. After fitting the tokenizer, we convert each line of text into a sequence of integers.
tokenizer=Tokenizer()
tokenizer.fit_on_texts(sentence)
sequence=tokenizer.texts_to_sequences(sentence)
sequence=np.array(sequence)
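Conceptually, the Tokenizer builds a 1-based word-to-index map, with the most frequent words getting the smallest indices, and then replaces each word with its index. A minimal pure-Python sketch of that idea (the toy lines are illustrative, not from the corpus):

```python
from collections import Counter

lines = ['to be or not to be', 'be or not']

# build a 1-based word index, most frequent words first
# (this mirrors what Keras' Tokenizer does internally)
counts = Counter(w for line in lines for w in line.split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

# map each line of words to a sequence of integers
sequences = [[word_index[w] for w in line.split()] for line in lines]
print(word_index)  # 'be' occurs most often, so it gets index 1
print(sequences)
```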
Splitting the independent and the target variables
X,y=sequence[:,:-1],sequence[:,-1]
vocab_size=len(tokenizer.word_index)+1
y=to_categorical(y,num_classes=vocab_size)
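With a tiny toy array you can see how the split and the one-hot encoding work (the numbers here are illustrative; np.eye stands in for to_categorical):

```python
import numpy as np

# each row: 3 input word indices followed by the target word index
sequence = np.array([[2, 1, 3, 4],
                     [1, 3, 4, 2]])

X, y = sequence[:, :-1], sequence[:, -1]
vocab_size = 5  # 4 words plus index 0, which the Tokenizer reserves for padding

# one-hot encode the targets, as to_categorical does
y_onehot = np.eye(vocab_size)[y]
print(X.shape, y_onehot.shape)  # → (2, 3) (2, 5)
print(y_onehot[0])              # → [0. 0. 0. 0. 1.]
```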
Creating an LSTM Network
The Embedding layer is defined as the first hidden layer of the network. It requires 3 arguments:
- vocab_size — the size of the vocabulary in the text data.
- output_dim — size of the vector in which words will be embedded.
- input_length — length of the input sequence.
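Under the hood, an embedding layer is just a trainable lookup table: a (vocab_size, output_dim) matrix in which row i holds the vector for word i. A small numpy sketch, with random values standing in for learned weights:

```python
import numpy as np

vocab_size, output_dim, input_length = 10, 4, 3
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, output_dim))

# a batch of two integer-encoded sequences of length input_length
batch = np.array([[2, 1, 3],
                  [1, 3, 4]])

# embedding lookup: each integer becomes its row of the matrix
embedded = embedding_matrix[batch]
print(embedded.shape)  # → (2, 3, 4), i.e. (batch, input_length, output_dim)
```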
model=Sequential()
model.add(Embedding(vocab_size,50,input_length=50))
model.add(LSTM(100,return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100,activation='relu'))
model.add(Dense(vocab_size,activation='softmax'))
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
model.summary()
Training our Model
Train your model for a large number of epochs so that the network learns how to generate words.
model.fit(X,y,batch_size=256,epochs=1000)
The generate function takes a seed of 50 words as input to our model and generates the words that follow.
def generate(text, n_words):
    text_q = []
    for _ in range(n_words):
        encoded = tokenizer.texts_to_sequences([text])[0]
        encoded = pad_sequences([encoded], maxlen=seed_length-1, truncating='pre')
        # model.predict_classes was removed in TF 2.x; take the argmax of the
        # softmax output instead
        prediction = np.argmax(model.predict(encoded), axis=-1)[0]
        predicted_word = ''
        for word, index in tokenizer.word_index.items():
            if index == prediction:
                predicted_word = word
                break
        text = text + ' ' + predicted_word
        text_q.append(predicted_word)
    return ' '.join(text_q)
Using the function to generate the next 100 words:
seed_text = sentence[0]
generate(seed_text, 100)
Thanks for reading! I hope this article was helpful.
Your comments, and claps keep me motivated to create more material. I appreciate you! 😊