My First Attempt at an NLP Project for Text Generation: Part 3

Andy Martin del Campo
5 min read · Jan 23, 2020


This is part of a series I started for a recent NLP project. For the character prediction model, go to my other blog post. For my most recent project I wanted to dive into text generation using NLP models. I started by building a data set of tweets I pulled from Twitter myself; for more information on that part, visit my first post. Using the Tweepy Python library to access the Twitter API, I pulled as many tweets as I could from ten well-regarded customer-service-related Twitter accounts. Naturally these were all larger corporations, but after looking through the tweets I was able to determine that they were decent quality and weren't just word garbage like so much of Twitter.

Word cloud showing the most popular words in the tweet data set.

After cleaning and processing the tweets, I put them into a CSV file so my models could access them later. When you look up RNN or LSTM text generation, a common topic comes up: character-level models. While I did make a character-level model, I found the results of its text generation underwhelming. At that point I decided to build a word prediction model: instead of predicting which character would come next, I would create a model that predicted which word would come next.

from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Embedding
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences

The entire process can be done using Keras and a few other libraries, which you can find in my repository for this project; I'll post the link at the end. Once you have your data set, the first step for an NLP model is to tokenize the data. I hadn't used the Keras Tokenizer before, so I went ahead and used it here. It isn't the best tokenizer, and in future projects I think I will stick with NLTK's tokenizer or others. There was a strange issue when I tried to read my CSV file using just an open-file call like this…

file = open('customer_service_data.csv', encoding='utf-8').read()

…and then tokenize the file directly. There is a problem with this method: it wouldn't split the text into word tokens, it treated each individual character as a token instead. Perhaps an idea to play with for my character prediction model, but not this one. The fix I found was to read the CSV file into a pandas data frame and call the tokenizer on that data frame instead. That worked.
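As a rough sketch of what that ends up looking like (the 'text' column name here is a placeholder; the exact code is in the repo):

import pandas as pd
from keras.preprocessing.text import Tokenizer

# Read the cleaned tweets into a data frame ('text' is a placeholder column name)
df = pd.read_csv('customer_service_data.csv')
tweets = df['text'].astype(str).tolist()

# Fitting on a list of strings gives word-level tokens;
# fitting on one big string makes the tokenizer iterate character by character
tokenizer = Tokenizer()
tokenizer.fit_on_texts(tweets)
total_words = len(tokenizer.word_index) + 1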

Once you have the tokens, you then need to turn the text into sequences to create a flat data set; Keras has a function for this, texts_to_sequences. Instead of posting snippets of every piece of code I will just link my repo. The last step before creating the model is to pad the sequences so that they are all the same length. Then your data is ready to train a model.
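To make that step concrete anyway, here is a minimal sketch, assuming the tokenizer and tweets list from above; the repo has the real version:

import numpy as np
from keras.utils import np_utils
from keras.preprocessing.sequence import pad_sequences

# Turn each tweet into a growing set of n-gram sequences,
# e.g. [thanks, for], [thanks, for, reaching], [thanks, for, reaching, out]
input_sequences = []
for line in tweets:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(2, len(token_list) + 1):
        input_sequences.append(token_list[:i])

# Pad everything to the same length so the sequences form one flat array
max_sequence_len = max(len(seq) for seq in input_sequences)
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# The last token of each sequence is the label; the rest is the predictor
predictors, labels = input_sequences[:, :-1], input_sequences[:, -1]
labels = np_utils.to_categorical(labels, num_classes=total_words)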

The Model:

model = Sequential()
model.add(Embedding(total_words, 10, input_length=max_sequence_len - 1))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(total_words, activation='softmax'))

This model has a lot of similarities to the previous blog post's model, with one exception: the Embedding layer. An Embedding layer compresses the input feature space into a smaller, dense one. You can imagine it as a simple matrix multiplication (or lookup) that transforms each word index into its corresponding word embedding, so the layer outputs one embedding vector for each word in the input sequence. In this case, I hoped it would let the model train faster. Beyond that, this isn't anything crazy. It is a sequential model, which you want for text generation since the history of the previous words is what helps you predict the next one. Then there are three LSTM layers that do the heavy lifting, with Dropout layers in between to prevent overfitting. If you aren't sure what LSTM layers are, you may want to refer to my previous blog post.
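The imports above include ModelCheckpoint, so training presumably looks roughly like the sketch below; the loss, optimizer, batch size, and checkpoint filename are my assumptions, not necessarily the settings in the repo.

# Compile and train, checkpointing the best weights since a full run takes hours
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

checkpoint = ModelCheckpoint('word_model_weights.hdf5', monitor='loss',
                             save_best_only=True, verbose=1)

model.fit(predictors, labels, epochs=100, batch_size=128, callbacks=[checkpoint])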

Fitting 100 epochs took around 12 hours, so be ready to wait a while on this one. That is still nothing compared to the several hours each epoch took for the character prediction model. The loss continued to go down, although only marginally towards the end, so you could probably run fewer epochs and get similar results. Once your model is finished training, you are ready to predict the next word and generate some text. The final piece of the puzzle is a function that takes a seed text as input, tokenizes it, pads the sequence, and passes it to the trained model to predict the next words. With that, you can start getting some output! Type in something like 'help' or whatever comes to mind and see what the output is.
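Here is a sketch of what that function can look like; generate_text and its argument names are mine, not the repo's.

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def generate_text(seed_text, next_words, model, tokenizer, max_sequence_len):
    # Predict one word at a time, feeding each prediction back into the seed text
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
        predicted_index = int(np.argmax(model.predict(token_list), axis=-1)[0])
        # Map the predicted index back to its word
        for word, index in tokenizer.word_index.items():
            if index == predicted_index:
                seed_text += ' ' + word
                break
    return seed_text

print(generate_text('help', 10, model, tokenizer, max_sequence_len))

Greedy argmax like this tends to produce repetitive text; sampling from the softmax distribution instead is a common tweak if the output gets stuck in loops.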

This blog post is mostly a summary of what I was trying to accomplish with my latest project. If you find anything interesting or want to chat about any of my work, feel free to reach out. My plan is to learn more about TensorFlow, build models without relying on Keras, add attention to a model, and eventually work up to a transformer. Thanks for reading!

Link to the repo


Andy Martin del Campo

Aspiring Data Scientist with a background in Electrical Engineering and Networking. Passionate about motorcycles and coffee.