My First Attempt at an NLP Project for Text Generation: Part 2

Andy Martin del Campo
4 min read · Jan 23, 2020


I recently embarked on my first NLP project. I was fascinated by the accomplishments of OpenAI’s GPT-2 Transformer. If you aren’t familiar with it, here is their homepage. It can do so many things, and it all looks effortless. I thought that maybe, with a little background and the code from the repo, I could code up something similar. Wrong: that thing is a monster. It was, after all, trained on 40GB of text data. I can’t match that on my own machine and hope to get something similar, so I decided to start small and work my way up. Here is a link to my previous post.

import numpy
import sys
import re
from nltk.tokenize import TweetTokenizer   # tokenizer built with Tweets in mind
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords           # NLTK's list of common filler words
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Embedding
from keras.utils import np_utils            # for one-hot encoding the targets

After looking into text generation and NLP models, I wanted to find a starting point. That is when I found several blog posts and code for an RNN character-prediction model. Now, an RNN is like a cousin of a Transformer, but I had to learn somewhere. Essentially, what I was trying to do was take in a data set of Tweets and predict a few words after a seed text. If only anything was ever that simple…

# split the raw corpus of Tweets into individual tokens
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(corpus)

Any NLP model has to start with data the computer can understand. In the case of NLP, that means tokens: basically, number representations of words. Using NLTK’s TweetTokenizer and its list of stopwords, I was on my way to training my first model.
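Here is a rough sketch of the kind of filtering I did with that stopword list; the variable names (clean_tokens, corpus_text) are just illustrative, not the exact ones from my notebook.

# remove common filler words and glue what is left back into one long string of text
stop_words = set(stopwords.words('english'))
clean_tokens = [token.lower() for token in tokens if token.lower() not in stop_words]
corpus_text = " ".join(clean_tokens)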

Now, to predict the next letter, the model has to be trained on all the sequences of letters; we know these as words. The model needs to have seen a word before and, based on the context leading up to it, pick the letter with the highest probability of coming next. After creating a dictionary that converts each character into an integer so the computer understands it, I made a list of every sequence the letters came in, along with the character that followed each one, which is what the model tries to predict. There ended up being close to half a million sequences.
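To give you an idea of what that looks like, here is a simplified version; the sequence length of 100 and the variable names are illustrative, and the exact values are in the notebook.

# map every character to an integer and slice the corpus into training sequences
chars = sorted(set(corpus_text))
char_to_int = {c: i for i, c in enumerate(chars)}

seq_length = 100  # how many characters the model sees before predicting the next one
X_data, y_data = [], []
for i in range(len(corpus_text) - seq_length):
    seq_in = corpus_text[i:i + seq_length]
    X_data.append([char_to_int[c] for c in seq_in])
    y_data.append(char_to_int[corpus_text[i + seq_length]])

print("Total sequences:", len(X_data))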


There are a few other data preparation parts, but you can find those in the notebook dedicated to this model; the snippet below sketches the important ones. The basic idea is to feed in sequences of letters and train the model to predict what letter will come next, based on the patterns you have fed in.
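In short, the inputs get reshaped into the (samples, time steps, features) form that Keras LSTM layers expect, and the targets get one-hot encoded. Again, this is a sketch rather than the exact notebook code.

# reshape to (samples, time steps, features), scale to 0-1, and one-hot encode the targets
X = numpy.reshape(X_data, (len(X_data), seq_length, 1))
X = X / float(len(chars))
y = np_utils.to_categorical(y_data)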

I am going to assume that you have some experience with neural networks and that this is not your first rodeo. With that said, let’s get into RNNs, or recurrent neural networks. RNNs can remember prior inputs from earlier time steps, while vanilla neural networks cannot, which makes them useful for text processing: they can keep track of different parts of a series of inputs. LSTMs, or Long Short-Term Memory networks, are a kind of RNN. Plain RNNs suffer from a vanishing gradient problem: the ability to preserve the context of earlier inputs degrades over time, and irrelevant data accumulates and crowds out the relevant data. LSTMs deal with the vanishing gradient problem by choosing to forget information deemed unnecessary, so they can focus on the data that matters.

You can build a model in more ways than I would like to think about; this is just the setup that I decided to go with. You can add more layers or fewer layers, and so on, but your model may not converge. This model already takes several hours to train, but feel free to play around with it on your own. Remember to save your weights after you have trained your model, so that you can come back and look at its results without waiting for it to train again.
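Before the weight-saving snippet below, here is a minimal version of the kind of stacked LSTM I used; the layer sizes, dropout rates, and training settings are illustrative, not my tuned values.

# a minimal character-level LSTM; layer sizes, dropout, and epochs are illustrative
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

model.fit(X, y, epochs=20, batch_size=128)

The softmax layer at the end has one output per character in the dictionary, which is what lets the model assign a probability to every possible next character.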

#saves model weights so that you don't have to run the model again
filename = "character_model_weights_saved.hdf5"
model.save_weights(filename)
print("saved model weights")

After the model is trained, there has to be a second dictionary to turn the number outputs back into characters, so that the model outputs text and not numbers. Although sometimes you still get a lot of random numbers, since they were in the data set. Fun! This is the first model in my most recent project, and in my next blog post I will show you how to make a word-prediction model.
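A sketch of that reverse mapping and a simple generation loop is below; the seed handling and the 200-character output length are simplified for illustration.

# map the integers back to characters so the predictions are readable text
int_to_char = {i: c for c, i in char_to_int.items()}

# start from a random seed sequence and generate 200 characters, one at a time
start = numpy.random.randint(0, len(X_data) - 1)
pattern = list(X_data[start])
generated = ""
for _ in range(200):
    x = numpy.reshape(pattern, (1, len(pattern), 1)) / float(len(chars))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    generated += int_to_char[index]
    pattern.append(index)
    pattern = pattern[1:]

print(generated)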

The outputs above show the seed text and then the model’s response to that seed. As you can see, they aren’t that bad, and they are real words, but they could be better. Here is a link to the next post.

