My First Attempt at an NLP Project for Text Generation: Part 1

Andy Martin del Campo
4 min read · Jan 23, 2020


For my most recent project, I wanted to dive into the deep world of NLP (pun intended). The idea for my project was simple: build a coherent text generator trained on tweets that I pulled from Twitter. I got the idea after learning how incredibly well GPT-2 and similar Transformer models can generate human-like text. Sounds pretty simple, right? I mean, you can access their code, so everything they did should be pretty straightforward. Wrong. It's easy to get lost in all of the classes and functions and have no idea where to start. So this is where I started my project, more or less from the ground up. I wanted to build up not only my models but also my experience with Keras and TensorFlow.

With any decent project, you need a decent data set. For this part of the process, I decided to go with Twitter. Twitter is an interesting place for building a data set: there are some well-thought-out tweets, and at the other end of the spectrum there is word garbage, possibly produced by other text-generating bots. Twitter's search API is accessible from Python through the Tweepy library. For all the details on how I created my data set, here is the notebook. It covers everything from creating your app on the Twitter Development page to saving the data set into a CSV file for later use.
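The setup boils down to authenticating with the keys from your Twitter app and building an API object. Here is a minimal sketch of that step; the key values are placeholders, and wait_on_rate_limit is just a convenient Tweepy option for pausing when the rate limit is hit.

import tweepy as tw

# Credentials come from the app you create on the Twitter Development page
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# Authenticate and build the API object; wait_on_rate_limit makes Tweepy
# sleep automatically whenever Twitter's rate limits are reached
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)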

My first few attempts at creating a data set began with me searching for keywords like "big data" or "data science." Originally, I just wanted to make a bot that could produce human-like text, but I ran into several issues with this data set. The first issue was that even though I could pull 10,000+ tweets with these keywords, there was no way to tell whether a tweet came from a leader in the data science community or from someone who just happened to use a keyword alongside some other undesirable words (this is the word garbage I was referring to earlier). The second issue showed up in the actual text generation: the models would get stuck in a loop of predicting "data science big data," almost as if they had learned what I searched for and just printed my searches back at me.
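For reference, that first keyword-based pull looked roughly like the sketch below; the exact query strings and tweet counts are illustrative rather than copied from my notebook.

# Early approach: search by keyword rather than by account
keywords = ['"data science"', '"big data"']

keyword_tweets = []
for keyword in keywords:
    tweets = tw.Cursor(api.search, q=keyword, lang='en').items(5000)
    keyword_tweets += [tweet.text for tweet in tweets]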

I went back to the drawing board on how I wanted to generate text. A model is only ever as good as the data it is trained on, so I set out to find good, wholesome Twitter accounts that I could rely on to provide complete tweets that were coherent and well developed. I turned to the customer service side of Twitter and looked into ten well-known customer service accounts that respond to customers with issues or just tweet out on their own. These larger companies would not want their tweets to contain errors or other problems, so I felt a lot more confident in this new data set.

twitter_accounts = ["from:XboxSupport", "from:UPSHelp", "from:JetBlue", "from:NikeSupport", "from:ComcastCares",
                    "from:AmazonHelp", "from:Zappos_Service", "from:AskTarget", "from:Tesco", "from:Lululemon"]

However, there is a limit to the number of tweets you can pull from a given Twitter account. Twitter has its reasons for this, largely protecting its users, so I just worked with what I could get. Looping through these ten accounts and pulling as many tweets as the API would allow, I ended up with about 8,000 decent tweets.

tweet_list = []
for account in twitter_accounts:
    # Pull up to 2,000 English-language tweets matching each "from:" query
    tweets = tw.Cursor(api.search,
                       q=account,
                       lang='en').items(2000)
    new_tweets = [tweet.text for tweet in tweets]
    tweet_list += new_tweets
    print(account)

Maybe some other day I will return to my Twitter notebook, find more accounts with a similar customer service tweeting regimen, and build an even larger data set. But after training models and fine-tuning the data set, I was ready to solidify this part of my project and move on. A word cloud gave me extra confidence that I was working towards what would hopefully be coherent, customer-service-like text generation.
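Generating the word cloud is a short step. Here is a minimal sketch, assuming the wordcloud and matplotlib packages and the tweet_list built above:

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Join every tweet into one string and let WordCloud size words by frequency
all_text = " ".join(tweet_list)
cloud = WordCloud(width=800, height=400,
                  background_color="white",
                  stopwords=STOPWORDS).generate(all_text)

plt.figure(figsize=(12, 6))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()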

The largest words, like "please" and "hi," made me feel comfortable that the text generation would be roughly what I was looking for, or at least that it wouldn't be the data set's fault when everything fell apart. With this part of the project finished for now, I saved the data into a CSV file (a quick sketch of that step is below) and moved on to the next part, my first model. Here is the link to the next blog.
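Saving the tweets is a one-liner with pandas; the file name here is just a placeholder.

import pandas as pd

# Persist the scraped tweets so the modeling notebook can load them later
pd.DataFrame(tweet_list, columns=["tweet"]).to_csv("customer_service_tweets.csv", index=False)

That CSV is the starting point for the modeling work in the next post.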

