Quick Text Pre-Processing

Rob Zifchak · Published in The Startup · Jun 15, 2020 · 5 min read

Making sense of messy tweets

Photo by Jelleke Vanooteghem on Unsplash

As I was working on an LDA (Latent Dirichlet Allocation) project in an effort to make sense of unlabeled tweets, I realized that most of my effort was spent revisiting the pre-processing stages of the problem rather than the modeling itself. Shall I remove all special characters? How about integers? To stem, or to lem, or both? While the specific implementation will vary on a case-by-case basis, there are undoubtedly some common preprocessing techniques that apply to just about any Natural Language Processing task. Below are some of the methods I’ve used.

The Data

As mentioned, I’ll be working with tweets. I’ve pulled 12,000 tweets pertaining to anxiety using the GetOldTweets3 Python library. One of the benefits of GetOldTweets3 is that you do not need to register for OAuth tokens through the Twitter Developer portal (p.s. Twitter Dev Team, I’m still waiting!). I’m not going to go into detail about the library, but I’ve linked it above if you’re interested in checking it out. There are several endpoints and query options, but I’ll be focusing on the ‘text’ of the tweet. Unlike working with a clean document of grammatically correct, punctuated sentences, working with tweets can be a bit of a pain, since they contain lots of special characters, URLs, incoherent ramblings, and so on. This is something to be mindful of throughout the cleaning process. So far, I’ve found iterative trial and error to be effective.
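For context, a pull along these lines is roughly how the dataset can be gathered with GetOldTweets3. Treat it as a rough sketch rather than my exact script; the query term and count simply mirror what I described above:

import GetOldTweets3 as got

# build a query for tweets mentioning "anxiety" and cap the number returned
criteria = (got.manager.TweetCriteria()
            .setQuerySearch('anxiety')
            .setMaxTweets(12000))

# fetch the tweets and keep just the text field
tweets = got.manager.TweetManager.getTweets(criteria)
texts = [tweet.text for tweet in tweets]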

Here’s a text sample of a few tweets:

['Season 4 of 13 Reasons Why is stressing me tf out already. Clay really think he Batman. That’s why he has anxiety now.', '#fear #love #anxiety #covid #horror #life #coronavirus #mentalhealth #art #depression #faith #hope #motivation #fearless #scary #dark #quotes #courage #creepy #corona #peace #poetry #success #stress #selflove #inspiration #terror #halloween #pain #formylus', 'I feel like the 7pm cheer has slowly morphed into the 7pm scream of anxiety.', "if twitter started your original anxiety issues then twitter is the wrong place to get help from unless that's really not what your intent was.Twitter isn't professional help and in fact the things said here can often make a situation worse.", 'Thread: Some beautiful moments from protests I’ve attended in NYC over the past week and a half. This post is not intended to glorify protests or to show off - I only wish to spread a bit of positivity during a time when so many are feeling intense anger, heartbreak, & anxiety.']

Looks pretty messy, let’s get into it!

Cleaning Steps

  1. Remove URLs
  2. Remove stop words and punctuation
  3. Remove remaining non-alphabetical characters
  4. Lemmatize text

Stop Words

While commonly used words provide important grammatical structure for human communication, they do little for machine interpretation. If I were to create a frequency distribution of the words contained in my tweets, it would undoubtedly be clogged up with words like “the, of, who, go, there, was”, etc. To prevent this, and to focus on words that actually provide meaningful context, it is conventional to remove these words from the modeling vocabulary entirely. I’ll be using the standard English stop words from the NLTK (Natural Language Toolkit) library, plus the punctuation characters from the string library.
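Here’s roughly how that list can be assembled (the variable name stop_words is my own):

import string
from nltk.corpus import stopwords

# nltk.download('stopwords')  # only needed the first time
stop_words = set(stopwords.words('english'))

# fold in the punctuation characters from the string library
stop_words.update(string.punctuation)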

Tokenizing

Now that I have my stop word list, I can move forward with tokenizing. There are different tokenizers for different tasks. To start, I’m using NLTK’s word_tokenize() method. On a simple sentence it behaves much like Python’s .split() method, splitting on whitespace, though it also peels off punctuation and contractions (which is why tokens like “'s” and “n't” show up in the output further down).

In: ['Heres is a random sentence to test']
Out: ['Heres', 'is', 'a', 'random', 'sentence', 'to', 'test']
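That output comes from a call along these lines (the punkt tokenizer models only need to be downloaded once):

from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # only needed the first time
word_tokenize('Heres is a random sentence to test')
# ['Heres', 'is', 'a', 'random', 'sentence', 'to', 'test']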

I’ll apply both tokenization and stop word removal with the function below, mapping it over the text column of my pandas dataframe that contains the tweet text.
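A version of that function might look like this (the function and column names are my own; assume df is the dataframe and ‘text’ holds the raw tweet):

from nltk.tokenize import word_tokenize

def tokenize_and_filter(text):
    # lowercase, tokenize, then drop anything in the stop word / punctuation set
    tokens = word_tokenize(text.lower())
    return [token for token in tokens if token not in stop_words]

# stop_words is the set built above from NLTK's English list plus string.punctuation
df['tokens'] = df['text'].apply(tokenize_and_filter)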

The output:

['season', '13', 'reasons', 'stressing', 'tf', 'already', 'clay', 'really', 'think', 'batman', '’'],['fear', 'love', 'covid', 'horror', 'life', 'coronavirus', 'mentalhealth', 'art', 'depression', 'faith', 'hope', 'motivation', 'fearless', 'scary', 'dark', 'quotes', 'courage', 'creepy', 'corona', 'peace', 'poetry', 'success', 'stress', 'selflove', 'inspiration', 'terror', 'halloween', 'pain', 'formylus'],['feel', 'like', '7pm', 'cheer', 'slowly', 'morphed', '7pm', 'scream'],['twitter', 'started', 'original', 'issues', 'twitter', 'wrong', 'place', 'get', 'help', 'unless', "'s", 'really', 'intent', 'was.twitter', "n't", 'professional', 'help', 'fact', 'things', 'said', 'often', 'make', 'situation', 'worse'], ['thread', 'beautiful', 'moments', 'protests', '’', 'attended', 'nyc', 'past', 'week', 'half', 'post', 'intended', 'glorify', 'protests', 'show', 'wish', 'spread', 'bit', 'positivity', 'time', 'many', 'feeling', 'intense', 'anger', 'heartbreak']

There are a few issues here. Integers prepended to a word (or token), such as ‘7pm’, still remain, as do grouped integers like ‘13’, mid-word punctuation such as ‘was.twitter’, and different stylings of single quotes. URLs will also be broken into separate pieces.

Take the following tweet for example:

'#painting. #anxiety #batman. @procreate #pocket. #art #color #paint #iphone #finger #mark #stroke #brush @Manhattan, New York https://www.instagram.com/p/B5LxQgylvx0/?igshid=j68k86tqc9l'

With the standard tokenization and stop word function applied above, the output is:

['painting', 'batman', 'procreate', 'pocket', 'art', 'color', 'paint', 'iphone', 'finger', 'mark', 'stroke', 'brush', 'manhattan', 'new', 'york', 'https', '//www.instagram.com/p/b5lxqgylvx0/', 'igshid=j68k86tqc9l']

Side note: if you’re interested in retaining hashtags or mentions without writing regex, check out NLTK’s TweetTokenizer!

As I don’t care to retain any URLs, I will remove them all before tokenization and stop word removal with the following simple method, which uses regex to target any string beginning with “http” and capture everything that follows it.
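A minimal version of that method could look like this, with a pattern that grabs “http” plus everything up to the next whitespace (names are my own):

import re

def remove_urls(text):
    # strip any substring starting with "http" up to the next whitespace
    return re.sub(r'http\S+', '', text)

df['text'] = df['text'].apply(remove_urls)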

Removing remaining noise

I’ll also remove any words containing integers, such as ‘7pm’, ‘1000mg’, etc., with the code below. This function drops any token that is not purely alphabetical. Depending on the use case this may not be desirable, but here I just want to clear out the remainder of the noise.
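Something like the following works, leaning on Python’s str.isalpha() to keep only purely alphabetic tokens (again, names are my own):

def keep_alpha(tokens):
    # drop any token that isn't purely alphabetical ('7pm', '13', "n't", ...)
    return [token for token in tokens if token.isalpha()]

df['tokens'] = df['tokens'].apply(keep_alpha)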

Lemmatization

In order to further reduce noise in our text data, and get the most accurate frequency distributions possible, cue lemmatization. Lemmatization reduces a word token to its base dictionary form, or lemma. Many words have multiple inflected forms that appear as different tokens while conveying the same meaning. For example, “to turn” may appear as “turned”, “turning”, or “turns”. We want to retain just the base form of the word, as if looking it up in a dictionary. This prevents us from keeping multiple tokens that are really just pointing to one. Unlike stemming, lemmatization will not return non-existent words. For a deeper look at stemming and lemming, check out this DataCamp article.
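One common way to wire this in is with NLTK’s WordNetLemmatizer; a minimal sketch (without POS tagging, so exactly which inflected forms get collapsed will vary):

from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # only needed the first time
lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    # map each token to its WordNet lemma (treated as a noun by default)
    return [lemmatizer.lemmatize(token) for token in tokens]

df['tokens'] = df['tokens'].apply(lemmatize_tokens)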

Let’s take a look at the results for the same five tweets from earlier:

[['season', 'reasons', 'stressing', 'tf', 'already', 'clay', 'really', 'think', 'batman'], ['fear', 'love', 'covid', 'horror', 'life', 'coronavirus', 'mentalhealth', 'art', 'depression', 'faith', 'hope', 'motivation', 'fearless', 'scary', 'dark', 'quotes', 'courage', 'creepy', 'corona', 'peace', 'poetry', 'success', 'stress', 'selflove', 'inspiration', 'terror', 'halloween', 'pain', 'formylus'], ['feel', 'like', 'cheer', 'slowly', 'morphed', 'scream'], ['twitter', 'started', 'original', 'issues', 'twitter', 'wrong', 'place', 'get', 'help', 'unless', 'really', 'intent', 'professional', 'help', 'fact', 'things', 'said', 'often', 'make', 'situation', 'worse'], ['thread', 'beautiful', 'moments', 'protests', 'attended', 'nyc', 'past', 'week', 'half', 'post', 'intended', 'glorify', 'protests', 'show', 'wish', 'spread', 'bit', 'positivity', 'time', 'many', 'feeling', 'intense', 'anger', 'heartbreak']]

While this is a good start, I can already think of a few revisions I’ll want to look into! For one, setting a minimum threshold for token length, somewhere around three characters, to further strip away words that aren’t providing any context; this would also take care of acronyms. I may also want to add some of those short tokens to my stop word list after I do some more EDA. Hopefully I’ll have decent results to report in my next post!
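For the length threshold, one possible version is the one-liner below (the cutoff of three characters is just a starting point to tune):

# keep only tokens at least three characters long
df['tokens'] = df['tokens'].apply(lambda tokens: [t for t in tokens if len(t) >= 3])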

Thanks for reading! I always welcome any thoughts, suggestions, criticisms, or that one-line solution! Feel free to contact me on LinkedIn if you want to connect!
