Pre-Processing Tweets for Sentiment Analysis

Mike Erb · Analytics Vidhya · Sep 18, 2020

When doing any Natural Language Processing (NLP), you will need to pre-process your data. In the following example I will be working with a Twitter dataset that is available from CrowdFlower and hosted on data.world.

Review the data

There are many things to consider when choosing how to preprocess your text data, but before you do that you will need to familiarize yourself with your data. This dataset is provided in a .csv file; I loaded it into a dataframe and proceeded to review it.

import pandas as pd

# data_file is the path to the downloaded CSV
dataframe = pd.read_csv(data_file)
dataframe.head()
The first five rows of the dataframe

Just looking at the first five rows, I can notice several things:
* In these five tweets the Twitter handles have been replaced with @mention
* They all have the hashtag #SXSW or #sxsw
* There is an HTML character reference for the ampersand (&amp;)
* There are some abbreviations: hrs, Fri
* There are some people’s real names, in this case public figures

This dataset contains about 8,500 tweets, so I wasn't able to review them all, but by reviewing segments of the dataset I was able to find other peculiarities in the data (a quick way to scan for some of them is sketched after this list).
* There are some url links, some with http or https, and some without
* There are some url links that have been changed to a reference of {link}
* There are some other html characters besides &
* References to a video have been replaced with [video]
* There were many non-English characters
* There were many emoticons
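A quick way to surface patterns like these is to count how many tweets match them. The sketch below is illustrative; it assumes the dataframe loaded earlier and its tweet_text column, and the pattern names are just labels I chose.

# Count how many tweets contain URLs, HTML character references, or placeholders
patterns = {
    'url': r'https?://|www\.',
    'html character reference': r'&[a-z]+;',
    '{link} placeholder': r'\{link\}',
    '[video] placeholder': r'\[video\]',
}
for name, pattern in patterns.items():
    matches = dataframe.tweet_text.str.contains(pattern, regex=True, na=False).sum()
    print(name, matches)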

Decision Time

Now that the data has been reviewed, it’s time to make some decisions about how to process it.

  • Lowercase? It’s common when doing NLP to lowercase all the words so that “Hello”, “hello”, and “HELLO” are treated as the same token. When dealing with tweets, which don’t follow standard capitalization rules, you should pause before lowercasing everything. For example, it’s common to use all caps for emphasis in a tweet, which you would rarely do in a formal sentence. For my dataset I chose to change all the letters to lowercase because I didn’t think the dataset was large enough for keeping more than one version of a word to add much information. Converting all the letters is easily done with pandas’ string methods.
# Work on a copy of the original dataframe
df_clean = dataframe.copy()
df_clean['tweet_text'] = df_clean.tweet_text.str.lower()
  • URL links I didn’t think URLs would help with sentiment analysis, so I wanted to remove them. Removing URLs is not as simple as changing letters to lowercase; it requires regular expressions (regex). I used two expressions: one for URLs with http or https, and a second for URLs without them, with or without www.
import re

# Remove URLs that start with http or https
df_clean.tweet_text = df_clean.tweet_text.apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
# Remove URLs that start with www or end in .com
df_clean.tweet_text = df_clean.tweet_text.apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x))

Writing regular expressions can be tricky; I use regexr.com to test mine out.

  • Placeholders Some text cleaning had already been done on the dataset, replacing some links with {link} and all the videos with [video]. These placeholders don’t seem to add any value for sentiment analysis, so I removed them with regex.
df_clean.tweet_text = df_clean.tweet_text.apply(lambda x: re.sub(r'{link}', '', x))
df_clean.tweet_text = df_clean.tweet_text.apply(lambda x: re.sub(r"\[video\]", '', x))
  • HTML reference characters I don’t think these add any value to the analysis, so they should also be removed.
df_clean.tweet_text = df_clean.tweet_text.apply(lambda x: re.sub(r'&[a-z]+;', '', x))
  • Non-letter characters I decided to get rid of every character that wasn’t a letter, a hash mark, or punctuation commonly used in emojis. There were a few non-English characters in the tweets; I didn’t think they would add to the analysis, so I removed them. Numbers typically add little, if any, information, so I removed them as well. Punctuation that isn’t usually associated with an emoji also needed to go because it didn’t add anything. I did all of this with one regex.
df_clean.tweet_text = df_clean.tweet_text.apply(lambda x: re.sub(r"[^a-z\s\(\-:\)\\\/\];='#]", '', x))
  • Twitter handles Prior to the text pre-processing stage, I had changed all the Twitter handles to @mention to help protect people’s privacy. Because this dataset has been public for years I wasn’t adding much protection, but when creating new datasets an attempt to anonymize the data should be made. Since I had already changed them all to @mention, it was easy to remove them, again because they added little information.
df_clean.tweet_text = df_clean.tweet_text.apply(lambda x: re.sub(r'@mention', '', x))
  • Real names These are another privacy issue. If this were a brand new dataset, I think they should be removed prior to publishing the data. In this case, where the data has been public for years, I left them in the dataset.
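If you did need to mask known names before publishing a new dataset, a minimal sketch might look like this; the names in the list are hypothetical, not drawn from this dataset, and @name is just a placeholder I chose.
import re
KNOWN_NAMES = ['jane doe', 'john smith']  # hypothetical public figures, not from this dataset
name_pattern = re.compile('|'.join(re.escape(name) for name in KNOWN_NAMES))
df_clean.tweet_text = df_clean.tweet_text.apply(lambda x: name_pattern.sub('@name', x))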
  • Abbreviations Depending on your analysis, you may want to handle abbreviations differently from full words. From the sample above there are hrs and Fri. You may want to keep them as is, or you may want hrs to be treated the same as hours when doing sentiment analysis. For my analysis I used the WordNetLemmatizer from the NLTK package. It does not handle these abbreviations, and I didn’t think it would add enough value to write a function that searches for abbreviations and converts them to full words; that would be a very large undertaking given how many abbreviations exist.
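For illustration only, a small hand-built mapping could cover the most common cases; the entries below are examples, and I didn’t apply this step in my own pipeline.
ABBREVIATIONS = {'hrs': 'hours', 'fri': 'friday', 'mins': 'minutes'}  # illustrative entries only
def expand_abbreviations(text):
    """Replace whole-word abbreviations with their full forms."""
    return ' '.join(ABBREVIATIONS.get(word, word) for word in text.split())
# df_clean.tweet_text = df_clean.tweet_text.apply(expand_abbreviations)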

Tokenize the text

Now to tokenize the pre-processed tweets. There are a variety of ways to do this, but I chose to use the TweetTokenizer from NLTK. It knows to keep emojis together as single tokens and to keep hashtags intact. I had already removed all the Twitter handles, but if I hadn’t, it would have kept each handle, the @ plus the name, as one token.

from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
df_clean['tokens'] = df_clean['tweet_text'].apply(tknzr.tokenize)

Let’s compare the cleaned tweet_text with the tokenized version of a segment of the data.

cleaned tweet_text vs. tokens

The tokenizer returned a list of strings for each tweet. You can see that hashtags are kept, e.g. #sxsw; words with dashes or apostrophes in them are kept whole, e.g. pop-up; other punctuation becomes its own token, e.g. the colon at the end of tweet #515; and emojis are kept intact, e.g. the :) in tweet #514.
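As a made-up illustration of those behaviors (not a row from the dataset):

from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
print(tknzr.tokenize("loving the pop-up party at #sxsw :)"))
# ['loving', 'the', 'pop-up', 'party', 'at', '#sxsw', ':)']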

More Cleaning

Because I didn’t remove the punctuation that is commonly used in emojis earlier, now that the emojis have been tokenized it’s time to remove the leftover punctuation.

I first created a punctuation list and then applied a filter to each list of tokens, keeping only the tokens that are not in the punctuation list. This method leaves intact any punctuation that is part of a word or an emoji.

import string

PUNCTUATION_LIST = list(string.punctuation)

def remove_punctuation(word_list):
    """Remove punctuation tokens from a list of tokens"""
    return [w for w in word_list if w not in PUNCTUATION_LIST]

df_clean['tokens'] = df_clean['tokens'].apply(remove_punctuation)

More Decisions

  • Do you want to keep the stop words? According to a study, removing stop words from tweets when doing sentiment analysis degrades classification performance. I decided to trust this study’s findings and didn’t remove the stop words.
  • Do you want to stem or lemmatize the words? For some of the models I ran, I used the WordNetLemmatizer mentioned above and for some I didn’t. I got mixed results when lemmatizing the words in the tweets, but that may be due to the contents or size of my dataset.
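For the models where I did lemmatize, the step looked roughly like the sketch below; the lemmas column name is just my choice here, and the WordNet data needs to be downloaded once.

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # only needed the first time
lemmatizer = WordNetLemmatizer()
df_clean['lemmas'] = df_clean['tokens'].apply(lambda tokens: [lemmatizer.lemmatize(t) for t in tokens])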

Model Time

I don’t go through the modeling steps here, but after the above text pre-processing I was ready to train some classifiers.

If you would like to check out my full project and additional code, it’s hosted on GitHub.
