Natural Language Processing: Cleaning up Tweets

Indrani Banerjee
Published in CodeX · 10 min read · Dec 4, 2022

Twitter has been dominating the news recently, so I thought I'd take this opportunity to write about my first stab at a natural language processing project. As part of my Springboard Data Science Bootcamp, my final capstone project was to build a classification model to differentiate between disaster and non-disaster related tweets. By the time I came to work on my capstone project, I'd spent hours on Data Camp and LinkedIn courses learning a myriad of strategies for handling text data. So I wanted to use this project to really put my data wrangling skills to the test.

The Data

I used a dataset from Kaggle comprising 7613 tweets, of which 4342 were labelled as not being about a disaster and 3271 were labelled as being about a disaster. Roughly two thirds of the tweets had a location tagged, and the vast majority (99%) had an associated keyword. Before diving into the tweets themselves, I had a look at the distribution of the missing data: I wanted to see whether a specific class was missing specific fields, for example whether more non-disaster tweets were missing location tags than disaster tweets. This yielded no meaningful results: the missing data seemed to be random. I always find it easiest to visualise these problems, so I created a bar chart of data availability (see below), which illustrates the random distribution of the missing data. It was also a relief to see that every tweet in the dataset was labelled, so I didn't have to reduce the size of the dataset any further.
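For a sense of how such an availability chart can be put together, here's a minimal sketch. The keyword, location, and target column names come from the Kaggle dataset; the file name train.csv is an assumption.

import pandas as pd
import matplotlib.pyplot as plt

tweets = pd.read_csv('train.csv')  # assumed filename for the Kaggle dataset

# share of tweets with a keyword / location present, split by class (1 = disaster)
availability = tweets.groupby('target')[['keyword', 'location']].apply(lambda g: g.notna().mean())
availability.plot(kind='bar')
plt.show()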

Keywords

7552 tweets had keywords tagged in the dataset. What's interesting is that out of those 7552 keywords, only 221 were unique! So, for the first step of the data wrangling process, I focused on cleaning up the keywords, as I wanted to explore them in more detail.

How to clean text data? Welcome to the wonderful world of regular expressions.

Regular expressions are sequences of characters, letters or punctuation, that define a search pattern. We can use them to search through the tweets for matching patterns. Python has a built-in module called re which is very useful here. I'd also recommend unidecode (check out the documentation), as it's a simple way of converting Unicode text to its closest ASCII representation.

import re
import unidecode
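For instance, here's a quick illustration of both, using the imports above (the sample strings are made up):

sample = "Forest fire near La Ronge #wildfire"

# find everything that looks like a hashtag
print(re.findall(r'#\w+', sample))   # ['#wildfire']

# transliterate accented characters to plain ASCII
print(unidecode.unidecode('café'))   # cafe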

I first removed unwanted characters: exclamation marks, accents, brackets, commas, and semicolons, to name a few.

Here are examples of the regular expressions I used to clean up the keywords; have a look at my notebook if you want more details.

def clean_keywords(keyword):
    # the keywords come URL-encoded, so '%20' stands in for a space
    cleaned = re.sub(r'%20', ' ', keyword)
    return cleaned

def remove_accents(keyword):
    # transliterate accented characters to their closest ASCII equivalents
    cleaned = unidecode.unidecode(keyword)
    return cleaned

def remove_punctuation(keyword):
    # replace punctuation and other stray symbols with spaces
    cleaned = re.sub(r"[!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n -' ]", " ", keyword)
    return cleaned
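Applied to the keywords, that looks roughly like this (the column name follows the Kaggle dataset; the exact chaining is in my notebook):

keywords['keyword'] = (keywords['keyword']
                       .apply(clean_keywords)       # '%20' -> space
                       .apply(remove_accents)       # strip accents
                       .apply(remove_punctuation))  # drop stray punctuation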

Then, I made a countplot of the keywords used by each class (take a look at the countplots below).

import matplotlib.pyplot as plt
import seaborn as sns

# order the keywords by how strongly they lean towards the disaster class
keywords['keyword_mean'] = keywords.groupby('keyword')['target'].transform('mean')

fig = plt.figure(figsize=(8, 40), dpi=100)
sns.countplot(y=keywords.sort_values(by='keyword_mean', ascending=False)['keyword'],
              hue=keywords.sort_values(by='keyword_mean', ascending=False)['target'])
plt.savefig('figures/keywords_distributions_before_stemming.png',
            bbox_inches='tight', facecolor='white', transparent=None)
plt.show()

If you take a closer look at the keywords on the plot on the right, it quickly becomes obvious that there are some repetitions, such as wildfire and wild fires. A possible solution to this problem is to stem the words. Stemming is when we reduce words to their stem, or root form. An advantage of this is that similar words get grouped together: suicide bomb, suicide bombing, and suicide bomber would all be grouped under suicide bomb.

So, I used Porter Stemmer from NLTK to stem the keywords.

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
keywords['stems'] = keywords['keyword'].apply(lambda x: stemmer.stem(x))

Left: keywords prior to stemming. Right: keywords after stemming.

Looking at the two figures, we can see that after stemming the number of keywords is reduced, but a few are still not quite fixed: wildfir and wild fir, for example, are still categorised as two separate keywords. A solution here is to go back and remove the spaces between words, or to combine these groups manually. As I had such a small number of keywords, I had a quick scroll through the words and decided to move on, as there weren't many other groups that could be combined. I then decided to have a look at visual representations of the 100 most frequent keywords in each group. A fun and easy way of visualising this is with word clouds!

from collections import Counter
from wordcloud import WordCloud
import pandas as pd

def make_dict(tup, dictionary):
    # turn a list of (word, count) tuples into a dictionary
    for x, y in tup:
        dictionary.setdefault(x, []).append(y)
    return dictionary

count = Counter(list(keywords[keywords['target'] == 1]['keyword']))
top_words = {}
make_dict(count.most_common(100), top_words)

df = pd.DataFrame.from_dict(top_words, orient='index').reset_index()
df.columns = ['word', 'count']
text = df['word'].values

wordcloud_keywords = WordCloud(background_color='white', width=500, height=300,
                               collocations=True).generate(str(text))
plt.figure(figsize=(8, 8), dpi=100)
plt.imshow(wordcloud_keywords)
plt.axis('off')
plt.title('Disaster Keywords', fontsize=22)
plt.savefig('figures/wordcloud_disaster_keywords.png', bbox_inches='tight',
            facecolor='white', transparent=None)
plt.show()

There are definitely some commonalities in the choice of keywords between the two classes, but we also see keywords that appear far more frequently in one class than the other: body bags, for example, appears a lot more frequently as a keyword for non-disaster tweets than for disaster tweets. A couple of words (debris and wreckage) only appear as keywords for disaster tweets, whilst aftershock only appears as a keyword in non-disaster tweets. Ultimately, I decided there wasn't sufficient difference in the keywords to develop a classification model on them alone, so I dropped the keywords column for the rest of the project. A change I'd have liked to make here would be to not drop the column but to combine the keywords with the tweet text.

Location

Roughly two thirds of the tweets had a location tagged, with roughly 34% missing from each class. Out of the 5080 location tags, 3341 were unique. A closer look showed that locations could be entered as countries, cities, or abbreviations, sometimes with dates, and even contained punctuation such as dashes, commas, and hashtags.

Again, using regular expressions and lambda functions, I reused the accent and punctuation helpers from earlier, then removed numbers and extra whitespace.

location['location'] = location['location'].apply(lambda x: remove_accents(x))
location['location'] = location['location'].apply(lambda x: remove_punctuation(x))

def remove_nums(location):
    # strip out any digits
    cleaned = re.sub(r'\d+', '', location)
    return cleaned

location['location'] = location['location'].apply(lambda x: remove_nums(x))

def remove_extra_w_space(location):
    # collapse repeated whitespace and trim the ends
    cleaned_text = re.sub(r"\s+", " ", location).strip()
    return cleaned_text

location['location'] = location['location'].apply(lambda x: remove_extra_w_space(x))

This reduced the number of unique locations from 3341 to 3106 out of the 5080 entries. Of these, only 82 locations were common to both classes. This could simply be because 5080 entries aren't really a large enough sample to represent the demographics of Twitter users. Counting the words the two classes share also has an obvious problem: with a larger sample size, the number of common words would increase, not necessarily because the classes are more similar but simply because there are more words in the dataset. So, I decided to use cosine similarity instead.

What is cosine similarity? If you are comfortable with dot products of vectors, think of cosine similarity as the cosine of the angle between two vectors. Considering the graph of y = cos(x) (below), we can see that as x increases towards π/2, or 90 degrees, cos(x) decreases from 1 to 0. This means that if the angle between two vectors is close to zero, cos(x) will be close to 1, meaning the two vectors are similar to each other. However, if cos(x) is closer to 0, the two vectors are nearly perpendicular to each other, meaning they are dissimilar.

So, how do similarities between vectors relate to the location entries? Check out this article, which provides a very in-depth explanation of cosine similarity. Imagine that every word gets its own 'axis'. We can then plot the frequency with which each class uses each word; this is called vectorising the words. Instead of looking at the lengths of the lines, or the magnitudes of the vectors, we measure the angle between them.

An example of cosine similarity in Natural Language Processing
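As a toy example (the numbers are made up): suppose the disaster class uses 'fire' three times and 'flood' once, while the non-disaster class uses 'fire' twice and 'flood' four times. Treating those counts as vectors:

import numpy as np

a = np.array([3, 1])   # disaster counts:     [fire, flood]
b = np.array([2, 4])   # non-disaster counts: [fire, flood]

cos_sim = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_sim, 3))  # 0.707, i.e. moderately similar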

Take a look at my notebook for more details, but here's the snippet of code I used to vectorise the location information and calculate the cosine similarity:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity

# corpus holds two collections of cleaned locations, one per class (built earlier in the notebook)
vectorizer = MultiLabelBinarizer()
mlb = vectorizer.fit(corpus)
df = pd.DataFrame(mlb.transform(corpus), columns=mlb.classes_,
                  index=['Disaster', 'Non Disaster'])

cosine_similarity(df)

This gives us a cosine similarity of 0.2897 between the two classes. What does this mean? 0.2897 is closer to 0 than to 1, which tells us that the two groups of locations, those of disaster tweets and those of non-disaster tweets, are not very similar.

Finally, the Tweets!

So, full disclosure: there was a lot more I'd have liked to do here, but for the sake of deadlines, and juggling a full-time job, I decided not to pursue those areas yet. Hopefully I'll get to come back to this one day.

Let me begin by saying there's a lot of information packed into tweets. So we have to find a balance between extracting information and cleaning the data, in an almost cyclical process. For example, hashtags are followed by text: if we removed all punctuation like we did with the locations and keywords, we'd lose the '#'s.

#, @ and 😊!

I wrote functions to count the number of mentions, hashtags and emojis used in the tweets, then grouped the counts by class to see if there were obvious patterns in their usage.

def hash_count(string):
    # count words that start with '#', i.e. hashtags
    words = string.split()
    hashtags = [word for word in words if word.startswith('#')]
    return len(hashtags)

tweets['num_hashtags'] = tweets['text'].apply(hash_count)
tweets['hashtags'] = tweets['text'].apply(lambda x: [x for x in x.split(" ") if x.startswith("#")])
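Mentions were counted the same way; here's a sketch of what that function might look like (the real one is in my notebook):

def mention_count(string):
    # count words that start with '@', i.e. Twitter mentions
    words = string.split()
    mentions = [word for word in words if word.startswith('@')]
    return len(mentions)

tweets['num_mentions'] = tweets['text'].apply(mention_count)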

For emojis I used the emoji package:

import emoji

def emoji_count(tweet):
    # convert emojis to text codes like '__smiling_face__', then count them
    tweet = emoji.demojize(tweet, delimiters=('__', '__'))
    pattern = r'_+[a-z_&]+_+'
    return len(re.findall(pattern, tweet))

tweets['emojis'] = tweets['text'].apply(emoji_count)

A quick visualisation (below) showed that non-disaster tweets had higher counts of hashtags, mentions, and emojis, although the distributions of these counts were broadly similar across the two classes.
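If you just want the raw numbers rather than a plot, a quick groupby gives the gist (num_hashtags and emojis are the columns created above; num_mentions is from my sketch):

# average hashtag, mention and emoji counts per class (target: 1 = disaster)
print(tweets.groupby('target')[['num_hashtags', 'num_mentions', 'emojis']].mean())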

Colloquialisms

Then came the challenge of internet-speak! I used this dictionary of abbreviations, which was a life saver, to expand common abbreviations. Have a look at the dictionary: it's pretty complete and I highly recommend it. I also used the contractions package, which helped expand common contractions.

import contractions

def expand_contractions(tweet):
    # e.g. "can't" -> "cannot", "we're" -> "we are"
    return contractions.fix(tweet)

tweets['text'] = tweets['text'].apply(lambda x: expand_contractions(x))
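The abbreviations were handled along similar lines, swapping each slang token for its expansion. Here's a minimal sketch with a tiny, made-up subset of the dictionary linked above:

# hypothetical subset of the abbreviation dictionary
abbreviations = {'lol': 'laughing out loud', 'smh': 'shaking my head', 'ppl': 'people'}

def expand_abbreviations(tweet):
    # replace whole words that appear in the dictionary, leave everything else alone
    return ' '.join(abbreviations.get(word.lower(), word) for word in tweet.split())

tweets['text'] = tweets['text'].apply(expand_abbreviations)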

Then, I used this to help remove emoticons, symbols, pictographs, transport and map icons, and flags from the tweets.

def demojize(tweet):
    # strip emoji and other pictographic Unicode ranges from the tweet
    emojis = re.compile("["
                        u"\U0001F600-\U0001F64F"  # emoticons
                        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                        u"\U0001F680-\U0001F6FF"  # transport & map symbols
                        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                        u"\U00002702-\U000027B0"
                        u"\U000024C2-\U0001F251"
                        "]+", flags=re.UNICODE)
    cleaned_text = emojis.sub(r'', tweet)
    return cleaned_text

tweets['text'] = tweets['text'].apply(lambda x: demojize(x))
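The rest of the clean-up, removing mentions, hashtags, URLs, HTML, numbers, stray punctuation and extra whitespace, then lowercasing, followed the same pattern of small regular-expression substitutions. A rough sketch, not the exact notebook code:

def clean_tweet(tweet):
    tweet = re.sub(r'https?://\S+|www\.\S+', ' ', tweet)  # urls
    tweet = re.sub(r'<.*?>', ' ', tweet)                   # html tags
    tweet = re.sub(r'@\w+|#\w+', ' ', tweet)               # mentions and hashtags
    tweet = re.sub(r'\d+', ' ', tweet)                     # numbers
    tweet = re.sub(r'[^\w\s]', ' ', tweet)                 # leftover punctuation
    tweet = re.sub(r'\s+', ' ', tweet).strip()             # extra whitespace
    return tweet.lower()

tweets['text'] = tweets['text'].apply(clean_tweet)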

Almost there! Now all contractions and abbreviations are expanded, the symbols we could predict have been cleaned, and we've removed all punctuation, accents, mentions, hashtags, HTML text, URLs, numbers, and the extra white space left behind by the cleaning process. Finally, I changed all the tweets to lowercase. The next logical step is to tokenize the words, but before doing so I took some time for another round of exploratory data analysis, to see whether the cleaned words could shed some insight into the different choices of language between the two classes.

I looked at the number of words, the different types of nouns, and the lengths of the words used by the two classes, and concluded that there still wasn't a great difference between them. So I used these features not to help develop the classification model, but to help form a customised list of stop words. Stop words are words that don't carry much importance, such as 'a', 'the', and 'then', to name a few. I wanted a custom list because the dataset, at roughly 7000 tweets, is rather small, so a smaller and more specific list of unimportant words made sense. The idea is to identify the stop words and remove them from the corpus, the entire vocabulary, so that the machine learning algorithms can focus on the important words when building the classification models.

My EDA highlighted that words with two or fewer characters didn't seem to differ between the two classes, so I first removed those words, along with any empty space left behind by their removal.
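A short regular expression takes care of that; again, a sketch rather than the exact notebook code:

def remove_short_words(tweet):
    # drop words of two characters or fewer, then tidy up the leftover spaces
    tweet = re.sub(r'\b\w{1,2}\b', '', tweet)
    return re.sub(r'\s+', ' ', tweet).strip()

tweets['text'] = tweets['text'].apply(remove_short_words)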

Then, I used the Count Vectorizer (check out the documentation) on the corpus. As this was my first attempt at NLP, I decided to also try the TFIDF Vectorizer to learn more about their differences. My goal was to identify the most common words that we could use to create our stop-word list. This took some time, mainly because I wanted to see the scope of these vectorizers; check out my notebook. Essentially, the final step of the cleaning process is to remove the stop words identified. I'll go into more detail later about tokenizing and lemmatizing the tweets and how I created my binary classification models, but for now, I hope this gives you some good ideas on how you might want to wrangle social media text like tweets!
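To round things off, here's roughly what that frequency-counting step looks like with the Count Vectorizer (a sketch, not the exact notebook code; the stop-word list itself was then put together from output like this):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(tweets['text'])

# total count of each word across all tweets, from most to least frequent
totals = counts.sum(axis=0).A1
freq = sorted(zip(vectorizer.get_feature_names_out(), totals),
              key=lambda pair: pair[1], reverse=True)
print(freq[:20])  # candidates for the custom stop-word list

Swapping in the TFIDF Vectorizer is a one-line change, since it shares the same interface.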
