Fixing Words with Python
My Journey into Text Analytics: Text Pre-processing
I’ve recently started a new module at university that covers Text Analytics. I’m taught to take on the role of a word technician, which is pretty intriguing.
Text pre-processing is necessary in the grand scheme of text analytics. As easy as it is for humans to read and interpret chunks of text, it’s tougher for computers because they have no concept of language. Context is also key because it determines meaning, such as whether “I like apple” refers to the fruit or the company.
But first, before we try to interpret texts, we need to pre-process them! Here are the stages and definitions, along with examples.
Tokenization
When we receive a bunch of text to analyze, the first thing we have to do is break it into words and punctuation marks, known as tokens. You can also use Python’s split() method for this, though it only splits on whitespace.
from nltk.tokenize import word_tokenize

tweets = ["This year General Elections is really intense!",
          "Wah GE queues sibeh long... #hot #sweaty",
          "I have been queueing for too long!",
          "It will be troubling if youths today do not vote wisely."]
tokenized_tweets = [word_tokenize(tweet) for tweet in tweets]
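To see why word_tokenize is preferred over plain split(), note that split() only breaks on whitespace and leaves punctuation glued to words. A minimal, NLTK-free sketch:

```python
# str.split() breaks on whitespace only, so punctuation stays attached
sentence = "I have been queueing for too long!"
print(sentence.split())
# ['I', 'have', 'been', 'queueing', 'for', 'too', 'long!']
```

word_tokenize would additionally separate the “!” into its own token.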
Text Normalization
We normalize each word so that all words can be processed uniformly, putting them on a level playing field. There are many ways to normalize text, but these are two popular methods.
Stemming
To stem a word is to remove suffixes or prefixes from it. Even though this may produce invalid or unrelated words, such as “troubl” stemmed from “troubling”, stemming usually runs in a shorter time than lemmatization.
import nltk
from nltk.tokenize import word_tokenize

porter = nltk.PorterStemmer()
tweets = ["This year General Elections is really intense!",
          "Wah GE queues sibeh long... #hot #sweaty",
          "I have been queueing for too long!",
          "It will be troubling if youths today do not vote wisely."]
tokenized_tweets = [word_tokenize(tweet) for tweet in tweets]

stemmed_tweets = []
for tweet in tokenized_tweets:
    stemmed_tweets.append([porter.stem(w) for w in tweet])
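To see why stems can end up as invalid words, here is a toy suffix-stripper. This is a hypothetical simplification for illustration, not the actual Porter algorithm:

```python
def naive_stem(word):
    # Strip the first matching common suffix, keeping at least 3 characters
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print(naive_stem("troubling"))  # 'troubl', not a real word
print(naive_stem("queues"))     # 'queu'
```

The real Porter stemmer applies a much more careful sequence of rules, but the core idea of chopping suffixes without consulting a dictionary is the same.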
Lemmatization
To lemmatize is to derive the actual root word (lemma) of a text. For example, the lemma of “troubling” is “trouble”. This is often more accurate because a lexical database (dictionary) is referenced, but it usually results in a longer run time than stemming.
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

WNL = WordNetLemmatizer()
tweets = ["This year General Elections is really intense!",
          "Wah GE queues sibeh long... #hot #sweaty",
          "I have been queueing for too long!",
          "It will be troubling if youths today do not vote wisely."]
tokenized_tweets = [word_tokenize(tweet) for tweet in tweets]

lemma_tweets = []
for tweet in tokenized_tweets:
    lemma_tweets.append([WNL.lemmatize(t, 'v') for t in tweet])
print(lemma_tweets)
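The 'v' argument tells the lemmatizer to treat every token as a verb, which is how “queueing” maps back to “queue”. Conceptually, lemmatization is a dictionary lookup; here is a minimal sketch with a hand-made table (hypothetical, nowhere near WordNet’s coverage):

```python
# A toy verb-lemma table (hypothetical; WordNet covers far more words)
VERB_LEMMAS = {"queueing": "queue", "been": "be", "is": "be", "troubling": "trouble"}

def toy_lemmatize(token):
    # Unknown tokens are returned unchanged, mirroring WordNetLemmatizer
    return VERB_LEMMAS.get(token.lower(), token)

print([toy_lemmatize(t) for t in ["I", "have", "been", "queueing"]])
# ['I', 'have', 'be', 'queue']
```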
Stopwords Removal
Stopwords are words that do not add much meaning to a given sentence, such as “the”, “a” and “I”. They often do not contribute to the overall sentiment of the sentence and are hence removed.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

WNL = WordNetLemmatizer()
stopwords_list = stopwords.words('english')
tweets = ["This year General Elections is really intense!",
          "Wah GE queues sibeh long... #hot #sweaty",
          "I have been queueing for too long!",
          "It will be troubling if youths today do not vote wisely."]
tokenized_tweets = [word_tokenize(tweet) for tweet in tweets]

lemma_tweets = []
for tweet in tokenized_tweets:
    lemma_tweets.append([WNL.lemmatize(t, 'v') for t in tweet])

clean_tweets = []
for t in lemma_tweets:
    clean_tweet = [w for w in t if w.lower() not in stopwords_list]
    clean_tweets.append(" ".join(clean_tweet))
print(clean_tweets)
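The filtering step itself is just a set lookup. A minimal, NLTK-free sketch with a tiny hand-picked stopword list (hypothetical; NLTK’s English list is far longer):

```python
# A tiny hand-picked stopword set (hypothetical; NLTK's English list is far longer)
STOPWORDS = {"i", "have", "been", "for", "too"}

tokens = ["I", "have", "been", "queueing", "for", "too", "long", "!"]
content_words = [w for w in tokens if w.lower() not in STOPWORDS]
print(" ".join(content_words))
# queueing long !
```

Lowercasing each token before the lookup matters, since the stopword lists are lowercase but tweets are not.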
Noise Removal
Noise includes things like hashtags, URLs, emojis and so on. They either do not add value to the sentence or hold no meaning. Removing them helps accentuate the keywords of the sentences in the data source. The following example explores removing hashtags.
import re

hash_tweets = ["Wah GE queues sibeh long... #hot #sweaty"]
for t in hash_tweets:
    t = re.sub(r'\s#[\S]+', "", t)
    print(t)
Later on, I realized that r'\s#[\w]+' also works here, though it means something different. r'\s#[\S]+' removes any run of non-whitespace characters after the # symbol, while r'\s#[\w]+' removes only letters, digits and underscores. Since \S is more comprehensive, we can stick to that!
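The difference shows up when a hashtag ends with punctuation. A quick comparison using a slightly modified tweet (the trailing “!!” is added here purely for illustration):

```python
import re

tweet = "Wah GE queues sibeh long... #hot #sweaty!!"
# \S+ eats any non-whitespace run, so the trailing '!!' is removed too
print(re.sub(r'\s#\S+', '', tweet))  # Wah GE queues sibeh long...
# \w+ stops at characters that are not letters, digits or underscores
print(re.sub(r'\s#\w+', '', tweet))  # Wah GE queues sibeh long...!!
```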
That’s not all, but we’ll take a break here
Text pre-processing helps me understand how computers make sense of words. From this short tutorial, I learned how to fix words in sentences, removing or replacing until we are left with the most valuable ones.
Until next time!