Data Preprocessing in R

I have recently got my hands dirty with Natural Language Processing (NLP). I know, I’m a little late to the party, but at least I’m at the party!

To start with a general overview, I implemented quite a few NLP tasks, including Text Classification, Document Similarity, Part-of-Speech (POS) Tagging, and Emotion Recognition. These tasks were made possible by text pre-processing (noise removal, stemming) and by converting text to features (TF-IDF, N-Grams, Topic Modeling, etc.). I implemented these in both R and Python, so I will jot down my experiences in both environments as a blog series, where each post discusses one particular thing implemented in one particular environment.

This post is going to be an account of conducting data pre-processing and visualizing the words that are used most frequently in TED Talks.

About the dataset

It contains information about all the audio-video recordings of TED talks uploaded to the official TED.com website until 21st September 2017. There are two major files in the dataset. The first has information about the talks, such as number of views, comments, speakers, and titles. The second consists of the transcripts of those talks.

So, let’s dive in!

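set.seed(5152)
# Read both files, keeping text columns as plain character vectors
ted_main <- read.csv('ted_main.csv', stringsAsFactors = FALSE)
transcripts <- read.csv('transcripts.csv', stringsAsFactors = FALSE)
# Join talk metadata to transcripts on the shared 'url' column
ted_talks <- merge(x = ted_main, y = transcripts, by = 'url')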

Setting a seed is optional, but it is good practice: it makes any random step reproducible for everyone who runs the code. I am also merging the two files on the ‘url’ column, which is present in both files and uniquely identifies a talk.
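To see what merging on ‘url’ does, here is a minimal sketch with two tiny made-up data frames. Only the `url` column name comes from the dataset; `main_toy`, `transcripts_toy` and all the values are invented for illustration. By default, `merge` keeps only the rows whose key appears in both data frames.

```r
# Toy stand-ins for ted_main.csv and transcripts.csv (all values are made up)
main_toy <- data.frame(
  url   = c("talk/a", "talk/b", "talk/c"),
  views = c(100, 250, 75),
  stringsAsFactors = FALSE
)
transcripts_toy <- data.frame(
  url        = c("talk/a", "talk/c"),
  transcript = c("Hello world", "Good morning"),
  stringsAsFactors = FALSE
)

# Inner join on 'url': talk/b has no transcript, so it is dropped
merged_toy <- merge(x = main_toy, y = transcripts_toy, by = "url")
print(merged_toy)

# Sanity check that 'url' is still a unique key after merging
stopifnot(!anyDuplicated(merged_toy$url))
```

If you would rather keep talks that have no transcript, `merge` also accepts `all.x = TRUE`, which fills the missing transcript with `NA`.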

Moving on to pre-processing.

library(tm)
library(SnowballC)
corpus <- VCorpus(VectorSource(ted_talks$transcript))
##Removing Punctuation
corpus <- tm_map(corpus, content_transformer(removePunctuation))
##Removing numbers
corpus <- tm_map(corpus, removeNumbers)
##Converting to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
##Removing stop words
corpus <- tm_map(corpus, content_transformer(removeWords), stopwords("english"))
##Stemming
corpus <- tm_map(corpus, stemDocument)
##Whitespace
corpus <- tm_map(corpus, stripWhitespace)

With the steps above, our data pre-processing is complete. Most of the steps are self-explanatory: we remove punctuation and numbers, convert the text to lowercase, remove stop words, stem the remaining words, and collapse extra whitespace. Stemming is the process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) so that related word forms reduce to a common stem. The tm package in R provides methods for data import, corpus handling, data pre-processing, creation of term-document matrices, and so on. The SnowballC package is used for stemming.
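To make the effect of each step visible, here is a small sketch that runs the same kind of cleaning on a single made-up sentence instead of a corpus. It assumes the tm and SnowballC packages are installed, and it calls tm’s functions directly on a character string. The sentence is invented, and whitespace is collapsed before stemming here only so that the string can be split into words.

```r
library(tm)
library(SnowballC)

# A made-up sentence, not taken from the TED transcripts
text <- "I am running 3 experiments, and the talks were amazingly inspiring!"

text <- removePunctuation(text)                  # drop "," and "!"
text <- removeNumbers(text)                      # drop "3"
text <- tolower(text)                            # "I" becomes "i"
text <- removeWords(text, stopwords("english"))  # drop "i", "am", "and", ...
text <- stripWhitespace(text)                    # collapse the gaps left behind

# Split into words and stem each one
words <- strsplit(trimws(text), " ")[[1]]
stems <- wordStem(words, language = "english")
print(data.frame(word = words, stem = stems))
```

Printing the word next to its stem is a quick way to check whether stemming is too aggressive for your use case before applying it to the whole corpus.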

We will now visualize our pre-processed data.

# Create Document Term Matrix
dtm <- DocumentTermMatrix(corpus)
# Removing all terms whose sparsity is greater than 95%
# (`corpus` is reused here and from now on holds the reduced DTM)
corpus <- removeSparseTerms(dtm, 0.95)

There are two important things in the code block above: the creation of the Document Term Matrix (DTM) and the removal of sparse terms. The DTM is a matrix that records how often each word occurs in each document of the corpus. The documents form the rows of the matrix and the terms form the columns, so the entry for a given row and column is the number of times that term occurred in that document (zero if it never occurs). To illustrate, say I have two documents:

Doc 1: I study in National University of Singapore

Doc 2: I live in Singapore

Their DTM would look like:

[Figure: A sample DTM]
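The matrix for these two documents can also be built by hand. The sketch below uses only base R rather than tm, so it shows the idea and not how `DocumentTermMatrix` is implemented. The names `docs`, `tokens`, `vocab` and `dtm_toy` are mine. Note also that tm’s default settings drop words shorter than three characters, so terms like “i”, “in” and “of” would not appear in a DTM built with the defaults.

```r
docs <- c(
  doc1 = "I study in National University of Singapore",
  doc2 = "I live in Singapore"
)

# Lowercase each document and split it into words
tokens <- strsplit(tolower(docs), " ")

# Vocabulary: every distinct term, in alphabetical order
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document,
# then transpose so that documents are rows and terms are columns
dtm_toy <- t(sapply(tokens, function(words) table(factor(words, levels = vocab))))
print(dtm_toy)
```

Each row is a document and each column a term. “live” is 0 for doc1 and 1 for doc2, while “singapore” is 1 for both.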

The next important concept that I want to discuss is the removal of sparse terms.

When I check what’s in my DTM, I get this:

<<DocumentTermMatrix (documents: 2467, terms: 81626)>>
Non-/sparse entries: 1146779/200224563
Sparsity : 99%
Maximal term length: 51
Weighting : term frequency (tf)

It says that the matrix has 2467 rows and more than 80000 columns, and that it is 99% sparse. I found a fantastic explanation of what sparse terms are and why sparsity needs to be reduced through this link. To sum it up here, a 99% sparse matrix means that 99% of the values in the matrix are zero, while 0% sparsity means the matrix has no zero values and is densely populated. In my previous code block, I passed 0.95 to the removeSparseTerms function, which drops terms whose own sparsity is above roughly 95%, that is, terms that appear in fewer than about 5% of the talks. The result is self-evident:

<<DocumentTermMatrix (documents: 2467, terms: 1783)>>
Non-/sparse entries: 797530/3601131
Sparsity : 82%
Maximal term length: 13
Weighting : term frequency (tf)

The number of columns was reduced to 1783 and the sparsity went down from 99% to 82%. This means that a smaller share of the values in my matrix are zero.
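The sparsity figures can be reproduced by hand. The following is a base-R sketch of the idea behind removing sparse terms, not tm’s actual implementation. The toy matrix, the term names, the `sparsity` helper and the 0.8 threshold are all made up for illustration.

```r
# Toy DTM: 4 documents (rows) x 5 terms (columns); counts are made up
m <- matrix(
  c(2, 0, 0, 1, 0,
    1, 0, 0, 0, 0,
    3, 1, 0, 0, 0,
    1, 0, 2, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("talk", "idea", "ocean", "robot", "zebra"))
)

# Overall sparsity: share of cells that are zero
sparsity <- function(x) sum(x == 0) / length(x)
print(sparsity(m))

# Per-term sparsity: share of documents in which the term does not occur
term_sparsity <- colMeans(m == 0)
print(term_sparsity)

# Keep only the terms whose sparsity is below the threshold
threshold <- 0.8
m_reduced <- m[, term_sparsity < threshold, drop = FALSE]
print(sparsity(m_reduced))

# The same arithmetic on the entry counts reported for the full DTM
print(200224563 / (1146779 + 200224563))
```

The last line uses the “Non-/sparse entries” counts from the first summary above, and the ratio rounds to the reported 99%.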

I will now proceed to visualize the highest occurring terms in my corpus.

library(data.table)
library(ggplot2)
library(ggthemes)
# Total count of each term across all documents
# (`corpus` holds the reduced DTM at this point)
colS <- colSums(as.matrix(corpus))
doc_features <- data.table(name = names(colS), count = colS)
# Plot only the terms that occur more than 8500 times
ggplot(doc_features[count > 8500], aes(name, count)) +
  geom_bar(stat = "identity", fill = "lightblue", color = "black") +
  theme_economist() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Clearly, the highest frequency words are:

[Figure: High frequency words in the corpus]
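If you only want the top terms as numbers rather than as a plot, sorting the column sums is enough. This is a small base-R sketch: `term_counts` is a made-up vector standing in for the real column sums, and the counts and the cutoff of 50 are invented.

```r
# Made-up term totals standing in for the column sums of the DTM
term_counts <- c(peopl = 120, like = 95, ocean = 12, robot = 30, think = 88)

# Sort from most to least frequent and keep the top three
top_terms <- head(sort(term_counts, decreasing = TRUE), 3)
print(top_terms)

# Same filter style as the plot: keep only terms above a cutoff
frequent <- term_counts[term_counts > 50]
print(frequent)
```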

There are a few more pre-processing steps that can be done, like removing extremely rare words or the most commonly occurring words, but the steps above cover the majority of pre-processing needs. The only thing that I feel could be personalized is the sparsity threshold, and that depends on your use case. The exact value hardly matters if you just want to understand what sparsity is, but for something like Text Classification, which I will cover in subsequent posts, sparsity matters a lot. A dense matrix would carry a lot more information about your corpus than a sparse matrix. Even my Mac (MacBook Pro, i5 2.6 GHz, 8 GB RAM) used to crash while running Text Classification on a highly sparse matrix. Anyhow, I will discuss these issues in much more detail in upcoming posts.

Thanks for reading and stay tuned!