A Tutorial of Text Mining in R Using TM Package

Anyone working in Data Analytics will sooner or later come across Data Mining. Data Mining is about examining large to extremely large amounts of structured and unstructured data to derive actionable insights.

This article is a guide to getting started with Text Mining in R using the TM package. It shows some of what R and its packages have to offer for Text Mining. Anyone with elementary R knowledge can use this article to get started. It takes the reader as far as exploratory data analysis and N-Gram generation.

Important Terms:

Before we dig deep into Text Mining, we need to get familiar with some of the important concepts related to it.

a. TM package: R package for Text Mining [1]

b. Corpus & Corpora: A corpus is a large collection of text, that is, a body of written or spoken material on which a linguistic analysis is based. The plural of corpus is corpora: several such collections of documents containing natural language text. [2]

c. Document Term Matrix (DTM): A Document Term Matrix is a matrix that describes the frequency of the terms that occur in a collection of documents. It has one row per document and one column per term; each cell holds the number of times that term occurs in that document.

d. Stemming: Stemming is the process of reducing words to their base form, which makes analysis easier, e.g. words like win, wins and winning are all converted to, and counted under, their base form win.

e. Stop Words: These are the most common words in a language. They occur very frequently but add little value to text mining, e.g. I, our, they’ll, etc. The English stop word list that ships with the TM package contains 174 words.

f. Bad Words: These are offensive words that need to be removed before we start text mining.
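The Document Term Matrix and stop word concepts above can be illustrated with a small sketch. The three sample documents are made up for illustration, and the `wordLengths` setting is a choice made here so that short terms such as "r" and "is" are not dropped by the TM defaults.

```r
library(tm)

# Three tiny documents, made up for this sketch
docs <- c("text mining is fun",
          "mining text with R",
          "R is fun")

corpus <- VCorpus(VectorSource(docs))

# By default tm ignores terms shorter than 3 characters;
# widen the accepted lengths so that "r" and "is" are kept
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# Rows are documents, columns are terms, cells are term counts
m <- as.matrix(dtm)
print(m)

# A few entries from tm's built-in English stop word list
head(stopwords("english"))
```

Each row of `m` corresponds to one document and each column to one distinct (lower-cased) term.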

With the above introduction and basics in place, let’s get started with implementing Text Mining in R.

Step 1: Install and load the necessary libraries. Of these, TM is R’s text mining package. The other packages are supplementary and are used for reading lines from a file, plotting, preparing word clouds, N-Gram generation, etc.
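A minimal sketch of this step follows. The exact package list is an assumption: tm for text mining, SnowballC for stemming, and wordcloud with RColorBrewer for word clouds. Adjust it to whatever your own analysis needs.

```r
# Assumed package list for this guide; not necessarily the author's exact set
pkgs <- c("tm", "SnowballC", "wordcloud", "RColorBrewer")

# Find out which of the packages are not installed yet
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
}

# Attach every package that is available
invisible(lapply(pkgs[installed], library, character.only = TRUE))
```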

Note: If any of the above libraries is not installed, use install.packages() to install it.

Set constants for values that are used multiple times. This is considered good programming practice.
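As a sketch, such constants might look like the following. All names and values here are hypothetical placeholders, apart from the top 10 cut-off, which matches the N-Gram plots in Step 6.

```r
# Hypothetical constants; rename and adjust to suit your own analysis
INPUT_FILE  <- "text_mining.txt"  # placeholder path to the saved text dump
TOP_N       <- 10                 # how many top terms / N-Grams to report
MIN_FREQ    <- 2                  # minimum frequency for a word to be plotted
RANDOM_SEED <- 1234               # fixed seed so word cloud layouts are repeatable
```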

Step 2: Read the text file contents [3]. Optionally, gather and display basic file attributes, viz. file size, number of lines in the file and number of words in the file.
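A minimal sketch of this step, using base R only. So that it runs on its own, it first writes a two-line sample file to a temporary location; in practice `input_file` (a name chosen here) would point at your saved text dump. The word count is an approximation that splits on whitespace.

```r
# Write a small sample file so the sketch is self-contained;
# replace input_file with the path to your own text file
input_file <- tempfile(fileext = ".txt")
writeLines(c("Text mining is the process of deriving information from text.",
             "It is also referred to as text data mining."),
           input_file)

# Read the file line by line
text_lines <- readLines(input_file, encoding = "UTF-8", warn = FALSE)

# Basic file attributes
file_size_bytes <- file.size(input_file)
line_count      <- length(text_lines)
# Approximate word count: split every line on runs of whitespace
word_count      <- sum(lengths(strsplit(trimws(text_lines), "\\s+")))

cat("Size (bytes):", file_size_bytes, "\n")
cat("Lines:", line_count, "\n")
cat("Words:", word_count, "\n")
```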

Step 3: Create a corpus from the file contents and clean the corpus, for example by removing stop words and bad words and by stemming.
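A minimal sketch of this step with TM. The sample lines and the one-word `bad_words` list are made up for illustration; a real analysis would use the lines read in Step 2 and a proper bad word list. Stop words are removed before punctuation here so that entries containing apostrophes, such as "they'll", still match.

```r
library(tm)

# Sample input, made up for this sketch
text_lines <- c("Text Mining, also called text data mining, derives information from text!",
                "In 2019 the analysts were mining 3 large corpora, darn it.")

# Hypothetical bad word list
bad_words <- c("darn")

# One document per line of text
corpus <- VCorpus(VectorSource(text_lines))

# Cleaning pipeline
corpus_clean <- tm_map(corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("english"))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, bad_words)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)

# Stemming relies on the SnowballC package, so only stem if it is installed
if (requireNamespace("SnowballC", quietly = TRUE)) {
  corpus_stemmed <- tm_map(corpus_clean, stemDocument)
}

# Inspect the cleaned text of the second document
cleaned_second <- content(corpus_clean[[2]])
print(cleaned_second)
```

Keeping both `corpus_clean` and `corpus_stemmed` makes it possible to compare un-stemmed and stemmed results later.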

Step 4: This step illustrates a few basic exploratory data analysis steps that can act as a reference for more detailed exploratory data analysis.
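A minimal sketch of such an analysis with TM: overall term frequencies, terms above a frequency threshold, and terms correlated with a chosen word. The sample documents, the threshold of 2 and the correlation limit of 0.5 are assumptions made for illustration.

```r
library(tm)

# Sample documents, made up for this sketch
docs <- c("Text mining extracts patterns from text.",
          "Data mining and text mining share many methods.",
          "Text analytics is another name for text mining.")

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)

dtm <- DocumentTermMatrix(corpus)

# Overall frequency of every term, most frequent first
term_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
print(head(term_freq, 10))

# Terms that occur at least twice in the whole corpus
print(findFreqTerms(dtm, lowfreq = 2))

# Terms whose per-document counts correlate with those of "mining"
print(findAssocs(dtm, "mining", corlimit = 0.5))
```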

Output is not shown.

Step 5: Visualize the frequency of words occurring in the text file using word clouds. The code for this step generates two word clouds, one for the un-stemmed corpus and one for the stemmed corpus.
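A minimal sketch of this step, assuming the wordcloud, RColorBrewer and SnowballC packages are installed. The sample documents, the helper name `word_freq`, the seed and the colour palette are choices made here for illustration.

```r
library(tm)
library(wordcloud)

# Sample documents, made up for this sketch
docs <- c("Text mining mines text for patterns.",
          "Miners mined large text collections while mining data.",
          "Mining text is fun.")

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)

# Stemmed copy of the corpus (stemDocument needs SnowballC)
corpus_stemmed <- tm_map(corpus, stemDocument)

# Helper: word frequencies of a corpus, most frequent first
word_freq <- function(corp) {
  sort(rowSums(as.matrix(TermDocumentMatrix(corp))), decreasing = TRUE)
}

freq_unstemmed <- word_freq(corpus)
freq_stemmed   <- word_freq(corpus_stemmed)

palette <- RColorBrewer::brewer.pal(8, "Dark2")

set.seed(1234)         # fixed seed so the layout is repeatable
par(mfrow = c(1, 2))   # two clouds side by side

wordcloud(names(freq_unstemmed), freq_unstemmed, min.freq = 1,
          random.order = FALSE, colors = palette)
wordcloud(names(freq_stemmed), freq_stemmed, min.freq = 1,
          random.order = FALSE, colors = palette)
```

Because stemming merges several word forms into one term, the stemmed cloud has at most as many distinct words as the un-stemmed one.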

Step 6: The last step of this guide is to generate N-Grams (unigrams, bigrams and trigrams) and plot histograms of the top 10 most frequent N-Grams.
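A minimal sketch of this step. It builds the tokenizer from `ngrams()` and `words()` in the NLP package, which is attached together with TM, rather than from a dedicated N-Gram package; that is a substitution made here to keep dependencies small. The helper names `make_tokenizer` and `top_ngrams` and the sample documents are hypothetical, and base R `barplot()` is used for the plots.

```r
library(tm)   # also attaches NLP, which provides ngrams() and words()

# Sample documents, made up for this sketch
docs <- c("text mining is fun",
          "text mining with r is fun")
corpus <- VCorpus(VectorSource(docs))

# Build a tokenizer that splits a document into N-Grams of size n
make_tokenizer <- function(n) {
  function(x) {
    unlist(lapply(ngrams(words(x), n), paste, collapse = " "),
           use.names = FALSE)
  }
}

# Frequencies of the most common N-Grams of size n
top_ngrams <- function(corp, n, top = 10) {
  tdm <- TermDocumentMatrix(corp,
                            control = list(tokenize = make_tokenizer(n),
                                           wordLengths = c(1, Inf)))
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  head(freq, top)
}

unigrams <- top_ngrams(corpus, 1)
bigrams  <- top_ngrams(corpus, 2)
trigrams <- top_ngrams(corpus, 3)

# One bar chart per N-Gram size
par(mfrow = c(1, 3), mar = c(9, 4, 2, 1))
barplot(unigrams, las = 2, main = "Top unigrams", ylab = "Frequency")
barplot(bigrams,  las = 2, main = "Top bigrams",  ylab = "Frequency")
barplot(trigrams, las = 2, main = "Top trigrams", ylab = "Frequency")
```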

As a further step, the N-Grams generated above could be used for text mining activities such as word prediction.

References:

a. [1] TM package: https://cran.r-project.org/web/packages/tm/tm.pdf

b. [2] Corpus & Corpora: http://language.worldofcomputing.net/linguistics/introduction/what-is-corpus.html

c. [3] The text file referred to in this guide is a text dump of the following Wikipedia page: https://en.wikipedia.org/wiki/Text_mining


Sanjay Lonkar
Text Mining in Data Science— A Tutorial of Text Mining in R Using TM Package

Data Science Enthusiast. Data Science Specialization certified from Coursera (Johns Hopkins University). Keen to explore more Data Science. Hands-on R experience.