Skimming Through Econ Papers: an R Tutorial
In this brief blog entry, I share a quick (and most likely unoriginal) algorithm for skimming through a PDF. Just for fun, I will first share the motivation behind it. Skip to the section “The Word Cloud Algorithm” if you are in a rush; otherwise, grab your drink and keep reading.
The Summer Break
On May 21, I finished the first year of the MSc in Economics (a 2-year program) and was really sick of everything (basically, except for football). But time cures all, right? After a two-month vacation, and with two weeks still to go, I decided to work on the master’s thesis, due in May 2022. Indeed, the topic I liked most was the [apparently] never-ending increase in worker remittances — a huge inflow of USD for Mexico in the past 4–5 years — which had caught my attention for a long while.
However, I had changed my mind over the summer. In fact, one topic that had held my attention for even longer is General Equilibrium analysis. Specifically, I am interested in developing a Dynamic Stochastic General Equilibrium (DSGE) model. Also, the analysis that I want to perform revolves around the labor market. I learned a few models of the labor market in Macro II (i.e., chapter 11 of Romer (2018)) and would like to delve into them as I develop the general equilibrium model. This is a great combination — I believe — as DSGEs are good tools for evaluating monetary policy; also, when applied to the Mexican economy, one wants to consider the ubiquitous phenomenon of labor market informality.
Hence, I soon found myself immersed in a sea of doubts, information, what-the-heck-is-that, what-on-earth-is-this, etc. So I decided to start from the intended end result: what has been written before, in this particular area of knowledge, by other folks pursuing the same program at the same school? I found 7 master’s dissertations that matched my interests. Of course, I was not going to read all 7 works without knowing — beyond a reasonable doubt — that they were relevant to my inquiry.
Thus, I decided to create an algorithm, in R, that could brief me on the main topics of the 7 articles. Of course, reading the abstract is the most direct way, but there has to be more. So this algorithm performs a word count, then creates a word cloud that complements the simple reading of the abstract.
The Word Cloud Algorithm
For this tutorial, you need to gear up. First off, let’s install the required packages:
Accessing the PDF file: package ‘pdftools’
Function pdf_text is part of the PDF utilities that come with pdftools, a powerful package that enables R to read PDF files.
Text mining: package ‘tm’
This package will do the heavy lifting of analyzing the PDF files. We will convert the file into a ‘Corpus’ object — text represented as is, in natural language. This object will be manipulated with the tm_map function, which will trim white spaces, convert all letters to lowercase, remove all punctuation, and remove all ‘stop words’. The latter are a set of the most commonly used words (e.g., “this”, “an”, “the”), and are not to be counted. Finally, this package contains the ‘DocumentTermMatrix’ function, which will create a matrix with the frequencies of every term in the text.
A cloud of words: package ‘wordcloud’
This takes a bunch of words and their frequencies, and returns a nice cloud. This is for presentation purposes only, but is nice to have.
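The three packages above can be installed and loaded with a short setup script. This is a minimal sketch; the install step only needs to run once per machine:

```r
## Install once (uncomment on first run)
# install.packages(c("pdftools", "tm", "wordcloud"))

## Load the packages for this session
library(pdftools)   # pdf_text(): read PDF pages as text
library(tm)         # Corpus(), tm_map(), DocumentTermMatrix()
library(wordcloud)  # wordcloud(): draw the cloud
```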
We have the tools, let us move on.
The first step is to select your desired PDF file. Allow me to pretend that we are interested in this file (https://www.cdc.gov/tobacco/data_statistics/evidence/pdfs/comprehensive-TCP-508.pdf), which is a CDC summary on tobacco control programs. You can check it out; the link is safe. Save it to your working directory, and name it whatever you want. I will name it: CDC_file.pdf
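If you prefer to stay inside R, base R can fetch the file for you. A small sketch, using the URL and file name from this example:

```r
## Download the example PDF into the working directory
url <- "https://www.cdc.gov/tobacco/data_statistics/evidence/pdfs/comprehensive-TCP-508.pdf"
download.file(url, destfile = "CDC_file.pdf", mode = "wb")  # "wb" keeps the binary intact on Windows
```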
myfile <- "CDC_file.pdf"
mytext <- pdf_text(myfile) ## load the document
thebody <- Corpus(VectorSource(mytext)) ## convert the text to a Corpus, the data type that ‘tm’ works with
The next steps will cleanse the text, to make it easy to read, process, and analyze. Even though we love R for its outstanding capabilities, it is still a machine, and does not understand natural language.
thebody <- tm_map(thebody, content_transformer(tolower)) ## all letters to lower case
thebody <- tm_map(thebody, removePunctuation) ## no punctuation
thebody <- tm_map(thebody, stripWhitespace) ## trim extra whitespace
thebody <- tm_map(thebody, removeWords, stopwords("en")) ## remove stop words
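To see what this pipeline is doing under the hood, here is a rough base-R sketch of the same cleaning steps applied to a single toy sentence (the stop-word list below is a toy stand-in, not tm’s real one):

```r
## A rough base-R sketch of what the tm_map pipeline does to one sentence
s <- "This IS an   Example, with Punctuation!"
s <- tolower(s)                        # lower-case everything
s <- gsub("[[:punct:]]", "", s)        # strip punctuation
s <- gsub("\\s+", " ", trimws(s))      # collapse runs of whitespace
words <- strsplit(s, " ")[[1]]
words <- words[!words %in% c("this", "is", "an", "with")]  # toy stop-word list
words  # "example" "punctuation"
```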
Now, we have a more legible version of our text. Let us extract the words and their counts from it:
thebody$content ## extract the words from the body of our text
thematrix <- DocumentTermMatrix(thebody) ## a matrix of term frequencies per document
thematrix <- t(as.matrix(thematrix)) ## transpose so that terms are rows
frequencies <- sort(rowSums(thematrix), decreasing = TRUE) ## and this sorts the words by frequency, from highest to lowest
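The counting logic itself is simple; the same idea can be sketched in base R on a toy word vector, which makes it clear what the sorted frequency object looks like:

```r
## The same counting idea in base R, on a toy word vector
words <- c("tobacco", "control", "tobacco", "programs", "tobacco", "control")
frequencies <- sort(table(words), decreasing = TRUE)
frequencies  # tobacco: 3, control: 2, programs: 1
```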
The final result
Finally, we get to draw our cloud:
wordcloud(head(names(frequencies), 18), head(frequencies, 18), scale = c(2,1), colors = c("black", "blue", "green", "red"))
This is all for this tutorial; I hope it helps someone in the future. In case you do not like what the cloud looks like, use ?wordcloud and be as creative as you wish. In my case, I only wanted to skim through papers as quickly as possible before engaging in any reading.
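For the curious, here are a few of the knobs that ?wordcloud documents; the values below are just illustrative choices, not recommendations:

```r
## A few knobs worth playing with (see ?wordcloud for the full list)
wordcloud(head(names(frequencies), 18), head(frequencies, 18),
          scale = c(3, 0.8),       # largest and smallest font sizes
          min.freq = 2,            # drop words that appear only once
          random.order = FALSE,    # plot the most frequent words in the centre
          rot.per = 0.2,           # share of words rotated 90 degrees
          colors = brewer.pal(8, "Dark2"))  # palette from RColorBrewer, a wordcloud dependency
```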
Remember to wear a mask and get vaccinated.
Thanks for reading. Before you go, please remember: the exercise presented in this article is only intended to illustrate one way in which I would analyze a text. It is neither an exhaustive nor a serious analysis. The opinions expressed in this article are exclusively my own, and represent nothing other than my own views.