Text Analysis with R for Students of Literature
Our ability to access, process, and analyze large quantities of data has been increasing at a dizzying pace over the last few years. This data-driven revolution is fundamentally changing many professional and academic fields. Many people, especially long-term practitioners in the humanities and similar disciplines, find this change worrying, and in many ways exactly contrary to the spirit of these disciplines. Poring over long and demanding texts, while internalizing them and becoming personally immersed in them, seems to be at the very core of what these disciplines are all about. And yet, as both a lover of the humanities and a die-hard techie, I find this latest development incredibly exciting.
The title of this short book makes it eminently clear who the intended audience is: students of literature who are interested in using R for textual analysis. R is a very powerful programming language used for statistical analysis. Textual analysis is a very prominent aspect of modern data science, so there are many well-known and established tools and techniques that can help one with this task. However, the aim of this book is not to teach R or programming, but to give literature students just the most basic tools needed to do some relatively straightforward textual analysis. The book jumps straight into the examples almost from the very first page. The obvious virtue of this approach is that you can start doing some interesting work rather quickly, and as long as your own research doesn’t depart dramatically from the examples given in the book, you should be able to use the book as a reference and a primer for your own work. However, if you are working on somewhat more demanding problems, then after finishing this book you might want to turn to a specialized book on R programming that will give you enough of a foundation to tackle a larger variety of problems.
The book takes the freely available text file of “Moby Dick” and runs a variety of textual analyses on it: simple word counts and word frequencies, correlations between various “special” words, context analysis, etc. In the later chapters it moves from a single book to a corpus of books for a more interesting look at themes across many texts. I found the last chapter on topic modeling especially fascinating, but way too brief. I guess I will now have to take a look at other sources to learn more about this line of analysis.
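To give a flavor of the simplest of these analyses, here is a minimal word-frequency sketch in base R. It is an illustration only, not code taken from the book: the invented sample sentence stands in for the full text of the novel, and the variable names (`sample_text`, `words`, `freqs`, `rel_freqs`) are assumptions made for this example.

```r
# Illustrative sketch: an invented sentence stands in for the full novel.
sample_text <- "The whale surfaced. The crew saw the whale and the captain saw the crew."

# Lowercase the text and split it on runs of non-word characters.
words <- unlist(strsplit(tolower(sample_text), "\\W+"))
words <- words[words != ""]  # drop any empty strings left by the split

# Count occurrences of each word, sorted from most to least frequent.
freqs <- sort(table(words), decreasing = TRUE)

# Relative frequency of each word as a percentage of all words.
rel_freqs <- 100 * freqs / sum(freqs)

# Show the most frequent words.
print(head(freqs, 5))
```

To run the same steps on a real text, one would presumably read the file into a character vector first (for example with `readLines` or `scan`) and collapse it into a single string before splitting.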
This book is very pedagogical in its style. The author often presents two different solutions to a particular problem: one using a single compact but hard-to-understand R command, and another broken down into several self-contained chunks. I find this approach very educational and helpful.
Even though this is primarily a book intended for literature students, I would actually strongly recommend it to anyone interested in text mining, text analysis and natural language processing. It is a very gentle and approachable introduction to the whole world of textual analysis.
**** Electronic version of the book provided for review purposes. ****
Originally published at www.tunguzreview.com on July 14, 2015.