Elementary, My Dear Watson! An Introduction to Text Analytics Using Sherlock Holmes Stories

by Michael Fire (originally published on dato.com)

As a data scientist, I find analyzing text corpora to be one of the more interesting tasks I get to do. By analyzing various text sources, we can learn a lot about the world around us. In this blog post you’ll find some of my favorite resources for learning about text analysis, along with my own example tutorial using the Sherlock Holmes stories.

Here are some of my favorite blog posts that do this:

Recently, someone asked me, “How can I start learning about NLP?” I recommended that he start by reading about the subjects mentioned above and try to solve several of Kaggle’s competitions, such as Bag of Words Meets Bags of Popcorn and the StumbleUpon Evergreen Classification Challenge. Additionally, I decided to write the following IPython notebook, which will hopefully help him and other developers like him enter the world of NLP.

In this notebook, “Text Analytics Tutorial using Sherlock Holmes Stories,” I present a practical way to learn how to analyze large text collections. We start by downloading Sir Arthur Conan Doyle’s collection of Sherlock Holmes stories. We then use Python’s regular expression module and the NLTK package to perform a very simple analysis of the stories, such as counting the number of sentences and counting how many times a specific word appears across all the stories. We then move on to named entity recognition (NER): I demonstrate how to reconstruct Sherlock’s social network using the characters’ names, taken either from Wikipedia or from the Stanford Named Entity Recognizer software. Next, we move to topic models, using GraphLab Create’s Topic Model Toolkit and pyLDAvis, where I demonstrate how to analyze paragraphs in the Sherlock Holmes stories. Lastly, I show how Word2Vec can be used to find similarly styled paragraphs.
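To give a feel for the first steps described above, here is a minimal sketch (not the notebook’s actual code) that counts sentences, counts occurrences of a word, and builds a rough character co-occurrence network. It uses only the standard library, so the sentence splitter is a naive regular expression standing in for NLTK’s sent_tokenize. The sample text, the CHARACTERS list, and the helper names split_sentences, count_word, and cooccurrence_edges are all made up for illustration, and linking two characters whenever they share a sentence is an assumption about how such a network could be built.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny made-up sample text (an assumption, not taken from the stories).
TEXT = (
    "Holmes looked at Watson. Watson smiled! "
    "Later, Holmes met Lestrade at Baker Street.\n\n"
    "Lestrade thanked Holmes. Was Watson there? Nobody knew."
)

# Hypothetical character list; the notebook gets names from Wikipedia
# or from the Stanford Named Entity Recognizer instead.
CHARACTERS = ["Holmes", "Watson", "Lestrade"]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    This mis-splits abbreviations such as "Mr."; a trained tokenizer
    handles those cases better.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_word(text, word):
    """Count case-insensitive whole-word occurrences of `word`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def cooccurrence_edges(text, names):
    """Count how often each pair of names appears in the same sentence."""
    edges = Counter()
    for sentence in split_sentences(text):
        present = sorted(n for n in names if count_word(sentence, n) > 0)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


if __name__ == "__main__":
    print("sentences:", len(split_sentences(TEXT)))
    print("mentions of 'Holmes':", count_word(TEXT, "Holmes"))
    for (a, b), weight in cooccurrence_edges(TEXT, CHARACTERS).most_common():
        print(a, "<->", b, "weight", weight)
```

The edge weights can be loaded into any graph library to draw the network. On a real corpus, swapping the regex splitter for a proper sentence tokenizer and the fixed name list for NER output should give noticeably cleaner results.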

Topic Model for Sherlock Holmes stories.

My main goal in writing this notebook is to give some practical (and hopefully interesting) examples that show how easy and straightforward it is to perform NLP with today’s tools. I really hope that after reading this tutorial, you will try some NLP yourself and discover intriguing insights in different datasets.

Originally published at blog.dato.com.