Matthew McDermott · Data & Verse · Mar 16, 2020

Boot Camp Capstone Project: Shakespearean Data Science

Three weeks before the end of boot camp, we had to get to work in earnest on our capstone projects. The first step was to define a topic. I had many ideas, but they fell into two camps: the impractical and the uninspiring.

I decided to work on a large (10MB) chunk of text from Stack Overflow, the community Q&A site for technology professionals. The idea was to load the text into a Google Cloud MySQL server so I could develop and demonstrate my SQL skills for potential data analytics employers, then fit a predictive model of some kind. I quickly ran into two problems. One, the topic was boring. Two, these cloud platforms are a pain to use and their documentation is terrible.

I began searching for a new dataset and soon stumbled upon a .CSV file of the plays of William Shakespeare. Since I have a B.A., M.A., and some doctoral work in art history, and a creative writing habit, this appealed to me. I quickly decided to build my own ‘analytics’ section using the Pandas code library, then test whether classification algorithms could distinguish the text of tragedies from that of comedies. Most important, this was a chance to work with machine learning algorithms on more comfortable cultural-history turf: natural language processing (NLP) and Shakespeare.

Preprocessing the text

Machine learning algorithms operate on numbers, so we have to clean up our text so it can eventually be vectorized, that is, converted into numeric features our models can be fit on. Here are a few steps I took to preprocess the Shakespeare text. Depending on the situation, other steps may be necessary.

· Put all text in lowercase

· Remove all non-alphanumeric characters

· Remove all punctuation
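
As an illustration, here is a minimal sketch of those steps in Python (the regex approach is my own assumption, not the original code):

```python
import re

def preprocess(text: str) -> str:
    """Lowercase the text and strip punctuation / non-alphanumeric characters."""
    text = text.lower()                       # 1. put all text in lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # 2./3. drop punctuation and other non-alphanumerics
    return re.sub(r"\s+", " ", text).strip()  # collapse the leftover whitespace

print(preprocess("To be, or not to be: that is the question!"))
# -> "to be or not to be that is the question"
```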

I did not carry out the following two steps, because my later classification models would test through pipelines and GridSearch whether ‘stop words’ needed to be removed, and because vectorization was built into those models (CountVectorizer and TfidfVectorizer inside the pipelines); see the sketch after this list.

· Remove ‘stop words’

· Tokenize and vectorize the text
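
Here is a minimal sketch of what that pipeline-and-grid-search setup can look like in scikit-learn; the specific parameter grid is an illustrative assumption, not my exact configuration:

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# The vectorizer is a pipeline step, so tokenization/vectorization
# happens inside the model and stop-word removal becomes a hyperparameter.
pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

params = {
    "vec": [CountVectorizer(), TfidfVectorizer()],  # compare the two vectorizers
    "vec__stop_words": [None, "english"],           # test stop-word removal
}
grid = GridSearchCV(pipe, params, cv=5)
# grid.fit(X_train, y_train)  # X_train: raw text, y_train: play types
```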

Also, out of curiosity, I did not at first reduce the words in the text to their roots (lemmatize), and I found that it was not necessary for this project: my sentiment analysis and classification results came out the same either way. Perhaps this was due to the size of the corpus as well as the single, consistent author.

Shakespeare analytics

Using nothing more than Python, the Pandas library, and a Jupyter notebook, I carried out extensive Exploratory Data Analysis (EDA), as one would do for any data science project. The slides below are meant to show how the EDA process helped me get to know the Shakespeare dramatic corpus. I think even more extensive EDA on the text could produce useful results as an addition to traditional literary scholarship.
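
For flavor, here is a small EDA sketch in Pandas; the file and column names follow the commonly shared Shakespeare CSV and are assumptions on my part:

```python
import pandas as pd

df = pd.read_csv("Shakespeare_data.csv")  # assumed file name

# Basic shape of the corpus: how many lines, plays, and speakers?
print(df.shape)
print(df["Play"].nunique(), "plays")
print(df["Player"].nunique(), "named speakers")

# Which characters have the most lines across the corpus?
print(df.groupby("Player")["PlayerLine"].count().nlargest(10))

# Lines per play, as a quick proxy for play length
print(df.groupby("Play")["PlayerLine"].count().sort_values(ascending=False).head())
```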

Sentiment analysis

My next phase of this project was to apply sentiment analysis to the text to see if it could be an accurate and useful tool. For this section, I used NLTK, the Natural Language Toolkit, a code library for text processing used by data scientists for Natural Language Processing (NLP) tasks such as classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

The sentiment analysis tool I used from NLTK is called VADER: Valence Aware Dictionary and sEntiment Reasoner. It is based on a lexicon of sentiment-related words, each rated as positive, negative, or neutral. VADER was developed for social media monitoring and is used in what is called opinion mining, an approach valuable for tracking consumer sentiment in channels like social media, feedback forms, blogs, websites, and surveys.

VADER maps words to sentiment to create a dictionary of sentiment. Each word carries a valence score representing its emotional intensity, and these scores are averaged and normalized across a text to produce an overall score between -1 (most negative) and 1 (most positive).

I applied the VADER sentiment analyzer to five different plays: one tragedy, two comedies, a history play, and a comedy that is sometimes categorized as a romance. The sentiment scores match the literary classifications the plays usually receive, including the “in-between” status that history plays and romances have.
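
Here is a minimal sketch of that per-play scoring with NLTK’s VADER, reusing the dataframe and column-name assumptions from the EDA sketch above:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

# Score each play's full text; 'compound' is the normalized -1..1 summary score.
for play, lines in df.groupby("Play")["PlayerLine"]:
    text = " ".join(lines.astype(str))
    print(play, sia.polarity_scores(text)["compound"])
```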

This result greatly interests me. I think the use of sentiment analysis across the complete plays, both at the granular, line-by-line level, and at the play or category level, could be intriguing. I am interested in how these analytical tools can complement traditional literary scholarship.

Classification modeling

The final section of my project was to fit some classification models to see if they could sort text into buckets such as history or comedy. In one way, it was simply a task required to fulfill the project requirements (fit a model), but it was also a test of whether the traditional genre definitions could serve as solid class definitions in an empirical model.

Below is the code I used to pull the tragedy dataframe together from the original Shakespeare dataframe and add a column for play ‘type’. The Pandas library is powerful and learning how to use it better is a big post-course goal for me.
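
In sketch form, that step might look like this (the Play column and the abbreviated list of tragedy titles are illustrative assumptions):

```python
# Pull the tragedy rows out of the full Shakespeare dataframe
# and label them with a 'type' column for later classification.
tragedies = ["Hamlet", "Macbeth", "King Lear", "Othello"]  # abbreviated, illustrative list

tragedy_df = df[df["Play"].isin(tragedies)].copy()
tragedy_df["type"] = "tragedy"

print(tragedy_df[["Play", "type"]].drop_duplicates())
```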

Of the four models I fit, the two that worked best were logistic regression and the random forest classifier. I chose logistic regression as my best model, thinking that its performance was due to its strength as a binary classifier. I did find it odd that the l1 and l2 penalty terms did not improve performance. Also, there was no performance difference between CountVectorizer and TfidfVectorizer. I need to research further how to tune these models for NLP.
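
For reference, here is a sketch of how that penalty comparison can be wired into the grid search from earlier; the solver and grid values are my assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("vec", TfidfVectorizer()),
    # the liblinear solver supports both l1 and l2 penalties
    ("clf", LogisticRegression(solver="liblinear")),
])

params = {
    "clf__penalty": ["l1", "l2"],  # compare regularization types
    "clf__C": [0.1, 1.0, 10.0],    # and regularization strengths
}
grid = GridSearchCV(pipe, params, cv=5)
# grid.fit(train_text, train_labels)
# print(grid.best_params_, grid.best_score_)
```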

Next steps

One thing I would like to do is write a function to perform sentiment analysis line by line through Shakespeare’s plays. That means over 100,000 analyses, so I must think the process through. I would like to create visualizations of sentiment through the progression of a play, or through a particular character’s movement through a play, and compare that to traditional literary scholarship. I think the EDA or analytics work could be applied to cultural history, too. Either way, functions need to be written to speed up the process, as the value, initially, seems to require a large aggregate of analytics data.
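
As a starting point, here is one possible shape for that line-by-line function, again assuming the dataframe columns and the VADER analyzer (sia) set up above:

```python
def line_sentiments(df, play):
    """Score every spoken line of one play; returns a copy with a 'compound' column."""
    lines = df[df["Play"] == play].copy()
    lines["compound"] = lines["PlayerLine"].astype(str).map(
        lambda t: sia.polarity_scores(t)["compound"]
    )
    return lines

hamlet = line_sentiments(df, "Hamlet")
# A rolling mean smooths the noisy per-line scores into an arc across the play.
print(hamlet["compound"].rolling(100).mean().dropna().head())
```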


Matt is a data professional currently enrolled in General Assembly’s data science intensive boot camp. He is also a Dad who writes poetry and plays drums.