NLP From The Underground Part II
Having fun with statistics and the greatest works of 19th-century literature
This article assumes you have read NLP from the Underground Part I, and picks up where the previous article left off. To recap: The corpus under consideration is Dostoevsky’s five masterpieces: Notes from the Underground, Crime and Punishment, The Idiot, The Possessed, and The Brothers Karamazov. To quantify sentiment we used the NRC emotion lexicon and the AFINN sentiment lexicon.
The complete code for this analysis can be found on GitHub.
Natural language processing is great. Reading great books is great. Let’s continue our analysis.
Machine Learning with the NRC lexicon
So far, the most context introduced into the analysis has been at the level of bigrams. Using AdaBoost classification, I will attempt to broaden this to the level of sentences. I have begun by creating one big list of the sentences from each book. From this list, I have taken two random samples of size 100 and size 30. These will respectively be our training and test sets.
I have manually labelled each sentence as either conveying negative sentiment (-1), being neutral (0), or conveying positive sentiment (1). We’ll load the labelled sentences back in as follows:
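A minimal sketch of this loading step, assuming the labels were saved as CSV with two hypothetical columns named `sentence` and `sentiment`. The in-memory text below is made-up illustration data standing in for the real labelled file, whose name and layout are not given here.

```python
import io

import pandas as pd

# Made-up stand-in for a labelled CSV file. The column names are assumptions.
labelled_csv = io.StringIO(
    "sentence,sentiment\n"
    '"I am a sick man.",-1\n'
    '"He went out into the street.",0\n'
    '"She laughed with joy.",1\n'
)

train_df = pd.read_csv(labelled_csv)

# Catch labelling typos early: only -1, 0 and 1 are valid labels.
valid_labels = {-1, 0, 1}
assert set(train_df["sentiment"]).issubset(valid_labels)
```

With a real file, the `io.StringIO` object would simply be replaced by the path to the CSV.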
The data frames we’re working with now look like this:
Using the NRC lexicon, the features the algorithm will learn from are the number of words within the sentence that fall into each NRC emotion/sentiment. The cleaning and preparation of the data for this step are similar to that of the previous NRC section, just that we are now considering full sentences:
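To make the feature construction concrete, here is a rough sketch that counts, for one sentence, how many words fall into each category. `TOY_NRC` is a tiny made-up stand-in for the real NRC lexicon (its word-to-category assignments are illustrative only), and `nrc_features` is a hypothetical helper name, not a function from the original code.

```python
import re
from collections import Counter

# Tiny made-up stand-in for the NRC lexicon: word -> categories.
# The real lexicon covers far more words and categories.
TOY_NRC = {
    "joy": {"joy", "positive"},
    "love": {"joy", "positive", "trust"},
    "murder": {"anger", "fear", "negative", "sadness"},
    "afraid": {"fear", "negative"},
}
CATEGORIES = ["anger", "fear", "joy", "sadness", "trust", "negative", "positive"]


def nrc_features(sentence, lexicon=TOY_NRC, categories=CATEGORIES):
    """Return one count per category for the words in `sentence`."""
    words = re.findall(r"[a-z']+", sentence.lower())
    counts = Counter()
    for word in words:
        # Words missing from the lexicon contribute nothing.
        for category in lexicon.get(word, ()):
            counts[category] += 1
    return [counts[category] for category in categories]


features = nrc_features("He was afraid of love, afraid of joy.")
```

Each sentence becomes one row of counts, and stacking the rows gives the feature matrix the classifier learns from.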
Feeding the prepared training and test data into scikit-learn’s AdaBoostClassifier with 70 estimators, I received a training accuracy of 0.72 and a test accuracy of 0.70. Works for me.
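The fitting step might look roughly like the sketch below. The data are random stand-ins with the same shapes as the real sets (100 training and 30 test sentences, with ten count features assumed), so the scores it produces are meaningless and will not match the ones reported above. The `random_state` value is also an assumption.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

# Random stand-ins for the real features and labels.
# Features are small non-negative counts; labels are -1, 0 or 1.
X_train = rng.integers(0, 5, size=(100, 10))
y_train = rng.integers(-1, 2, size=100)
X_test = rng.integers(0, 5, size=(30, 10))
y_test = rng.integers(-1, 2, size=30)

model = AdaBoostClassifier(n_estimators=70, random_state=0)
model.fit(X_train, y_train)

# For classifiers, `score` returns mean accuracy.
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
predictions = model.predict(X_test)
```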
Before going further, a caveat is necessary. The model’s predictions will be an imperfect measure of sentiment. For more reliable results, I could have used larger training and validation sets labelled by multiple people other than myself. I may even do this if I find the time. But for now, I believe the model is a strong enough tool for us to pry out insights from the data, just as long as we keep its weak points in mind.
Let’s use the model to predict the sentiment of each sentence in our corpus.
The following plots show the trajectory of each novel’s sentiment, according to our model. Each data point plotted is the sum of sentiment from 50 sentences. The x-axis represents narrative time.
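The aggregation behind each plotted point can be sketched as follows. `chunk_sums` is a hypothetical helper, and letting the final chunk be shorter than 50 sentences is an assumption; the original may handle the leftover sentences differently.

```python
import numpy as np


def chunk_sums(scores, chunk_size=50):
    """Sum sentence-level scores over consecutive chunks of `chunk_size`."""
    scores = np.asarray(scores)
    # Split at chunk_size, 2 * chunk_size, ...; the last chunk may be shorter.
    split_points = np.arange(chunk_size, len(scores), chunk_size)
    return np.array([chunk.sum() for chunk in np.split(scores, split_points)])


# Made-up predictions: 50 positive, 50 negative, then 20 neutral sentences.
toy_predictions = np.array([1] * 50 + [-1] * 50 + [0] * 20)
trajectory = chunk_sums(toy_predictions)
```

Plotting `trajectory` against its index then gives one point per 50 sentences, in narrative order.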
And here are the mean scores produced by our model. A score of 1 would indicate every sentence is positive, -1 would indicate every sentence is negative.
By this measure, The Brothers Karamazov has the highest positive sentiment, and Notes from the Underground has the highest negative sentiment. Wonderful! I found reading Brothers to be the most uplifting experience of the bunch and Notes to be the most depressing.
Statistical Inference with the AFINN Lexicon
Now we’ll use inferential statistics to see how the five masterworks compare to other great works of the 19th century. The question I want to answer is: Using our measure of positive and negative sentiment computed in part I with the AFINN lexicon, is the mean sentiment within each of Dostoevsky’s masterpieces less than the mean sentiment of the average great work of 19th-century literature?
Right away, some clarifications are necessary. The category “great works of 19th-century literature” is not well defined. There is no universally agreed-upon list of such books. But we can achieve a good approximation that most people would agree with. The website goodreads.com has a list of the 1,076 highest-ranked works of literature from the 19th century. The rankings are an aggregation of Goodreads users’ rankings.
I used the beautifulsoup4 package to scrape all titles from the list:
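A rough sketch of the extraction step, run on a hand-written HTML fragment rather than the live site. The `bookTitle` class name is an assumption about the page markup and may not match what Goodreads actually serves. `extract_titles` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one page of the list. The markup is assumed.
sample_html = """
<table>
  <tr><td><a class="bookTitle" href="#"><span>Jane Eyre</span></a></td></tr>
  <tr><td><a class="bookTitle" href="#"><span>Crime and Punishment</span></a></td></tr>
  <tr><td><a class="authorName" href="#"><span>Charlotte Brontë</span></a></td></tr>
</table>
"""


def extract_titles(html):
    """Return the text of every link marked with the assumed title class."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="bookTitle")
    return [link.get_text(strip=True) for link in links]


titles = extract_titles(sample_html)
```

For the full list, the same function would be applied to the HTML of each page in turn and the results concatenated.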
An issue became apparent when I looked at the titles: many of them were works by Dostoevsky. Luckily, Goodreads also has a page listing the highest-ranked books by Dostoevsky. The following code removes all Dostoevsky titles from our greatest books list.
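The filtering can be sketched as a simple set lookup. `remove_titles` is a hypothetical helper, and the titles are illustration data. Matching on the normalized title string is an assumption, and it would miss a book listed under a different translated title.

```python
def remove_titles(all_titles, titles_to_remove):
    """Drop every title that appears in `titles_to_remove`.

    Matching ignores case and surrounding whitespace.
    """
    blocked = {title.strip().lower() for title in titles_to_remove}
    return [title for title in all_titles if title.strip().lower() not in blocked]


# Illustration data only.
greatest = ["Jane Eyre", "The Idiot", "Moby-Dick", "the idiot "]
dostoevsky = ["The Idiot", "The Brothers Karamazov"]
remaining = remove_titles(greatest, dostoevsky)
```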
To perform the statistical tests needed to answer our question, we must take a random sample of the titles. In the code below, I have taken a sample of 60 titles. However, some of the books are not available in an English digital format, rendering them incompatible with our analysis. The end result is a sample of 42 titles, which is about what I was aiming for.
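A minimal sketch of drawing the sample, assuming the titles are held in a plain Python list. `sample_titles` is a hypothetical helper, and the seed value is arbitrary. Sorting before sampling is a design choice that makes the draw independent of the order in which titles were scraped.

```python
import random


def sample_titles(titles, k=60, seed=42):
    """Draw `k` distinct titles, repeatably for a given seed and title set."""
    rng = random.Random(seed)
    # Sorting removes any dependence on the scraping order.
    return rng.sample(sorted(titles), k)


# Placeholder titles standing in for the scraped list.
toy_pool = [f"Title {i}" for i in range(200)]
sampled = sample_titles(toy_pool)
```

A fixed seed only makes the draw repeatable for a fixed set of titles. If the underlying list changes, the sample changes with it.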
Note: This was the first time I’ve scraped an actively changing website. So you can imagine my surprise when, giving this article a final review, I found that the code produced a different set of titles. If anyone is aware of a way to scrape from actively changing websites in a more reproducible manner, please let me know.
Below is the list of titles that make up the final sample. The method used to get the contents of each book is the same as getting the contents of Dostoevsky’s books from part I.
At this point, it’s necessary to revisit the question we’re trying to answer. The question we are really asking is now: Is the mean sentiment within each of Dostoevsky’s masterpieces less than the mean sentiment of the average book on the Goodreads’ greatest books of the 19th-century list that is available in an English digital format? You may no longer think our question is meaningful (assuming you thought the first question was meaningful). I’ll note that only obscure titles lack representations that lend themselves to our analysis. All of my picks for the greatest books of the 19th century are on the list and are available in an English digital format. I’m still interested.
For each of Dostoevsky’s masterpieces, our null and alternative hypotheses are:
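One way to write the hypotheses down, under the assumption that each novel’s AFINN score is treated as a fixed value $d_i$ and that $\mu$ denotes the mean AFINN score of the population of books the sample is drawn from (both symbols are introduced here for illustration):

$$
H_0: \mu = d_i \qquad \text{versus} \qquad H_a: \mu > d_i, \qquad i = 1, \dots, 5
$$

Rejecting $H_0$ for a given novel supports the claim that its sentiment is below that of the average book on the list.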
The significance level I will be using is 0.05. Since I’ll be testing for each of the 5 masterpieces, a Bonferroni correction updates our significance level to 0.01.
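Written out, with $m$ denoting the number of tests (a symbol introduced here for illustration), the Bonferroni correction divides the significance level by the number of tests:

$$
\alpha_{\text{corrected}} = \frac{\alpha}{m} = \frac{0.05}{5} = 0.01
$$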
Let’s apply AFINN scores to our sample and examine some properties:
Minimum in Sample: -0.004975
Maximum in Sample: 0.082889
Sample Mean: 0.028879
Sample Standard Deviation: 0.021198
To confidently interpret the results of the test, the following assumptions must be met:
- Observations are independent: I think this is safe to assume. It’s possible that an author of one book on the list could be influenced by other books on the list. But I don’t believe such influence will be significant enough to corrupt our test.
- The data must be approximately normally distributed: I will perform a normality test on my sample data with a significance level of 0.05. If the test returns a p-value less than 0.05, I’ll have to conclude the data is not normally distributed. Otherwise, we can continue with the assumption that we are dealing with a normal distribution. Shown in the code block below, a normality test on my sample produces a p-value of about 0.18. We can proceed.
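The normality check could be sketched as below, assuming the test in question is SciPy’s `normaltest` (the D’Agostino-Pearson test). The scores are synthetic, drawn from a normal distribution using the sample mean and standard deviation reported above, so the p-value will not match the 0.18 from the real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-in for the 42 per-book AFINN scores.
sample_scores = rng.normal(loc=0.028879, scale=0.021198, size=42)

statistic, p_value = stats.normaltest(sample_scores)

# A p-value below 0.05 would lead us to reject normality.
looks_normal = bool(p_value >= 0.05)
```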
The assumptions are met, so we can perform our tests:
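One plausible way to run the tests is a one-sided, one-sample t-test of the sample against each novel’s score, using SciPy’s `ttest_1samp`. This is a sketch under that assumption, not necessarily the procedure used here. The sample is synthetic and the two novel scores are hypothetical placeholders, not values from part I.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-in for the 42 per-book AFINN scores.
sample_scores = rng.normal(loc=0.028879, scale=0.021198, size=42)

# Hypothetical placeholder scores for two novels.
book_scores = {"Book A": 0.010, "Book B": -0.002}

# Bonferroni-corrected significance level for five tests.
alpha = 0.05 / 5

results = {}
for title, score in book_scores.items():
    # Alternative hypothesis: the population mean is greater than the score.
    t_stat, p_value = stats.ttest_1samp(
        sample_scores, popmean=score, alternative="greater"
    )
    results[title] = (t_stat, p_value, bool(p_value < alpha))
```

Each entry in `results` holds the t statistic, the p-value, and whether the null hypothesis is rejected at the corrected level.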
Here are the results in a nice table:
Right on. None of our null hypotheses stands a ghost of a chance. We can conclude that the mean sentiment within each of Dostoevsky’s masterpieces is less than the mean sentiment of the average book on the Goodreads’ greatest books of the 19th-century list that is available in an English digital format. It is your decision whether or not this generalizes to any more interesting questions.
Thank you for reading, friends.
Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages, volume 718 of CEUR Workshop Proceedings, pages 93–98.
Mohammad, S. M. and Turney, P. D. (2013). Crowdsourcing a Word–Emotion Association Lexicon. Computational Intelligence, 29 (3), pages 436–465.
Mohammad, S. M. and Turney, P. D. (2010). Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text.
Silge, J. and Robinson, D. (2020). Text Mining with R. O’Reilly. https://www.tidytextmining.com/index.html