A data driven look at the reactions to the Mueller report
On Friday, March 22, news broke that Special Counsel Robert Mueller had finished his investigation and filed a report detailing his findings with Attorney General Bill Barr. In case you are not American (or have been living under a rock), this was major news (i.e. one of the few times the CNN “breaking news” chyrons are actually on point :P) and was eagerly anticipated by the media and Congress. Having observed the political discourse over the last few years and feeling the need to be pithy, I tweeted out this short statement joking about the potential impact of the report.
On Sunday night, I wanted to check if there was a kernel of truth behind my rather cynical comment. More importantly though, I wanted to write some code in R and implement some text analysis. So I decided to analyze tweets mentioning Robert Mueller and gauge the national reaction to the report and its aftermath.
Briefly, I downloaded the 15,000 most recent tweets mentioning the word “Mueller” using the search_tweets function in the rtweet package. To attach a location to these Tweets, I used the lat_lng function from the same package. Now I was ready to clean up the text and assign a sentiment score to each Tweet.
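The data pull described above can be sketched roughly as follows. This assumes a Twitter API token is already configured for rtweet, and that excluding retweets is acceptable (the original analysis may have included them):

```r
library(rtweet)

# Pull the 15,000 most recent tweets containing the word "Mueller".
# include_rts = FALSE keeps each tweet's text unique; this is an assumption,
# not necessarily the setting used in the original analysis.
mueller_tweets <- search_tweets("Mueller", n = 15000, include_rts = FALSE)

# lat_lng() parses the geo/place fields of each tweet into numeric
# lat and lng columns, where available.
mueller_tweets <- lat_lng(mueller_tweets)
```

Note that only a minority of tweets carry usable location data, so the geocoded subset will be smaller than the full pull.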
I started by looking at the raw text data, which gave me the idea of doing some preliminary data cleaning and removing obviously unnecessary text such as http links using the gsub function in base R. Subsequently, I used the unnest_tokens function in the tidytext package to turn the data frame into one row per term per document. I also removed unwanted stop words (e.g. “I”, “the”, “and”) and characters taken up by formatting (e.g. “—”, “-”) so that we only keep the words which truly represent the sentiment of the text. Finally, I used the AFINN lexicon to assign a sentiment score to each word in the Tweet and, using a summarize operation, calculated the average sentiment of each Tweet. If you want to follow the steps I took in more detail, you can refer to my code snippet here or follow this wonderful blog post I often used.
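The cleaning and scoring pipeline can be sketched like this. The column names (`status_id`, `text`) match older rtweet versions, and `get_sentiments("afinn")` returns a `value` column in recent tidytext releases (older releases called it `score`), so treat the details as an approximation:

```r
library(dplyr)
library(tidytext)

# Strip links and other obvious noise from the raw tweet text.
tweets_clean <- mueller_tweets %>%
  mutate(text = gsub("http\\S+\\s*", "", text))

# One row per word per tweet, with stop words and formatting
# characters removed.
tweet_words <- tweets_clean %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("—", "-"))

# Score each remaining word with the AFINN lexicon, then average
# the word scores within each tweet.
tweet_sentiment <- tweet_words %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(status_id) %>%
  summarize(sentiment = mean(value))
```

The inner join silently drops words that have no AFINN score, which is one of the lexicon limitations discussed at the end of this post.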
Here are 3 broad takeaways from my analysis:
1. The reactions to the report seem to be fairly polarized in both directions — there’s a conspicuous lack of neutral reactions.
The overall mean of the sentiment scores was somewhat negative (-0.3), which could lead one to assume that Twitter users reacted fairly neutrally to the report. However, this is a common fallacy people fall into when they look at just one summary statistic to understand something. We must remember that the trick to understanding more about a variable is to look past a basic summary statistic and study the underlying distribution of the variable.
Hence, I used a histogram to learn more about the distribution of the overall sentiment around Robert Mueller on Twitter. From the histogram, I could easily see that the sentiment was fairly bimodal: there were large peaks around -2.5 and +2.5. In contrast, there was a sharp drop around a sentiment score of 0, i.e. a neutral score. This tells me that the Twitterverse was far from neutral on the report; in fact it had fairly strong reactions, either because it was a controversial report or probably because Twitter in some ways self-selects for people who have strong emotional reactions.
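A histogram like the one described is a one-liner in ggplot2, assuming a per-tweet sentiment data frame of the kind built earlier (with a numeric `sentiment` column):

```r
library(ggplot2)

# Distribution of per-tweet average sentiment; a binwidth of 0.5 is
# an arbitrary choice that shows the bimodal shape clearly.
ggplot(tweet_sentiment, aes(x = sentiment)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Average AFINN sentiment per tweet",
       y = "Number of tweets")
```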
2. The coasts and particularly the west coast aren’t huge fans of the report or the subsequent analysis by Bill Barr.
Next, I decided to learn more about the location of the Tweets and dive into how different parts of America were responding to the report online. I used ggplot to map each location-tagged Tweet onto the US map and colored it by sentiment: blue for negative tweets and red for positive ones.
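One way to build such a map is with ggplot2's `borders()` helper (which requires the maps package). This sketch assumes the geocoded tweet data and per-tweet sentiment scores from the earlier steps, joined on a `status_id` column:

```r
library(dplyr)
library(ggplot2)

# Keep only tweets that have both coordinates and a sentiment score.
geo_tweets <- mueller_tweets %>%
  inner_join(tweet_sentiment, by = "status_id") %>%
  filter(!is.na(lat), !is.na(lng))

# Plot each tweet on a US state basemap, colored blue (negative)
# through red (positive).
ggplot(geo_tweets, aes(x = lng, y = lat, color = sentiment)) +
  borders("state") +
  geom_point(alpha = 0.5) +
  scale_color_gradient2(low = "blue", mid = "grey85", high = "red",
                        midpoint = 0, name = "Sentiment") +
  coord_quickmap() +
  theme_void()
```

`coord_quickmap()` gives a reasonable aspect ratio for a quick map without pulling in a projection library.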
Such a map shows us that the overall negative sentiment toward the report seems concentrated on the blue west coast, and specifically in the Los Angeles and San Francisco regions. Interestingly, though there is a lot of reaction in the North East and it is overwhelmingly negative, you also see some highly positive sentiment Tweets in this region. It is possible that this is being driven by the media centers and politicians clustered in New York and Washington, D.C., who might be more likely to give neutral or even outright positive messages in an effort to signal their bipartisan or fair credentials.
3. The “influencers” seem slightly more positive about the report than the public at large
When we mapped the sentiment of the tweets on the US map above, we observed that within the generally negative sentiment toward the report in the North East, New York and D.C. had some clusters of positive Tweets. In the previous section, I hypothesized that this could be due to influencers, either in media or politics, Tweeting neutral, positive or optimistic things about the report.
In order to investigate this further, I decided to compare the overall sentiment of the Tweets between the two groups. One of the columns returned by the search_tweets function is “verified”. Although not perfect, I believe whether or not a Twitter user is verified is a good proxy for their influence, since highly influential folks in media and politics likely have verified profiles. I plotted a box plot to compare the summary of sentiment scores across these two groups of users: verified and not verified.
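The comparison boils down to a grouped box plot. Again assuming the tweet data and sentiment scores from earlier, joined on `status_id`, and a logical `verified` column as returned by rtweet:

```r
library(dplyr)
library(ggplot2)

# Attach sentiment to each tweet and label users by verified status.
verified_sentiment <- mueller_tweets %>%
  inner_join(tweet_sentiment, by = "status_id") %>%
  mutate(group = ifelse(verified, "Verified", "Not verified"))

# Compare the distributions of sentiment across the two groups.
ggplot(verified_sentiment, aes(x = group, y = sentiment)) +
  geom_boxplot() +
  labs(x = NULL, y = "Average AFINN sentiment per tweet")
```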
I could easily see that on average verified users were more positive about the report than non-verified users. Additionally, the 75th percentiles also showed that the reactions were less extreme in both directions for verified users. Perhaps this shows that at least the more influential Twitter users were more restrained and less extreme in their reactions, which is a hopeful sign. Many Democratic influencers tweeted out positive messages about the report, and in some cases, rather than attack the report, they simply demanded that the whole report be made public, which could also influence this. Below is an example of such a positive tweet by a highly popular Democrat which was captured in my analysis.
One of my worries in this analysis was that we would have a ridiculously small number of verified users, which could skew the results. However, though there is a class imbalance and there are many more non-verified users on Twitter, our dataset had several hundred tweets from verified users, which gives me more confidence in the results.
None of these results are very surprising to me; after all, it has been very well documented that the US is getting more and more partisan, and the investigation was perceived to be largely political. I am also not going to comment on the implications of this report, whether anything big will come out of it and what it means for the President. I am no expert in such issues and I assume everyone reading this blog can easily pick up the Wall Street Journal and/or New York Times, read it and form their own opinions (I highly encourage you to do this!). Additionally, I want to point out that there are a number of drawbacks to this quick and dirty analysis:
1. I only considered roughly 15,000 recent tweets, and the universe of Twitter users is much larger than that.
2. This is data from only one platform, and it is possible that users of other platforms, or folks who are busy with life, are reacting differently.
3. The AFINN lexicon we used to assign sentiment is also not perfect, and we might miss a lot of words which don’t have a score assigned to them.
However, to conclude, I did want to point out that this entire analysis, from when I got the initial idea to the final blog post, took only a few hours, and that’s a testament not to my intelligence but rather to the highly modular form in which R packages have been written, as well as the clear documentation of the functions within. Hence, I want to specially acknowledge the creators of tidytext, tidyverse and rtweet for their excellent packages and dedicate this post to them as a small token of gratitude. I would love to hear your comments on the results and additional analysis you would like to see me do.