During the early stage of this election cycle, which now seems almost a lifetime ago, back when the campaigns were all about (at least the projection of) transparency, policy, and governance, then-candidate Jeb Bush voluntarily released his official emails. The emails contained 1,884,843 lines of text with more than half a billion words, 600 times the size of the Bible. Collectively, those emails capture the thoughts and actions of a man who, at that point, had every chance to become our next President.
To gain insight into his mind, we did some data “unscience” with his emails. This post uses some of the results of that work. It makes more sense if you first read the original post (a 2-minute read) about Technology Aided Gut (TAG) Check: Emails and Influence.
Breaking the big files
We parsed the raw data files (available, compressed, at American Bridge) into a very large (3.2 GB) tab-separated text file. Each record has five tab-separated fields: from, to, date, subject, and body. If an email contained forwarded messages (or a quoted message of the kind typically found in a reply), we treated them as part of the body. This results in double counting in many cases, but since our analysis looked only at the flow of the emails and not their content, the redundancy did not matter.
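For readers who want to work with a file in this five-field layout, here is a minimal Python sketch of reading it one record at a time. It is an illustration, not the code we ran (that was Perl), and it assumes one email per line with any tabs after the fourth belonging to the body. The names `Email` and `parse_lines` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Email(NamedTuple):
    """One record of the tab-separated file: from, to, date, subject, body."""
    sender: str
    recipient: str
    date: str
    subject: str
    body: str


def parse_lines(lines: Iterable[str]) -> Iterator[Email]:
    """Yield one Email per well-formed line; skip lines with fewer than five fields."""
    for line in lines:
        # Split on the first four tabs only, so stray tabs stay inside the body.
        fields = line.rstrip("\n").split("\t", 4)
        if len(fields) == 5:
            yield Email(*fields)


# Made-up sample data, for illustration only.
sample = [
    "alice@acme.example\tgovernor@state.example\t2001-03-04\tBudget\tSee attached.\n",
    "malformed line without tabs\n",
]
records = list(parse_lines(sample))
```

Because `parse_lines` is a generator, passing it an open file handle processes a multi-gigabyte file line by line without loading it all into memory.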
Here’s the relevant section of the Perl code that we used to compute influential people and companies.
The code is fairly self-explanatory (especially if you are familiar with Perl syntax). The two main functions are compute_influencers (line 105) and compute_companies (line 143). They implement our heuristics for identifying the people and companies who influenced the governor over the years. From these we generate frequency distribution tables of unique influencers and companies.
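To give a feel for what a frequency distribution table of this kind looks like, here is a small Python sketch. It does not reproduce the heuristics in compute_influencers or compute_companies. As a simplifying assumption, it counts each sender address as a person and the domain of that address as a company. The regular expression and the function name `count_senders_and_domains` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical pattern: the first thing in a "from" field that looks like an
# email address; group 1 captures the domain part.
ADDRESS = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")


def count_senders_and_domains(from_fields):
    """Return (senders, domains) frequency tables built from "from" fields."""
    senders = Counter()
    domains = Counter()
    for field in from_fields:
        match = ADDRESS.search(field.lower())
        if match is None:
            continue  # no recognizable address in this field
        senders[match.group(0)] += 1  # full address stands in for a person
        domains[match.group(1)] += 1  # domain stands in for a company
    return senders, domains


# Made-up sample data, for illustration only.
from_fields = [
    "Alice <alice@acme.example>",
    "alice@acme.example",
    "Bob <bob@acme.example>",
    "Carol <carol@other.example>",
    "no address here",
]
senders, domains = count_senders_and_domains(from_fields)
top_senders = senders.most_common(2)
```

`Counter.most_common` returns the entries sorted by count, which is the ordering a tag cloud needs. A real dataset would also need rules for free webmail domains and for one person using several addresses, which this sketch ignores.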
As long as the source data can be organized as tab-separated text, this code can be used to detect influencers from any person's set of emails. You are welcome to use the code for your own datasets, but at this time we cannot provide any warranty or support.
After generating the frequency distribution tables we displayed them as tag clouds. Tag clouds are a simple and efficient way to visually highlight the dominant data patterns.
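The core of a tag cloud is a mapping from counts to font sizes. Here is a minimal Python sketch of one way to do that, using linear interpolation between a smallest and a largest size. It is an assumption for illustration and not how any particular tag cloud tool works. The function name `font_sizes` and the default sizes are hypothetical.

```python
def font_sizes(counts, min_size=10, max_size=48):
    """Map each term's count to a font size by linear interpolation."""
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    sizes = {}
    for term, count in counts.items():
        # When every count is equal, span is 0; use the midpoint size for all.
        ratio = (count - low) / span if span else 0.5
        sizes[term] = round(min_size + ratio * (max_size - min_size))
    return sizes


# Made-up counts, for illustration only.
sizes = font_sizes({"acme.example": 30, "other.example": 10, "third.example": 20})
```

With linear scaling, a single very frequent term can push everything else toward the minimum size. Interpolating on the logarithm of the counts is a common alternative in that case.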
If you are interested in generating tag clouds of your own, here is a web app that can generate tag clouds from PDFs, web pages, text files, Google Docs, and even images (for Docs, privacy should be set to public). If you are a Google Docs user, there is also a great Google Docs add-on to generate tag clouds.
Originally published at blogs.dataunscience.org on July 24, 2016.