Ranking Keywords in Hillary Clinton’s Emails: D3 Word Cloud

Repository: https://gist.github.com/manishmshiva/44dd1fdcf74b8d5ac052c784014031bd
Visualisation: http://bl.ocks.org/manishmshiva/44dd1fdcf74b8d5ac052c784014031bd

We have all had enough of Hillary Clinton’s email server. I don’t think many people really cared about it, but the archive is quite an interesting dataset to work with.

In this project, we are going to get the archive, grab the subject lines of the emails, rank the keywords, and put them in a D3 word cloud.

Getting the data

Let’s begin by getting the data from Kaggle. Once you have downloaded and extracted the dataset, you should see a few CSV files in the folder. The one we are most interested in is the “emails.csv” file.

The emails.csv file is quite heavy at 25MB. To get a quick look at the data, you can import the file into Excel. The field we are interested in is titled “MetadataSubject”. Let’s get the data.

Extracting the data

  1. Install Node.js & NPM. Create a new directory and copy the file “emails.csv” into the folder.
  2. Install the csvtojson module globally and convert the file to JSON using the following commands.
> sudo npm install csvtojson -g
> csvtojson ./emails.csv > emails.json

Now that we have the data converted to JSON, it’s easier to work with. Let’s create an array by extracting the MetadataSubject field and filtering out empty values.

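// emails.json is the file produced by csvtojson in the previous step.
var data = require('./emails.json');
var subjects = data
  .map(function(d) { return d["MetadataSubject"]; })
  .filter(function(d) { return d ? true : false; });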

Next, the subject lines are joined into a single string, from which the keywords are extracted once the stop words have been removed. The keyword-extractor module handles this well.
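
The joining step is not shown in the snippets here, so the following is only a minimal sketch of it. It assumes the filtered subject lines are held in an array called `subjects` (a name chosen for this sketch, filled with made-up sample values) and that the joined result is the `processedString` handed to the extractor further down. `joinSubjects` is a hypothetical helper.

```javascript
// Made-up sample of filtered subject lines; the real array comes from emails.json.
var subjects = ['Meeting schedule', 'Re: Meeting schedule', 'Press call'];

// Join with a space so that words from adjacent subjects do not run together.
function joinSubjects(list) {
  return list.join(' ');
}

var processedString = joinSubjects(subjects);
console.log(processedString);
// Meeting schedule Re: Meeting schedule Press call
```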

var extractor = require('keyword-extractor');

The extraction returns an array of keywords, duplicates included. Words shorter than four characters are then filtered out to improve the quality of the result. Finally, we rank the keywords by building an object that maps each keyword to its frequency.

var _final = {};
extractor.extract(processedString, {
  language: "english",
  remove_digits: true,
  return_changed_case: true,
  remove_duplicates: false // keep duplicates so that frequencies can be counted
})
.filter(function(d) { return d.length > 3; })
.forEach(function(d) {
  _final[d] ? _final[d]++ : _final[d] = 1;
});

Finally, the ranked object is converted into an array of objects, which is sorted by frequency and sliced to keep the top 25 keywords.

var fs = require('fs');
var result = [];
for (var o in _final) result.push({key: o, value: _final[o]});
result = result.sort(function(a, b) { return b.value - a.value; }).slice(0, 25);
fs.writeFileSync('./keywords.json', JSON.stringify(result));
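
To see the counting and ranking steps on a small input, here is a self-contained sketch that does both in one function. `rankKeywords` is a hypothetical helper written for illustration and the sample words are made up, not taken from the dataset. It is a slight variation on the snippets in this section: it counts into an object without a prototype, so that a keyword such as "constructor" cannot collide with an inherited property.

```javascript
// Count word frequencies, then return the `limit` most frequent
// words as {key, value} pairs, most frequent first.
function rankKeywords(words, limit) {
  // Object.create(null) has no inherited properties to clash with keywords.
  var counts = Object.create(null);
  words.forEach(function (w) {
    counts[w] = (counts[w] || 0) + 1;
  });
  return Object.keys(counts)
    .map(function (k) { return { key: k, value: counts[k] }; })
    .sort(function (a, b) { return b.value - a.value; })
    .slice(0, limit);
}

// Made-up sample words.
var sample = ['meeting', 'call', 'meeting', 'schedule', 'call', 'meeting'];
var top = rankKeywords(sample, 2);
console.log(top[0].key, top[0].value); // meeting 3
console.log(top[1].key, top[1].value); // call 2
```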

Visualising the data

Now that we have our data, let’s build a word cloud to visualise it. There are a number of ways to build a word cloud, but for this project, we are going to use the D3 pack layout.

The word cloud that we are going to build is a hack around the pack layout. Instead of drawing the circles that the pack layout positions, we draw a text label at the centre of each circle, and the arrangement of labels becomes the word cloud.

The complete document to render the word cloud can be found here. Let’s look at the core concepts.

We begin by initialising a pack layout.

// D3 v3 API; in D3 v4 and later the equivalent is d3.pack().
var pack = d3.layout.pack()
  .size([width, height])
  .padding(-10);

Next, we need to get the data. The d3.json method fetches the keywords.json file and hands us the list of ranked keywords that we prepared.
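
The loading call itself is not shown in this walkthrough, so here is a minimal sketch of what it might look like. It assumes D3 v3, where `d3.json` takes a URL and a callback that receives `(error, data)`, and it assumes keywords.json is served from the same folder as the page. `loadKeywords` and `render` are hypothetical names, and the D3 object is passed in as a parameter only so that the helper can be tried outside a browser.

```javascript
// Hypothetical helper: fetch the ranked keywords and hand them to `render`.
// `d3lib` is expected to be the D3 v3 object.
function loadKeywords(d3lib, url, render) {
  d3lib.json(url, function (error, keywords) {
    if (error) {
      // Stop here; there is nothing to draw without the data.
      console.error('Could not load ' + url, error);
      return;
    }
    // keywords is the array written earlier: [{key: ..., value: ...}, ...]
    render(keywords);
  });
}
```

In the page, this would be called as `loadKeywords(d3, 'keywords.json', draw)`, where `draw` stands for whatever function builds the cloud from the keywords.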

Once we have the data, we run it through the pack layout, which takes the key-value pairs and computes a position (x, y) and a radius for each of them.

var nodeData = {title: "RankedKeywords", value: 100, children: keywords};
nodeData = pack.nodes(nodeData);

Let’s scale the font size of each word by its frequency so that the words come out in different sizes.

var nodeR = keywords.map(function(d){return d.value});
var fontScale = d3.scale.linear()
.domain([d3.min(nodeR),d3.max(nodeR)])
.range([1,4])
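
To make the mapping concrete, here is a plain JavaScript sketch of what a linear scale with a domain of [min, max] and a range of [1, 4] computes. `linearScale` is a hypothetical stand-in written for illustration, not D3's implementation, and the frequencies are made up.

```javascript
// Returns a function that maps the interval [d0, d1] linearly onto [r0, r1].
function linearScale(d0, d1, r0, r1) {
  return function (x) {
    // Guard against a zero-width domain; returning r0 is a choice made for this sketch.
    if (d1 === d0) return r0;
    return r0 + (x - d0) * (r1 - r0) / (d1 - d0);
  };
}

// Made-up frequencies: the rarest keyword appears 10 times, the most common 110 times.
var scale = linearScale(10, 110, 1, 4);
console.log(scale(10));  // 1
console.log(scale(60));  // 2.5
console.log(scale(110)); // 4
```

With this range, the least frequent keyword is drawn at 1em, the most frequent at 4em, and everything else falls proportionally in between.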

Now we need to append a <g> element with the class name ‘node’ for each entry, and then append a text element to it. Setting the font size of each word to the result of the fontScale function for its frequency turns the layout into a word cloud.

svg.selectAll('.node')
  .data(nodeData)
  .enter()
  .append('g')
  .attr('class', 'node')
  .attr('transform', function(d) { return "translate(" + d.x + "," + d.y + ")"; })
  .append('text')
  .text(function(d) { return d.key; })
  .attr('text-anchor', 'middle')
  .attr('fill', '#000')
  .style({
    'font-family': '"Open sans",sans-serif',
    'font-size': function(d) { return fontScale(d.value) + 'em'; }
  });

The final output should look like this:

Final output of ranked keywords from Hillary Clinton’s emails

Additional comments on the code can be found in the gist. If you have questions, email me at manishshivanandhan@gmail.com.

Manish is a Full Stack Web Developer, Machine Learning Engineer & Visualisation Expert. He regularly speaks on topics of Machine Learning, Web Application Development and Visualisations. Read his full profile at www.manishshivanandhan.com.