Hillary emails: not just useful for winning elections

Chris Savvopoulos

Published in Red Sift Outbox, Dec 6, 2016

… but also, for freeing your inbox from scheduling requests.

What?

Here at Red Sift we are using the Hillary Dataset to create a virtual secretary for your inbox. Whenever friends and colleagues send you an email to plan something with you, we will understand their email and take care of notifying you and adding it to your calendar.

How do you do it?

With Machine Learning and Natural Language Understanding. This is basically a structured way of telling a computer that, when a date-like phrase follows the words “how about coffee”, the intent of that sentence is to schedule something. We used the Hillary emails (as well as the Enron emails, which were part of Enron’s prosecution and thus public) to get lots of real-world examples. The rest of the post will explain how it all works, and then show you how to build it yourself.

Machine Learning

… happens by example. Let’s consider a simpler problem: detecting somebody’s gender based on their name — a classification problem. We might decide that the last letter of a name is important with respect to its gender — that’s a feature. And if we had 1000 names for which we knew their gender — our labeled dataset — and only looked at the last letter, we’d realise that names that end in a, e, i are likely to be female, whereas k, o, s, and t predict male. Based on this insight, we could then build a simple equation that returns the probability that a name is female:

P(female | name) = 0.5 + 0.5 * (F_a + F_e + F_i) - 0.5 * (F_k + F_o + F_s + F_t)

Here F_a, F_e, … are indicator variables: each one equals 1 if the last letter of the name is that letter, and 0 otherwise. This function returns a 50% chance for Jordan, 100% for Sofie, and 0% for Chris.

Don’t get too scared by the formula! You don’t need it to follow this post, or build the software.
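If you prefer code to formulas, the same rule fits in a few lines of JavaScript. This is only a sketch of the toy formula above; the function and constant names are made up for illustration.

```javascript
// Toy last-letter gender score, following the formula above.
// The names used here are illustrative, not part of any library.
const FEMALE_LETTERS = new Set(["a", "e", "i"]);
const MALE_LETTERS = new Set(["k", "o", "s", "t"]);

function probabilityFemale(name) {
  const last = name.trim().toLowerCase().slice(-1);
  const femaleFeature = FEMALE_LETTERS.has(last) ? 1 : 0; // F_a + F_e + F_i
  const maleFeature = MALE_LETTERS.has(last) ? 1 : 0; // F_k + F_o + F_s + F_t
  return 0.5 + 0.5 * femaleFeature - 0.5 * maleFeature;
}

for (const name of ["Jordan", "Sofie", "Chris"]) {
  console.log(name, probabilityFemale(name)); // 0.5, 1 and 0 respectively
}
```

Only one indicator can be 1 at a time, because a name has exactly one last letter, so the result always stays between 0 and 1.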

The last letter of a word is a useful but simplistic feature. In industrial applications of NLP, such as our virtual secretary, you’d typically use hundreds of more complex features — such as matching word patterns like “how about”. Then, you’d use machine learning methods, such as logistic regression, to learn the equation that predicts the probability of whatever it is you want to predict — scheduling intent, in our case.
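To make “learning the equation” a bit more concrete, here is a minimal sketch of logistic regression trained with plain gradient descent. Everything in it is an assumption made for illustration: the three pattern features, the six toy emails, and the training settings are invented, and a real system would use far more of each.

```javascript
// Hypothetical pattern features: each is 1 if the pattern matches, else 0.
const PATTERNS = [
  /how about/i,
  /\b\d{1,2}\s?(am|pm)\b/i,
  /\b(tomorrow|today|monday|friday)\b/i,
];

function featurize(text) {
  return PATTERNS.map((pattern) => (pattern.test(text) ? 1 : 0));
}

function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// Probability of scheduling intent for a feature vector x.
function predict(model, x) {
  const z = x.reduce((sum, xi, i) => sum + model.weights[i] * xi, model.bias);
  return sigmoid(z);
}

// Stochastic gradient descent on the log loss.
function train(examples, epochs = 500, learningRate = 0.5) {
  const model = { weights: new Array(PATTERNS.length).fill(0), bias: 0 };
  for (let epoch = 0; epoch < epochs; epoch++) {
    for (const { text, label } of examples) {
      const x = featurize(text);
      const error = predict(model, x) - label; // gradient w.r.t. z
      model.weights = model.weights.map((w, i) => w - learningRate * error * x[i]);
      model.bias -= learningRate * error;
    }
  }
  return model;
}

// Invented toy data: label 1 means "scheduling intent".
const examples = [
  { text: "How about coffee tomorrow at 5pm?", label: 1 },
  { text: "How about lunch on Friday?", label: 1 },
  { text: "Can we meet today at 10 am?", label: 1 },
  { text: "Please find the report attached.", label: 0 },
  { text: "Thanks for the update.", label: 0 },
  { text: "The invoice was paid last week.", label: 0 },
];

const model = train(examples);
console.log(predict(model, featurize("how about drinks tomorrow at 7pm?")));
console.log(predict(model, featurize("see attached minutes")));
```

The first probability should come out well above 0.5 and the second well below it. The learned weights play the same role as the hand-picked 0.5 coefficients in the gender formula.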

Let’s build the Classifier

Fortunately, others have already implemented all this for you. Facebook recently published fastText, and we will use it to train a simple classifier. Here’s how you can build your state-of-the-art scheduling classifier in 3 easy steps:

#1 Install fastText

#2 Download our training set into the fastText/ directory

#3 Train the model and try it out:

Then enter: “how about coffee tomorrow at 5pm?” and hit enter.
You should see __label__Scheduling_Intent — congrats!
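If you want to extend the training set with your own examples, it helps to know the format fastText reads for supervised training: one example per line, with the label written as a `__label__` prefix in front of the text. The sketch below formats examples that way. The helper name and the `No_Intent` label are assumptions made for illustration; only `__label__Scheduling_Intent` appears in this post.

```javascript
// fastText's supervised mode reads one example per line:
//   __label__<LabelName> <text of the example>
// Training and prediction then use the fastText command line, roughly:
//   ./fasttext supervised -input <training file> -output <model name>
//   ./fasttext predict <model name>.bin -
// (the trailing "-" makes predict read sentences from standard input).

function toFastTextLine(label, text) {
  // An example must fit on one line, so collapse newlines and extra spaces.
  const singleLine = text.replace(/\s+/g, " ").trim();
  return `__label__${label} ${singleLine}`;
}

const lines = [
  toFastTextLine("Scheduling_Intent", "how about coffee tomorrow at 5pm?"),
  // "No_Intent" is a hypothetical label name for emails without intent.
  toFastTextLine("No_Intent", "please find the report\nattached"),
];

console.log(lines.join("\n"));
```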

Machine Learning in your Gmail!

We’ve built our own platform, Red Sift, for making these kinds of applications and many others. It’s a cloud-based data processing platform. You can implement apps on top of it to process your data into insights. You can connect it to any data source you want, and it connects especially well with Gmail. We are going to use it to create an app (a Sift!) that counts the number of emails with a scheduling intent in your inbox. We’ll do that in two steps: first you will get Red Sift running on your machine, connected to your email and counting your emails; then we will integrate it with fastText and count only the emails with a scheduling intent.

Red Sift + Email

#1 Install the Red Sift SDK

#2 Create a new project. IMPORTANT: make sure to select email-sift when creating the Sift.

#3 Now run it with:

This will start the Red Sift SDK, which creates a complete run-time environment, like our cloud service, so it can take a little while. After ~1 minute, it will open http://localhost:7438/ in your browser.

#4 Go to the ‘inputs’ tab by clicking on this button:

#5 Click on the “Google” button and authenticate your account. This will connect your new Sift with your Gmail account. Remember that you are running this code on your local machine, so there is nothing to worry about.

#6 Click on the “Download” button. You should see the number next to “gmailEmails” increase from 0, unless your inbox is empty.

#7 Next, we are going to run the Sift. Click on the Run button in this panel:

#8 Finally, click on this button at the top, to display your Sift’s UI Frontend:

If everything went well, you should see something like this:

If it doesn’t work, try clicking on this icon, which resets the Frontend’s browser-side database:

Red Sift + fastText

Now that you have a basic Sift running and connected to your email, let’s get it classifying your email. We are going to use the classifier we trained before for that.

Open classifier-sift/server/parse.js in your favourite editor. This node takes your emails as input and outputs key-value pairs to two object stores: messages and threadMessages. Both have the same value: an object containing an email’s word count. Let’s use our classifier to add a boolean that tells us whether an email has scheduling intent!

#1 Drag the fastText/ folder from a few steps ago into the classifier-sift/server/ folder.

#2 Add this function into parse.js, at the top of the file, after ‘use strict’:

#3 Finally, use that function to add a sched_intent boolean inside of variable “value”; change const wordsValue to look like this:
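As a rough guide to steps #2 and #3, here is a minimal sketch of a classifier helper that shells out to the fastText binary, plus a value object carrying the new sched_intent boolean. Treat all of it as assumption rather than the Sift’s actual code: the paths, the model file name, the function names and the exact shape of the value object are hypothetical. The only real dependencies are Node’s child_process module and the `fasttext predict <model> -` command.

```javascript
"use strict";
const { execFileSync } = require("child_process");

// Assumed locations, relative to the working directory; adjust as needed.
const FASTTEXT_BIN = "./fastText/fasttext";
const MODEL_FILE = "./fastText/model.bin"; // hypothetical model file name

// Runs `fasttext predict <model> -`, passing the text on standard input,
// and returns whatever fastText prints (one predicted label per line).
function runFastText(text) {
  return execFileSync(FASTTEXT_BIN, ["predict", MODEL_FILE, "-"], {
    input: text + "\n",
    encoding: "utf8",
  });
}

// True when the top predicted label is the scheduling label.
// `run` can be swapped out, which makes the function easy to test.
function hasSchedulingIntent(text, run = runFastText) {
  const singleLine = String(text || "").replace(/\s+/g, " ").trim();
  if (singleLine === "") return false; // nothing to classify
  const firstLabel = run(singleLine).trim().split(/\s+/)[0];
  return firstLabel === "__label__Scheduling_Intent";
}

// Hypothetical value object: the word count plus the new boolean.
function buildValue(emailBody, run = runFastText) {
  const words = emailBody.split(/\s+/).filter(Boolean).length;
  return { words, sched_intent: hasSchedulingIntent(emailBody, run) };
}
```

Spawning a process per email is simple but slow. For a large inbox you would probably want to batch several emails into one fastText call.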

Now our data stores contain enough information to tell whether each email has a scheduling intent or not! Let’s change our Sift to only count emails with a scheduling intent: open classifier-sift/server/count.js and change the emailStats variable so that it filters out values without a scheduling intent:
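The filtering itself can be as small as the sketch below. It assumes that each stored value is an object with a sched_intent boolean, as described above. The function name and the sample values are invented for illustration, and the real emailStats code in count.js may be shaped differently.

```javascript
// Counts only the values flagged as having a scheduling intent.
// `values` is assumed to be an array of objects written by parse.js.
function countSchedulingEmails(values) {
  return values.filter((value) => value && value.sched_intent === true).length;
}

// Invented sample values, for illustration only.
const stored = [
  { words: 12, sched_intent: true },
  { words: 40, sched_intent: false },
  { words: 7, sched_intent: true },
];

console.log(countSchedulingEmails(stored)); // 2
```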

Now re-run the computation from the SDK Dashboard; you should see your counts go down, perhaps even to zero. Send an email to yourself with “how about coffee tomorrow at 5pm?”, re-download your emails from the Input Tab, and re-run it. You should see the counter increase — congratulations on building your first Machine Learning Sift!

If you want some background information on programming with Red Sift, you can go to our online documentation and video tutorials at docs.redsift.com.

Stay tuned! In the next parts of the series, we will extract the event times from the email body, and build a Slackbot to send meeting suggestions to the user.