AI: Read My Email

Email is an integral part of our digital society. Scandals like Donald Jr’s collusion, Hillary’s server, and Podesta’s emails are a few examples from recent memory that shine a light on how little we think about the technology when pressing send. We just do it.

Let’s take a step back and look at emails as a dataset. Could we see how the tone of my emails is related to the number of emails sent/received per day? What trends emerge from the data? I will provide the results for my own emails, and give you the code to try for yourself.

Approach: Get raw text with dates using Python + Thunderbird

We want to analyze emails with AI, so we need access to a Gmail account’s messages. First, we pull the emails onto a local computer with Thunderbird, which stores each mail folder as a plain-text mbox file in its profile directory. Next, a Python script extracts the message text, the sent/received date, and other header fields from those files. Then we use artificial intelligence to find the sentiment and other data for each email. Finally, the results are stored in a SQLite database for analysis.
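To make the extraction step concrete, here is a minimal sketch of pulling the date, sender, and plain-text body out of a single raw message using only Python’s standard library (the sample message below is made up for illustration; a real mbox file holds many of these back to back):

```python
import email

# A made-up raw message, standing in for one entry of a Thunderbird mbox file.
raw = """From: alice@example.com
To: bob@example.com
Date: Mon, 12 Mar 2018 09:30:00 -0400
Subject: Hello

Thanks for the update. See you Monday.
"""

msg = email.message_from_string(raw)

# Headers come out by name; for non-multipart mail the payload is the body.
sent_date = msg['Date']
sender = msg['From']
body = msg.get_payload()

print(sent_date)  # Mon, 12 Mar 2018 09:30:00 -0400
print(body.strip())
```

The full script at the end of this post does the same thing, plus the extra MIME walking needed for multipart messages.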

All in all, the program ingested 19,528 emails. I could have pulled a lot more, but this seems like a big enough number to do some analysis. The data looks like this:

The data we get is: Date, To, From, Length in Characters, Positive Sentiment, Negative Sentiment, and Neutral Sentiment. For each email, every sentence is classified as positive, negative, or neutral, and the message’s three scores are the fraction of sentences falling in each class. This screenshot was taken during data processing.
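The per-message scoring works like this sketch. The tiny keyword-based polarity() below is only a stand-in for TextBlob’s sentiment.polarity (used in the real script); the aggregation logic is the part that matters:

```python
# Stand-in for TextBlob's sentiment.polarity, for illustration only.
def polarity(sentence):
    if any(w in sentence.lower() for w in ("great", "thanks", "love")):
        return 1.0
    if any(w in sentence.lower() for w in ("problem", "broken", "angry")):
        return -1.0
    return 0.0

def score_message(sentences):
    # Classify each sentence, then report the fraction in each class.
    positive = sum(1 for s in sentences if polarity(s) > 0)
    negative = sum(1 for s in sentences if polarity(s) < 0)
    neutral = len(sentences) - positive - negative
    total = len(sentences)
    return (positive / total, negative / total, neutral / total)

scores = score_message([
    "Thanks for the great demo.",
    "The build is broken again.",
    "Meeting moved to 3pm.",
    "See attached.",
])
print(scores)  # (0.25, 0.25, 0.5)
```

The three fractions always sum to 1, which makes them easy to compare across messages of very different lengths.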

Results

First let’s look at my email traffic in general. In the chart below, we see my email usage each day of the week.

Email traffic grouped by day of the week.

I try to catch up on work emails on Saturday evenings, so we see some traffic there. My wife may feel differently, but it seems I don’t work as hard on Sundays as on regular Monday to Friday workdays. On Sunday the kids are home, and productivity goes down.
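The day-of-week rollup behind this chart can be sketched against a throwaway in-memory table. The Date values below are made up but mimic real RFC 2822 headers, whose first token is the weekday abbreviation:

```python
import sqlite3

# Throwaway in-memory table shaped like the real emailData table.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE emailData (date TEXT, positive FLOAT, negative FLOAT, neutral FLOAT)")
c.executemany("INSERT INTO emailData VALUES (?,?,?,?)", [
    ("Mon, 12 Mar 2018 09:00:00 -0400", 0.5, 0.0, 0.5),
    ("Mon, 19 Mar 2018 10:00:00 -0400", 0.0, 0.5, 0.5),
    ("Sat, 17 Mar 2018 20:00:00 -0400", 1.0, 0.0, 0.0),
])
# substr(date, 1, 3) grabs the 'Mon'/'Sat' token off the front of the Date header.
rows = c.execute(
    "SELECT substr(date, 1, 3) AS day, COUNT(*), SUM(positive), SUM(negative), SUM(neutral) "
    "FROM emailData GROUP BY day"
).fetchall()
print(rows)
conn.close()
```

Grouping on the raw header text this way is crude but works because every Date header starts with the same three-letter weekday token.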

Sentiment / day

As we can see in the chart above, the sentiment of my emails is not correlated with the day of the week. Generally we see more neutral sentiment than anything else. That makes sense: I’m an engineer, and lots of my emails contain general work stuff. We see more positive sentiment than negative, because I love my job.

So if sentiment is not related to day of the week, how about length of the message?

Sentiment according to message length. Results are grouped into bins for clarity.

The chart above indicates that negative sentiment is not really related to message length, but positive sentiment grows as message length increases. This rise in positive sentiment is accompanied by a drop in neutral sentiment.
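The binning behind this chart can be sketched in SQL. The query in the appendix groups by exact character count; a rounded bin (integer division by a bin width, here 500 characters as an assumed example) gives the cleaner grouping shown in the chart:

```python
import sqlite3

# Throwaway in-memory table with made-up (length, sentiment) rows.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE emailData (charlen INTEGER, positive FLOAT, negative FLOAT, neutral FLOAT)")
c.executemany("INSERT INTO emailData VALUES (?,?,?,?)", [
    (120, 0.0, 0.0, 1.0),
    (480, 0.5, 0.0, 0.5),
    (1300, 0.8, 0.1, 0.1),
])
# (charlen / 500) * 500 uses SQLite's integer division to snap each
# length onto the start of its 500-character bin.
rows = c.execute(
    "SELECT (charlen / 500) * 500 AS bin, COUNT(*), AVG(positive), AVG(negative), AVG(neutral) "
    "FROM emailData GROUP BY bin ORDER BY bin"
).fetchall()
print(rows)
conn.close()
```

Here the two short messages land in bin 0 and the long one in bin 1000, and AVG gives each message equal weight within its bin.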

What about sentiment by recipient? Well, 62 emails to/from my wife’s Gmail address resolved to 51% positive, 4% negative, and 45% neutral. That’s pretty good! However… 7 emails to/from her work account were 0% positive, 57% negative, and 43% neutral. Amazing! The more tense email exchanges between my wife and me happen over work email. The most negative emails were with Bluehost (which I left), a service provider I still use, and my old accountant (whom I also left). Nice! Makes some sense.

Sentiment by contact was obtained by analyzing all the emails we exchanged in the dataset.
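A per-contact rollup can be sketched the same way. The query in the appendix SUMs the per-message fractions; AVG (shown here, as a variant) gives each message equal weight, which reads directly as the average share of positive sentences per email. The addresses and scores below are made up:

```python
import sqlite3

# Throwaway in-memory table with made-up per-contact sentiment rows.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE emailData (efrom TEXT, positive FLOAT, negative FLOAT, neutral FLOAT)")
c.executemany("INSERT INTO emailData VALUES (?,?,?,?)", [
    ("wife@example.com", 0.6, 0.0, 0.4),
    ("wife@example.com", 0.4, 0.1, 0.5),
    ("host@example.com", 0.0, 0.8, 0.2),
])
rows = c.execute(
    "SELECT efrom, COUNT(*), AVG(positive), AVG(negative), AVG(neutral) "
    "FROM emailData GROUP BY efrom ORDER BY efrom"
).fetchall()
print(rows)
conn.close()
```

With only a handful of emails per contact (like the 7 work-account messages above), these averages are noisy, so small samples deserve a grain of salt.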

I had a look at which contacts I had the most positive exchanges with. Some made no sense (e.g. 80% positive for Google Apps <apps-noreply@google.com>), but others were good examples: a neighbour, a friend, a client, etc. I suspect a few noreply senders scored so high simply because they run their own sentiment analysis to make sure every message reads 100% positive, and each message was them writing to me, never me writing back.

Conclusions

It was fun to see the sentiment of emails over time and to think about patterns in email data. We could have clustered emails by subject or recipient, or used a word embedding model to cluster message bodies by content. We could build webs of who talks to whom, and how much. The sky is the limit.
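The "web of who talks to whom" idea reduces to counting exchanges per pair of addresses. A sketch with made-up addresses:

```python
from collections import Counter

# Made-up (sender, recipient) pairs standing in for real From/To headers.
messages = [
    ("me@example.com", "alice@example.com"),
    ("alice@example.com", "me@example.com"),
    ("me@example.com", "alice@example.com"),
    ("bob@example.com", "me@example.com"),
]

# Sort each pair so A->B and B->A count as the same edge of the graph.
edges = Counter(tuple(sorted(pair)) for pair in messages)
print(edges.most_common(1))  # [(('alice@example.com', 'me@example.com'), 3)]
```

Feeding these edge counts into a graph library would give the conversation web; the counting itself is this simple.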

This example used a pretty low-quality sentiment analyzer. Google Cloud, TensorFlow, Keras, and IBM all offer better options for “real” production systems. The goal here was an implementation that developers can try with little effort. As promised, below is the code with a few helpful queries:

Queries:

-- Emails and summed sentiment per day of the week
-- (substr pulls the weekday token, e.g. 'Mon', off the front of the Date header)
SELECT count(*) as overall, substr(date,0,4) as day, SUM(positive), SUM(negative), SUM(neutral) FROM emailData GROUP BY day;

-- Per-email length and sentiment, listed by weekday token
SELECT substr(date,0,4) as day, charlen, positive, negative, neutral FROM emailData ORDER BY day DESC;

-- Sentiment by message length
SELECT count(*) as overall, charlen, SUM(positive), SUM(negative), SUM(neutral) FROM emailData GROUP BY charlen ORDER BY charlen ASC;

-- Sentiment by sender
SELECT count(*) as overall, SUM(positive), SUM(negative), SUM(neutral), efrom FROM emailData GROUP BY efrom;

Code:

import glob2
import mailbox
import nltk
import os
import sqlite3
from textblob import TextBlob

def getbodyfromemail(msg):
    """Walk the MIME parts of a message and return the plain-text body."""
    body = None
    if msg.is_multipart():
        for part in msg.walk():
            # If this part is itself multipart, walk through its subparts.
            if part.is_multipart():
                for subpart in part.walk():
                    if subpart.get_content_type() == 'text/plain':
                        # Get the subpart payload (i.e. the message body).
                        body = subpart.get_payload(decode=True)
            # Part isn't multipart, so take the body directly.
            elif part.get_content_type() == 'text/plain':
                body = part.get_payload(decode=True)
    # Not a multipart message, so the payload is the body.
    elif msg.get_content_type() == 'text/plain':
        body = msg.get_payload(decode=True)

    # Try each declared charset until one decodes cleanly.
    # (No attempt is made to match a charset to its specific part.)
    charsets = set()
    for c in msg.get_charsets():
        if c is not None:
            charsets.add(c)
    for charset in charsets:
        try:
            body = body.decode(charset)
            break
        except (UnicodeDecodeError, AttributeError):
            print("Hit a UnicodeDecodeError or AttributeError. Moving right along.")
    return body

conn = sqlite3.connect("emailData.db")
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS emailData (id INTEGER PRIMARY KEY, date TEXT, '
          'charlen INTEGER, efrom TEXT, eto TEXT, positive FLOAT, negative FLOAT, neutral FLOAT)')
conn.commit()

skipped = 0
mboxfiles = glob2.glob('C://Users//Daniel//AppData//Roaming//Thunderbird//Profiles//**//**')
for mboxfile in mboxfiles:
    try:
        mbox = mailbox.mbox(mboxfile)
        print("Opening", mboxfile, "and found", len(mbox), "emails")
    except Exception:
        continue
    if len(mbox) == 0 or "sqlite" in os.path.basename(mboxfile):
        continue
    for thisemail in mbox:
        body = getbodyfromemail(thisemail)
        body = " ".join(str(body).split())
        positive = 0
        negative = 0
        neutral = 0
        try:
            # Classify each sentence by polarity, then store the
            # per-message fractions of positive/negative/neutral sentences.
            for sentence in nltk.sent_tokenize(body):
                analysis = TextBlob(sentence)
                if analysis.sentiment.polarity > 0:
                    positive += 1
                elif analysis.sentiment.polarity == 0:
                    neutral += 1
                else:
                    negative += 1
            total = positive + negative + neutral

            positive = float(positive) / total
            negative = float(negative) / total
            neutral = float(neutral) / total

            emailData = [thisemail['Date'], len(body), thisemail['From'],
                         thisemail['To'], positive, negative, neutral]
            c.execute('INSERT INTO emailData (date, charlen, efrom, eto, positive, negative, neutral) '
                      'VALUES (?,?,?,?,?,?,?)', emailData)
            conn.commit()
        except Exception:
            skipped += 1
            print("skipping an email, probably a Unicode thing")
conn.close()

That’s it. That’s all folks. If you like this post, then please recommend it, share it, or give it some love (❤).

Happy Coding!

-Daniel
daniel@lsci.io
LemaySolutions.com