AI: Read My Email
Email is an integral part of our digital society. Scandals like Donald Jr’s collusion, Hillary’s server, and Podesta’s emails are a few examples from recent memory that shine a light on how little we think about the technology when pressing send. We just do it.
Let’s take a step back and look at emails as a dataset. Could we see how the tone of my emails is related to the number of emails sent/received per day? What trends emerge from the data? I will provide the results for my own emails, and give you the code to try for yourself.
Approach: Get raw text with dates using Python + Thunderbird
We want to analyze emails with AI, so we need access to a Gmail account’s messages. First, we export the emails to a local computer using Thunderbird, which stores each mail folder as a file in mbox format (alongside .msf index files). Next, a Python script extracts the email text, the sent/received date, and other information from those files. Then we use artificial intelligence to find the sentiment and other data for each of the emails. Finally, the data is stored in a SQLite database for analysis.
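To make the extraction step concrete, here is a minimal sketch of reading headers out of an mbox file with Python's standard mailbox module. The helper name read_headers, the example.com addresses, and the throwaway one-message mailbox are all made up for illustration; a real Thunderbird profile would simply be pointed at instead.

```python
import mailbox
import os
import tempfile
from email.message import Message


def read_headers(mbox_path):
    """Return (date, sender, recipient) for every message in an mbox file."""
    box = mailbox.mbox(mbox_path)
    try:
        return [(m["Date"], m["From"], m["To"]) for m in box]
    finally:
        box.close()


# Build a throwaway one-message mailbox so the sketch runs anywhere.
# The message below is invented sample data, not a real email.
demo_path = os.path.join(tempfile.mkdtemp(), "Inbox")
msg = Message()
msg["Date"] = "Mon, 03 Jul 2017 09:15:00 -0400"
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg.set_payload("Hello Bob, thanks for the update.")

box = mailbox.mbox(demo_path)
box.add(msg)
box.flush()
box.close()

headers = read_headers(demo_path)
print(headers)
```

The same three headers are what end up in the date, efrom, and eto columns of the database.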
All in all, the program ingested 19,528 emails. I could have pulled a lot more, but this seems like a big enough number to do some analysis. The data looks like this:
Results
First let’s look at my email traffic in general. In the chart below, we see my email usage each day of the week.
I try to catch up on work emails Saturday evening, so we see some emails there. My wife may feel differently, but it seems I don’t work as hard on Sundays as on regular Monday-to-Friday work days. On Sunday the kids are home, and productivity goes down.
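If you want to reproduce a per-weekday count without going through the database, one possible approach is to parse each Date header and tally by weekday. This is only a sketch: the function name emails_per_weekday and the sample headers are invented, and unparseable dates are simply skipped.

```python
from collections import Counter
from email.utils import parsedate_to_datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def emails_per_weekday(date_headers):
    """Count emails per weekday from raw RFC 2822 Date header strings."""
    counts = Counter()
    for raw in date_headers:
        try:
            sent = parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            # Missing or malformed Date header: leave it out of the tally.
            continue
        counts[DAY_NAMES[sent.weekday()]] += 1
    return counts


# Invented sample headers, including one that cannot be parsed.
sample_dates = [
    "Mon, 03 Jul 2017 09:15:00 -0400",
    "Sat, 08 Jul 2017 21:30:00 -0400",
    "Mon, 10 Jul 2017 08:00:00 -0400",
    "not a date",
]
weekday_counts = emails_per_weekday(sample_dates)
print(weekday_counts)
```

Parsing the date, rather than trusting the first three characters of the header, also covers emails whose Date header omits the day name.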
As we can see in the chart above, the sentiment of my emails is not correlated with the day of the week. Generally we see more neutral than anything. That makes sense. I’m an engineer, and there are lots of emails containing general work stuff. We see more positive sentiment than negative, because I love my job.
So if sentiment is not related to day of the week, how about length of the message?
The chart above indicates that negative sentiment is not really related to message length, but positive sentiment grows as message length increases. This rise in positive sentiment is accompanied by a drop in neutral sentiment.
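Grouping by exact character count gives one group per distinct length, so a chart like this is easier to read if lengths are binned first. Here is a rough sketch of that idea. The 500-character bucket size, the function name, and the sample rows are my own assumptions, not values from the analysis above.

```python
from collections import defaultdict


def average_by_length_bucket(rows, bucket_size=500):
    """Average the sentiment shares of emails grouped into length buckets.

    rows holds (charlen, positive, negative, neutral) tuples.
    Returns {bucket_start: (count, avg_positive, avg_negative, avg_neutral)}.
    """
    sums = defaultdict(lambda: [0, 0.0, 0.0, 0.0])
    for charlen, positive, negative, neutral in rows:
        bucket = (charlen // bucket_size) * bucket_size
        entry = sums[bucket]
        entry[0] += 1
        entry[1] += positive
        entry[2] += negative
        entry[3] += neutral
    return {
        bucket: (n, p / n, ng / n, nu / n)
        for bucket, (n, p, ng, nu) in sorted(sums.items())
    }


# Invented rows: (charlen, positive, negative, neutral).
sample_rows = [
    (120, 0.0, 0.0, 1.0),
    (300, 0.5, 0.0, 0.5),
    (900, 0.5, 0.25, 0.25),
]
buckets = average_by_length_bucket(sample_rows)
print(buckets)
```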
What about sentiment by recipient? Well, 62 emails to/from my wife’s Gmail address resolved to 51% positive, 4% negative and 45% neutral. That’s pretty good! However… 7 emails to/from her work account were 0% positive, 57% negative and 43% neutral. Amazing! The more tense email exchanges between my wife and me are via work email. The most negative emails were with Bluehost (which I left), a service provider that I still use, and my old accountant (whom I left). Nice! Makes some sense.
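One wrinkle with per-contact numbers is that the same person can show up as "Name <address>" in one email and as a bare address in another. A minimal sketch of normalizing senders before averaging is shown here. The function name and the sample rows are invented, and it only looks at the sender column, so the to/from totals above would need the recipient column handled the same way.

```python
from collections import defaultdict
from email.utils import parseaddr


def sentiment_by_address(rows):
    """Average sentiment shares per sender address.

    rows holds (efrom, positive, negative, neutral) tuples.
    Returns {address: (count, avg_positive, avg_negative, avg_neutral)}.
    """
    totals = defaultdict(lambda: [0, 0.0, 0.0, 0.0])
    for efrom, positive, negative, neutral in rows:
        # parseaddr splits "Name <addr>" into (name, addr); keep only addr.
        address = parseaddr(efrom or "")[1].lower()
        if not address:
            continue
        entry = totals[address]
        entry[0] += 1
        entry[1] += positive
        entry[2] += negative
        entry[3] += neutral
    return {a: (n, p / n, ng / n, nu / n) for a, (n, p, ng, nu) in totals.items()}


# Invented rows: (efrom, positive, negative, neutral).
sample_rows = [
    ("Alice Smith <alice@example.com>", 1.0, 0.0, 0.0),
    ("alice@example.com", 0.5, 0.0, 0.5),
    ("Billing <billing@example.com>", 0.0, 0.5, 0.5),
]
by_address = sentiment_by_address(sample_rows)
print(by_address)
```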
I had a look at which contacts I had the most positive exchanges with. Some made no sense (e.g. 80% positive for Google Apps <apps-noreply@google.com>), but others were good examples: a neighbour, a friend, a client, etc. My guess is that a few noreply senders scored so high simply because they ran their messages through a sentiment analyzer to make sure they scored 100% positive, and because every message was them writing to me, never me writing back.
Conclusions
It was fun to look at the sentiment of my emails and think about patterns in email data. We could have clustered emails by subject or recipient, or clustered them by content using a word embedding model on the body of each email. We could create webs of who talks to whom and how much. The sky is the limit.
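As a starting point for the "who talks to whom" idea, here is a small sketch that counts sender-to-recipient pairs, which is the edge list a graph library would need. Nothing like this was run for this post; the function name contact_edges and the sample pairs are hypothetical.

```python
from collections import Counter
from email.utils import getaddresses, parseaddr


def contact_edges(pairs):
    """Count (sender, recipient) address pairs from From/To header values."""
    edges = Counter()
    for efrom, eto in pairs:
        sender = parseaddr(efrom or "")[1].lower()
        # A To header can list several recipients separated by commas.
        for _name, recipient in getaddresses([eto or ""]):
            if sender and recipient:
                edges[(sender, recipient.lower())] += 1
    return edges


# Invented (From, To) header pairs.
sample_pairs = [
    ("Alice <alice@example.com>", "Bob <bob@example.com>, carol@example.com"),
    ("alice@example.com", "bob@example.com"),
    ("bob@example.com", None),
]
edges = contact_edges(sample_pairs)
print(edges.most_common())
```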
This example used a pretty low-quality sentiment analyzer. Google Cloud, TensorFlow, Keras, and IBM all have better options for “real” production systems. The goal here was to give you an implementation that developers can try with little effort. As promised, below is the code with a few helpful queries:
Queries:
SELECT count(*) as overall, substr(date,0,4) as day, SUM (positive), SUM (negative), SUM (neutral) FROM emailData GROUP BY day;
SELECT substr(date,0,4) as day, charlen, positive, negative, neutral from emailData order by day desc;
SELECT count(*) as overall, charlen, SUM (positive), SUM (negative), SUM (neutral) from emailData GROUP BY charlen order by charlen asc;
SELECT count(*) as overall, SUM (positive), SUM (negative), SUM (neutral), efrom from emailData GROUP BY efrom;
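To try the per-day query without building the full database first, here is a small sketch that runs it against an in-memory SQLite table with the same schema. The three rows are invented sample data. In SQLite, substr(date,0,4) returns the first three characters, which is the day name when the Date header starts with one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emailData (id INTEGER PRIMARY KEY, date TEXT, charlen INTEGER, "
    "efrom TEXT, eto TEXT, positive FLOAT, negative FLOAT, neutral FLOAT)"
)

# Invented rows: date, charlen, efrom, eto, positive, negative, neutral.
sample_rows = [
    ("Mon, 03 Jul 2017 09:15:00 -0400", 120, "alice@example.com", "bob@example.com", 0.5, 0.0, 0.5),
    ("Sat, 08 Jul 2017 21:30:00 -0400", 900, "bob@example.com", "alice@example.com", 0.0, 0.5, 0.5),
    ("Mon, 10 Jul 2017 08:00:00 -0400", 300, "alice@example.com", "bob@example.com", 0.0, 0.0, 1.0),
]
conn.executemany(
    "INSERT INTO emailData (date, charlen, efrom, eto, positive, negative, neutral) "
    "VALUES (?,?,?,?,?,?,?)",
    sample_rows,
)

per_day = conn.execute(
    "SELECT count(*) as overall, substr(date,0,4) as day, "
    "SUM (positive), SUM (negative), SUM (neutral) "
    "FROM emailData GROUP BY day"
).fetchall()
per_day.sort(key=lambda row: row[1])  # sort by day name for stable output
conn.close()

for row in per_day:
    print(row)
```

Dividing each SUM by the overall count gives the average share of positive, negative, and neutral sentences for that day.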
Code:
import glob2
import mailbox
import nltk
import os
import sqlite3
from textblob import TextBlob


def getbodyfromemail(msg):
    body = None
    # Walk through the parts of the email to find the text body.
    if msg.is_multipart():
        for part in msg.walk():
            # If part is multipart, walk through the subparts.
            if part.is_multipart():
                for subpart in part.walk():
                    if subpart.get_content_type() == 'text/plain':
                        # Get the subpart payload (i.e. the message body)
                        body = subpart.get_payload(decode=True)
            # Part isn't multipart so get the email body
            elif part.get_content_type() == 'text/plain':
                body = part.get_payload(decode=True)
    # If this isn't a multi-part message then get the payload (i.e. the message body)
    elif msg.get_content_type() == 'text/plain':
        body = msg.get_payload(decode=True)
    # No checking done to match the charset with the correct part.
    charsets = set()
    for c in msg.get_charsets():
        if c is not None:
            charsets.add(c)
    for charset in charsets:
        try:
            body = body.decode(charset)
            break  # Stop at the first charset that decodes cleanly.
        except (UnicodeDecodeError, AttributeError, LookupError):
            print("Hit a UnicodeDecodeError or AttributeError. Moving right along.")
    return body


conn = sqlite3.connect("emailData.db")
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS emailData (id INTEGER PRIMARY KEY, date TEXT, charlen INTEGER, efrom TEXT, eto TEXT, positive FLOAT, negative FLOAT, neutral FLOAT)')
conn.commit()
skipped = 0
mboxfiles = glob2.glob('C://Users//Daniel//AppData//Roaming//Thunderbird//Profiles//**//**')

for mboxfile in mboxfiles:
    # Open each candidate file once; anything that isn't a readable mbox is skipped.
    try:
        mbox = mailbox.mbox(mboxfile)
        count = len(mbox)
    except Exception:
        continue
    print("Opening", mboxfile, "and found", count, "emails")
    if count == 0 or "sqlite" in os.path.basename(mboxfile):
        continue
    for thisemail in mbox:
        body = getbodyfromemail(thisemail)
        if body is None:
            # No text/plain part was found, so there is nothing to score.
            skipped += 1
            continue
        if isinstance(body, bytes):
            # No declared charset worked; fall back to a lossy UTF-8 decode.
            body = body.decode("utf-8", errors="replace")
        body = " ".join(body.split())
        neutral = 0
        positive = 0
        negative = 0
        try:
            # nltk.sent_tokenize needs NLTK's Punkt tokenizer data to be downloaded once.
            for sentence in nltk.sent_tokenize(body):
                analysis = TextBlob(sentence)
                if analysis.sentiment.polarity > 0:
                    positive += 1
                elif analysis.sentiment.polarity == 0:
                    neutral += 1
                else:
                    negative += 1
            # Convert the sentence counts to fractions of the whole email.
            # An empty body gives total == 0 and is skipped by the except below.
            total = positive + negative + neutral
            positive = float(positive) / total
            negative = float(negative) / total
            neutral = float(neutral) / total
            emailData = [thisemail['Date'], len(body), thisemail['From'], thisemail['To'], positive, negative, neutral]
            c.execute('insert into emailData (date, charlen, efrom, eto, positive, negative, neutral) values(?,?,?,?,?,?,?)', emailData)
            conn.commit()
        except Exception:
            skipped += 1
            print("skipping an email, probably a Unicode thing")
conn.close()
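The per-sentence tally in the script can also be pulled out into a small function that is easy to test on its own. The sketch below is a simplification, not the code that produced the results above: it swaps nltk.sent_tokenize for a naive regex split and TextBlob for a toy word-list scorer (toy_polarity, with invented word lists) so that it runs with the standard library only. Any function that maps a sentence to a number can be passed in instead.

```python
import re


def sentiment_shares(text, polarity):
    """Return (positive, negative, neutral) fractions of sentences in text.

    polarity is any callable mapping a sentence to a number.
    Returns None when the text contains no sentences.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return None
    positive = negative = neutral = 0
    for sentence in sentences:
        score = polarity(sentence)
        if score > 0:
            positive += 1
        elif score == 0:
            neutral += 1
        else:
            negative += 1
    total = len(sentences)
    return positive / total, negative / total, neutral / total


# Hypothetical stand-in scorer: positive words minus negative words.
POSITIVE_WORDS = {"great", "thanks", "love"}
NEGATIVE_WORDS = {"problem", "late", "broken"}


def toy_polarity(sentence):
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    return len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)


shares = sentiment_shares(
    "Thanks for the great demo. The invoice is late. See you Monday.",
    toy_polarity,
)
print(shares)
```

Returning None for empty text makes the "nothing to score" case explicit instead of relying on a division-by-zero error to skip the email.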
That’s it. That’s all folks. If you like this post, then please recommend it, share it, or give it some love (❤).
Happy Coding!
-Daniel
daniel@lemay.ai ← Say hi.
Lemay.ai
1(855)LEMAY-AI