How I used machine learning to classify emails and turn them into insights (part 1).

Anthdm
Towards Data Science
6 min read · Apr 25, 2017

Today I wondered what would happen if I grabbed a bunch of unlabeled emails, put them all together in one black box and let a machine figure out what to do with them. Any idea what would happen? Neither did I.

The first thing I did was look for a dataset containing a good variety of emails. After looking into several datasets, I settled on the Enron corpus. This dataset contains over 500,000 emails generated by employees of the Enron Corporation, plenty if you ask me.

As the programming language I used Python, along with its great libraries: scikit-learn, pandas, NumPy and matplotlib.

Unsupervised machine learning

For clustering the unlabeled emails I used unsupervised machine learning. What, how? Yes, unsupervised, because my training data contains only inputs, also known as features, and no outcomes. In supervised machine learning we work with inputs and their known outcomes. In this case I wanted to cluster emails based on their message bodies, which is definitely an unsupervised machine learning task.
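To make the distinction concrete, here is a tiny sketch (a toy example of my own, not from the article): a supervised estimator needs inputs plus known outcomes, while an unsupervised one fits on the inputs alone.

from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X_toy = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.9]]
y_toy = [0, 1, 1, 0]  # known outcomes: only available in the supervised setting

LogisticRegression().fit(X_toy, y_toy)  # supervised: learns the input-outcome mapping
KMeans(n_clusters=2).fit(X_toy)         # unsupervised: finds structure in the inputs alone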

Loading in the data

Instead of loading in all 500k+ emails at once, I chunked the dataset into a couple of files with 10k emails each. Trust me, you don't want to load the full Enron dataset into memory and run complex computations on it.
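The splitting step itself isn't shown in this post; a minimal sketch using pandas' chunksize option could look like this (the input file name 'emails.csv' is my assumption, not from the article):

import pandas as pd

# Read the full dump in 10k-row chunks and write each chunk to its own file.
# Writing the default index along with each chunk is what later shows up
# as the extra index column when a split file is read back.
for i, chunk in enumerate(pd.read_csv('emails.csv', chunksize=10000)):
    chunk.to_csv('split_emails_%d.csv' % (i + 1))

With the chunks on disk, a single file can then be loaded at a time: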

import pandas as pd

emails = pd.read_csv('split_emails_1.csv')
print emails.shape # (10000, 3)

I now had 10k emails in the dataset, separated into 3 columns (index, message_id and the raw message). Before working with this data I parsed the raw messages into key-value pairs.

This is an example of a raw email message.

Message-ID: <30965995.1075863688265.JavaMail.evans@thyme>
Date: Thu, 31 Aug 2000 04:17:00 -0700 (PDT)
From: phillip.allen@enron.com
To: greg.piper@enron.com
Subject: Re: Hello
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Greg Piper
X-cc:
X-bcc:
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

Greg,

How about either next Tuesday or Thursday?

Phillip

To work with only the sender, receiver and email body, I wrote a function that extracts these fields into key-value pairs.

def parse_raw_message(raw_message):
    lines = raw_message.split('\n')
    email = {}
    message = ''
    keys_to_extract = ['from', 'to']
    for line in lines:
        if ':' not in line:
            message += line.strip()
            email['body'] = message
        else:
            pairs = line.split(':')
            key = pairs[0].lower()
            val = pairs[1].strip()
            if key in keys_to_extract:
                email[key] = val
    return email

# Small helper that collects one key across all parsed emails
# (not shown in the post, but needed to make the code run).
def map_to_list(emails, key):
    results = []
    for email in emails:
        if key not in email:
            results.append('')
        else:
            results.append(email[key])
    return results

def parse_into_emails(messages):
    emails = [parse_raw_message(message) for message in messages]
    return {
        'body': map_to_list(emails, 'body'),
        'to': map_to_list(emails, 'to'),
        'from_': map_to_list(emails, 'from')
    }

After running this function, I created a new dataframe that looks like this:

email_df = pd.DataFrame(parse_into_emails(emails.message))

index   body           from_             to
0       After some...  phillip.allen@..  tim.belden@..

To be 100% sure there are no rows with empty values left:

email_df.drop(email_df.query(
    "body == '' | to == '' | from_ == ''"
).index, inplace=True)

Analyzing text with TF-IDF

TF-IDF is short for term frequency–inverse document frequency, a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. I needed to feed the machine something it can understand; machines are bad with text, but they shine with numbers. That's why I converted the email bodies into a document-term matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(stop_words='english', max_df=0.50, min_df=2)
X = vect.fit_transform(email_df.body)
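As a quick sanity check (my own addition, not in the original post), you can inspect the shape of the matrix and a few of the learned terms:

# X is a sparse matrix: one row per email body, one column per vocabulary term.
print X.shape
print vect.get_feature_names()[:5]  # a small sample of the vocabulary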

I made a quick plot to visualize this matrix. To do this I first needed to make a 2d representation of the DTM (document-term matrix).

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X_dense = X.todense()
coords = PCA(n_components=2).fit_transform(X_dense)

plt.scatter(coords[:, 0], coords[:, 1], c='m')
plt.show()
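One caveat: calling .todense() on a big document-term matrix eats a lot of memory. A lighter sketch (my suggestion, not what this post does) is TruncatedSVD, which accepts the sparse matrix directly:

from sklearn.decomposition import TruncatedSVD

# Reduce the sparse DTM to 2 components without densifying it first.
coords = TruncatedSVD(n_components=2).fit_transform(X)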

That being done, I wanted to find out what the top keywords were in those emails. I wrote a function that does exactly that:

import numpy as np

def top_tfidf_feats(row, features, top_n=20):
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats, columns=['features', 'score'])
    return df

def top_feats_in_doc(X, features, row_id, top_n=25):
    row = np.squeeze(X[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)

After running this function on a document, it came up with the following result.

features = vect.get_feature_names()
print top_feats_in_doc(X, features, 1, 25)

features score
0 meetings 0.383128
1 trip 0.324351
2 ski 0.280451
3 business 0.276205
4 takes 0.204126
5 try 0.161225
6 presenter 0.158455
7 stimulate 0.155878
8 quiet 0.148051
9 speaks 0.148051
10 productive 0.145076
11 honest 0.140225
12 flying 0.139182
13 desired 0.133885
14 boat 0.130366
15 golf 0.126318
16 traveling 0.125302
17 jet 0.124813
18 suggestion 0.124336
19 holding 0.120896
20 opinions 0.116045
21 prepare 0.112680
22 suggest 0.111434
23 round 0.108736
24 formal 0.106745

It all makes sense if you look at the corresponding email.

Traveling to have a business meeting takes the fun out of the trip. Especially if you have to prepare a presentation. I would suggest holding the business plan meetings here then take a trip without any formal business meetings. I would even try and get some honest opinions on whether a trip is even desired or necessary.

As far as the business meetings, I think it would be more productive to try and stimulate discussions across the different groups about what is working and what is not. Too often the presenter speaks and the others are quiet just waiting for their turn. The meetings might be better if held in a round table discussion format.

My suggestion for where to go is Austin. Play golf and rent a ski boat and jet ski’s. Flying somewhere takes too much time.

The next step was writing a function to get the top terms out of all the emails.

def top_mean_feats(X, features,
                   grp_ids=None, min_tfidf=0.1, top_n=25):
    if grp_ids is not None:
        D = X[grp_ids].toarray()
    else:
        D = X.toarray()
    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)

Running it returned the top terms across all the emails.

print top_mean_feats(X, features, top_n=10)

features     score
0 enron 0.044036
1 com 0.033229
2 ect 0.027058
3 hou 0.017350
4 message 0.016722
5 original 0.014824
6 phillip 0.012118
7 image 0.009894
8 gas 0.009022
9 john 0.008551

What I had so far was interesting, but I wanted to see more and find out what else the machine could learn from this set of data.

Clustering with KMeans

KMeans is a popular clustering algorithm used in machine learning, where K stands for the number of clusters. I created a KMeans model with 3 clusters and a maximum of 100 iterations.

from sklearn.cluster import KMeans

n_clusters = 3
clf = KMeans(n_clusters=n_clusters, max_iter=100, init='k-means++', n_init=1)
labels = clf.fit_predict(X)
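Before looking at the clusters themselves, a quick check I'd add here (not in the original post) is how many emails landed in each cluster, since KMeans can produce very unbalanced groups:

# Count the emails assigned to each of the 3 clusters.
for label, count in zip(*np.unique(labels, return_counts=True)):
    print 'cluster %d: %d emails' % (label, count)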

After training, the model came up with the following 3 clusters.

Because I now knew which emails the machine assigned to each cluster, I was able to write a function that extracts the top terms per cluster.

def top_feats_per_cluster(X, y, features, min_tfidf=0.1, top_n=25):
    dfs = []
    labels = np.unique(y)
    for label in labels:
        ids = np.where(y == label)[0]
        feats_df = top_mean_feats(X, features, ids, min_tfidf=min_tfidf, top_n=top_n)
        feats_df.label = label
        dfs.append(feats_df)
    return dfs

Instead of printing out the terms, I found a great example of how to plot this with matplotlib. So I copied the function, made some adjustments and came up with this plot:
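The adjusted plotting function isn't reproduced here, but a minimal sketch of the idea (one horizontal bar chart of mean tf-idf scores per cluster; the function name and styling are my reconstruction, not the author's exact code) could look like this:

import matplotlib.pyplot as plt

def plot_tfidf_per_cluster(dfs):
    # One subplot per cluster, terms on the y-axis, mean tf-idf on the x-axis.
    fig, axes = plt.subplots(1, len(dfs), figsize=(12, 6))
    for ax, df in zip(axes, dfs):
        ax.barh(range(len(df)), df.score, align='center')
        ax.set_yticks(range(len(df)))
        ax.set_yticklabels(df.features)
        ax.invert_yaxis()  # highest-scoring term on top
        ax.set_title('cluster %d' % df.label)
        ax.set_xlabel('mean tf-idf score')
    plt.tight_layout()
    plt.show()

plot_tfidf_per_cluster(top_feats_per_cluster(X, labels, features, 0.1, 25))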

I immediately noticed that cluster 1 had weird terms like ‘hou’ and ‘ect’. To understand why these terms are so popular, I needed more insight into the whole dataset, which implied a different approach.

How I came up with that different approach, and the new and interesting insights I found, will be covered in part 2.

Code available on GitHub
