NLP with LDA: Analyzing Topics in the Enron Email dataset


It has been over 18 years since the Enron collapse.

Yet, the now popular email dataset, made public by the Federal Energy Regulatory Commission (FERC), is one good thing that keeps on giving.

It has been a great resource for many data analytics and machine learning explorations, particularly in the domain of Natural Language Processing.

In today’s post, I’ll be exploring the key topics discussed in these emails and showing you how you can do that in a given corpus.

Are you curious about what we may find?

First, it’s worth noting that Topic Modeling is one of many techniques for finding patterns across textual documents. Matthew Kirschenbaum’s Distant Reading discusses this in detail.

According to the MALLET library documentation:

A topic modeling tool takes a single text (or corpus) and looks for patterns in the use of words … A topic to the computer is a list of words that occur in statistically meaningful ways — Getting Started with Topic Modeling and MALLET: Shawn Graham, Scott Weingart, and Ian Milligan

Getting and Understanding the Dataset

If you google the Enron dataset, you'll see there are many sources for this data. The top result will likely point you to this site.

You can get your hands on the actual data here. Overall, it is a fairly large dataset containing roughly 500,000 emails tagged to 150 people.

Below you can see there’s a folder for each person containing the usual email folders.

If we expand one of these folders we can see a list of email docs and a selected sample as shown below.

A couple of things jump out.

For one, we need to make some key decisions up front.

We need to decide on what folder(s) we want to work with. We also need to figure out how we go about cleaning up the email to focus on the body of the email.

It will be a worthwhile exercise, but for those who would rather skip that step, we thankfully have a different version of the dataset that has been cleaned in this way.

This Kaggle dataset accomplishes this for us and stores the final collection in CSV format. You will likely need a Kaggle account to download this dataset.

This is great, but keep in mind that it is still a huge dataset. You might consider slicing this file into multiple smaller datasets to manage how much computation you are doing in memory, especially in the early exploration phase.

Antony Dm took this approach. You can check his post or GitHub page to learn how you can get a smaller slice of this dataset.
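
If you want a quick way to do the splitting yourself, here is a minimal sketch using pandas' chunked reading. This is not Antony's exact approach, and the file names and chunk size are just illustrative.

import pandas as pd

# Split the full CSV into smaller files of 50,000 emails each
# (file names and chunk size are illustrative)
for i, chunk in enumerate(pd.read_csv('emails.csv', chunksize=50000)):
    chunk.to_csv(f'emails_part_{i}.csv', index=False)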

Let’s Get Started

I used Google's JupyterLab to do this analysis, as you may see from the interface. I might cover how to set up JupyterLab in a separate post, but I bet there's a great tutorial out there that does a solid job at that.

The first thing we want to do is get a peek into the dataset.

Pandas is a great python tool to do this. I import it and read in my emails.csv file. I don’t want the whole dataset so I grab a small slice to start (first 10,000 emails).

import pandas as pd

# Load the Kaggle CSV and keep only the first 10,000 emails for exploration
emails = pd.read_csv('emails.csv')
email_subset = emails[:10000]
print(email_subset.shape)
print(email_subset.head())

I also want to see the structure (shape) and a preview (head()) of my dataset. Here’s what we see.

So we have 10,000 records with two columns. The first column tells us where the email was pulled from, and the second contains the actual message.

However, as we can see from the preview, the message portion needs to be cleaned up. How you go about this depends on what you are looking to analyze.

Antony Dm once again comes to the rescue. He has some great helper code to get you going if you are not inclined to try this on your own.

def parse_raw_message(raw_message):
    lines = raw_message.split('\n')
    email = {}
    message = ''
    keys_to_extract = ['from', 'to']
    for line in lines:
        if ':' not in line:
            message += line.strip()
            email['body'] = message
        else:
            pairs = line.split(':')
            key = pairs[0].lower()
            val = pairs[1].strip()
            if key in keys_to_extract:
                email[key] = val
    return email

As you can see, the code above extracts some key portions of each message: the 'from', the 'to', and the email 'body'.

def parse_into_emails(messages):
    emails = [parse_raw_message(message) for message in messages]
    return {
        'body': map_to_list(emails, 'body'),
        'to': map_to_list(emails, 'to'),
        'from_': map_to_list(emails, 'from')
    }

He then runs through all the messages collecting these together into a single record. Below is the map_to_list method to wrap this up.

def map_to_list(emails, key):
    results = []
    for email in emails:
        if key not in email:
            results.append('')
        else:
            results.append(email[key])
    return results

If we run our email subset through this and preview the result, we see it does a good job of isolating the message body. Note that this is not perfect, so you'll need to assess whether a different way of extracting the message body better suits your needs.

email_df = pd.DataFrame(parse_into_emails(email_subset.message))
print(email_df.head())

LDA and Topic Modeling

There are many ways to explore the topics of these emails.

Latent Dirichlet Allocation (LDA) is one way to do this. LDA is a bag-of-words algorithm that helps us to automatically discover topics that are contained within a set of documents.

To make this clearer, let’s craft an example borrowing from this introduction to LDA. Here are a few sentences:

  • I wish I had a Lexus to drive to the game next week
  • My trusty Honda needs an oil change
  • Becky needs the Honda to go pick up the desk
  • Jimmy fell off his chair laughing
  • John put the couch and desk up for sale on eBay

If you were forced to pick two topics (topic A and topic B), you could safely break down the topic representation across the sentences as:

Sentence 1 & 2: 100% about topic A

Sentence 4 & 5: 100% about topic B

Sentence 3: 50% topic A, 50% topic B

Going further, the word distribution across topics for all sentences breaks down to

Topic A: 25% Honda, 15% Lexus, 15% drive, 15% oil change …

Topic B: 30% Desk, 20% chair, 20% couch …

It's not far-fetched to say that topic A relates to vehicles and topic B to furniture. This is what LDA can do for us.

In a nutshell, when analyzing a corpus, the output of LDA is a mix of topics that consist of words with given probabilities across multiple documents.
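
To make that concrete, here is a tiny sketch that runs Gensim's LDA on hand-tokenized versions of the five example sentences above. With so little data the topics are only illustrative, and your output will vary.

from gensim import corpora
from gensim.models import LdaModel

# Hand-tokenized versions of the five example sentences
docs = [
    ['lexus', 'drive', 'game'],
    ['honda', 'oil', 'change'],
    ['honda', 'pick', 'desk'],
    ['chair', 'laugh'],
    ['couch', 'desk', 'sale', 'ebay'],
]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# Ask for two topics; passes is high because the corpus is tiny
toy_lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, random_state=1, passes=50)
print(toy_lda.print_topics())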

Under the Dirichlet Hood — What Makes It Tick?

If you care more about exploring this problem and less about how this algorithm works, you can skip to the next section.

Here, I want to briefly touch on what drives this algorithm without getting deep into the Math. Wrestling with this can help build some intuition on what context LDA is best suited for.

Here we go.

A document is a mixture of different topics, and each topic is itself an expression of words, each tagged with a given probability of occurrence. Said differently, we have topic representations across all the documents and word distributions across all the topics.

At the heart of LDA is this concept of a Dirichlet distribution. Very much like a Normal distribution, it is a probability distribution.

One key difference is that, in this case, instead of being defined over a space of real numbers, it is defined over a probability simplex: a set of non-negative numbers that add up to 1.
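
If you want to see what samples from a Dirichlet distribution look like, here is a quick sketch with NumPy:

import numpy as np

# Each draw from a 3-dimensional Dirichlet is a point on the probability simplex
samples = np.random.dirichlet(alpha=[0.5, 0.5, 0.5], size=3)
print(samples)
print(samples.sum(axis=1))  # every row sums to 1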

The interesting thing about the probabilities generated by LDA for the words is that a word can be reflected in multiple topics.

Makes sense?

Lexus can be in both a luxurious brands topic and a vehicle topic at the same time.

Unlike K-Means, where belonging to one cluster disqualifies you from membership in another cluster, LDA is more flexible.

This opens up the possibilities for the various applications of LDA beyond topic modeling to even recommender systems. You’re not locked to recommending from just one cluster.

If you are interested in reading further and diving into the Math and process check out these resources.

http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

https://medium.com/@soorajsubrahmannian/extracting-hidden-topics-in-a-corpus-55b2214fc17d

Pressing Forward

The companion code for the rest of this task is several lines long and it is not practical to walk through it line by line.

However, you can check out this GitHub repo for the Python Notebook. I’ll highlight key snippets along the way.

Getting this done will require a suite of NLP libraries. A few of these include Gensim, Mallet, Spacy, and NLTK.

Rather than requiring us to code our own LDA algorithm from scratch, Gensim and Mallet provide ready-made APIs. Having two libraries also gives us a way to compare the performance of two different implementations.

Spacy and NLTK help us manage the intricate aspects of language such as figuring out which pieces of the text constitute signal vs noise in our analysis.

Lastly, libraries like Matplotlib and PyLDAvis allow us to visualize the output of our analysis.
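
For reference, here is roughly the set of imports the rest of this walkthrough leans on. Exact module names can shift between library versions; for example, newer pyLDAvis releases expose pyLDAvis.gensim_models instead of pyLDAvis.gensim, and gensim 4.x dropped the Mallet wrapper used later.

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import nltk
import pyLDAvis
import pyLDAvis.gensim  # pyLDAvis.gensim_models in newer versions
import matplotlib.pyplot as plt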

More on this in a bit.

Zooming In: Email Body and Its Parts

Our first step now that we have the body of the emails is to collect them together. We do this by creating a list.

# Convert email body to list
data = email_df.body.values.tolist()

Then we break down each sentence into a list of words. As you can see below, we can take a peek at the words in the fourth email body.

# tokenize - break down each sentence into a list of words
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True removes punctuation

data_words = list(sent_to_words(data))
print(data_words[3])

This helps us start with the fundamental ingredients needed for our analysis — tokens.

N-Grams: Tokens That Flock Together

Now, we are starting to get down to the basics.

Our knowledge of language gives us the intuition that when two or three words consistently go together (side by side), there is a richer meaning conveyed than with each individual word.

The word grouping traffic light carries a slightly different meaning than if the words were inspected independently as traffic and light.

To extract this level of information, we create what are called bigrams and trigrams. The former is a grouping of two adjacent words; the latter, three adjacent words.

Gensim gives us some nice APIs to manage this. As you can see below we can inspect the output of applying bigrams and trigrams.

from gensim.models.phrases import Phrases, Phraser
# Build the bigram and trigram models
bigram = Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = Phrases(bigram[data_words], threshold=100)
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = Phraser(bigram)
trigram_mod = Phraser(trigram)
# See trigram example
print(trigram_mod[bigram_mod[data_words[200]]])

According to the documentation, these APIs allow us to:

Analyze a sentence, detecting any bigrams [or trigrams] that should be concatenated.

phillip_allen and data_while_amon are adjacent words that have now been concatenated.

You can read more about this here as well as here.

Lemmatization — Getting to the Base Word

Next up, we would like to amplify the signal coming from each token — word(s).

To really grasp this, let us consider these three words

  • Driving
  • Drives
  • Drove

They are all communicating a similar idea and can all be boiled down to a base word, which in this case could be drive. This base word is called the lemma.

Spacy gives us a way to do this with the flexibility to restrict this to different parts of speech — Nouns, Adjectives, Verbs, and Adverbs.
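
As a quick illustration, here is a hedged sketch that assumes you have downloaded spacy's small English model (e.g. via python -m spacy download en_core_web_sm); the lemmatization helper below reuses the same nlp object.

import spacy

# Load spacy's small English model; we disable the parser and NER
# since we only need POS tags and lemmas here
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

doc = nlp("He was driving home. She drives a Honda. They drove to the game.")
print([token.lemma_ for token in doc if token.pos_ == 'VERB'])
# expected to include 'drive' for driving / drives / drove (exact output may vary by model version)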

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    # uses the spacy nlp object loaded above
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

While we are at it, we might as well get rid of stop words.

Stop words are words that appear so frequently in a given language that they don’t pack a lot of meaning in a given text.

Words such as is, the, that, and which.

This is not something you ought to do in all NLP tasks, but our intuition is to remove them in this case as they are not likely to add a lot more meaning to a given email body.

Remember, we are looking to boost the signal.
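
The helper below assumes a stop_words list has already been defined. One common way to build it, using NLTK's stop word list (this exact setup is an assumption, not necessarily what the original notebook does):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop word corpus
stop_words = stopwords.words('english')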

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

Now we can put this all together
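
One note: the snippet below calls a make_bigrams helper that isn't defined in this post. A minimal version, assuming it simply applies the bigram_mod built earlier, would be:

def make_bigrams(texts):
    # apply the bigram model built earlier to each tokenized document
    return [bigram_mod[doc] for doc in texts]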

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)
# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)
# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

Creating a Dictionary and Corpus

We are almost ready to pass the body of our emails into our models.

A quick recap first. Here is what we’ve done so far:

  • Collect the body of emails into a list
  • Tokenize the text into separate words
  • Group bigrams and trigrams
  • Consolidate similar terms using lemmatization
  • Remove stop words to minimize the noise

Most of the above tasks are pretty common cleaning and processing steps in NLP. So, what next?

We need to keep in mind that our algorithms at their core operate on mathematical calculations. Processing our text then requires us to represent the tokens and text numerically.

One common way to manage this is by creating an index to word mapping. This allows us to use a look-up table of sorts to map tokens to numbers/index.

We can use the Gensim corpora dictionary API to accomplish this.

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

The output of this is a dictionary of sorts.
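
You can poke at the mapping to see what it holds; the exact ids and tokens will depend on your slice of the data.

# Look up a token by id and an id by token (results depend on your data)
print(id2word[0])
print(id2word.token2id.get('energy'))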

Next, we need to apply this dictionary, factoring in term frequency, to all our text to create a corpus.

This way as we process a token we get a sense of its weight in the document.

The resulting corpus will be passed in as input when creating our LDA model. The doc2bow API converts each text into a bag-of-words format.

Each time we now see a token in a text, information on its frequency is paired with it. A word/token like contract could then be represented as (6, 3) -> (token_id, token_count).

# Create Corpus
texts = data_lemmatized
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
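
As a quick sanity check, you can map the ids in one document back to tokens (the output depends on your data):

# Resolve ids back to tokens for the first document's bag-of-words
print([(id2word[token_id], count) for token_id, count in corpus[0]])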

Remember, LDA is built on the Dirichlet distribution, a probability distribution over proportions such as these word frequencies.

Representing our corpus in this way sets the stage for creating our LDA models.

First up, GenSim LDA model.

Topic Modeling — Gensim LDA Model

The Gensim package now gives us a way to create a model.

You can read up on Gensim’s documentation to dig deeper into the algorithm. Here, we’ll focus on creating the model.

We build the model as such:

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

Per the documentation, here’s what you need to know

  • corpus: Stream of document vectors or sparse matrix of shape (num_terms, num_documents)
  • id2word: Mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.
  • num_topics: The number of requested latent topics to be extracted from the training corpus.
  • random_state: Either a randomState object or a seed to generate one. Useful for reproducibility.
  • update_every: Number of documents to be iterated through for each update. Set to 0 for batch learning, > 1 for online iterative learning.
  • chunksize: Number of documents to be used in each training chunk.
  • passes: Number of passes through the corpus during training.
  • alpha: 'auto' learns an asymmetric prior from the corpus.
  • per_word_topics: If True, the model also computes a list of topics, sorted in descending order of most likely topics for each word, along with their phi values multiplied by the feature length (i.e. word count).

Let’s see what keywords look like in our generated topics.

print(lda_model.print_topics())
A snippet of topics with tokens

The number each token is multiplied by is its weight. These values reflect how important a token is within that topic.
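
If you want to pull out just one topic's keywords and weights programmatically, something like this works (the topic numbering will depend on your run):

# Top 10 keywords and their weights for a single topic
print(lda_model.show_topic(0, topn=10))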

What would you guess the topic below is about?

[pgn, ferc, governor, implementation, vote, authority, consider, offer, burrito, believe]

Something related to a governing body. Burrito definitely throws that off.

What about?

[font, size, align_right, align_left, nbsp_nbsp, tr, face_verdana, sans_serif, width, king]

Hypertext?

How about we visualize the output?

Visualizing our model using PyLDAvis
# Visualize the topics
pyLDAvis.enable_notebook(sort=True)
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
pyLDAvis.display(vis)

A few observations:

  1. The size of the bubbles tells us how dominant a topic is across all the documents (our corpus)
  2. The words on the right are the keywords driving that topic
  3. The closer the bubbles, the more similar the topics; the farther apart they are, the less similar

Ideally, we want non-overlapping bubbles spread as much as possible across the chart.
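
If you are working outside a notebook, you can also write the visualization to a standalone HTML file (the file name here is illustrative):

# Save the interactive visualization to a standalone HTML file
with open('enron_lda.html', 'w') as f:
    pyLDAvis.save_html(vis, f)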

Let’s check out a few topics.

Generally, we get more value out of this when each document has more words. Yet, at first glance, the algorithm seems to be doing something right.

However, how do we judge how well this model has done without inspecting every single topic?

Model Perplexity And Coherence

The perplexity and the coherence scores of our model give us a way to address this.

According to Wikipedia:

In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample. It may be used to compare probability models. A low perplexity indicates the probability distribution is good at predicting the sample.

Said differently:

Perplexity tries to measure how this model is surprised when it is given a new dataset — Sooraj Subrahmannian

So, when comparing models a lower perplexity score is a good sign. The less the surprise the better.

Here’s how we compute that.

# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))

Though we have nothing to compare that to, the score looks low.

You'll find that the coherence score is a better predictor of the quality of the topics than the perplexity score.

Let’s explore that.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

This score tries to quantify the semantic similarity of the high-scoring words within each topic. A higher score means the result is more human-interpretable.

So, naturally, a higher coherence score means a better model.

If you want to go deeper into coherence scores, this paper does a good job. The link below is also a great resource for that.

http://qpleple.com/topic-coherence-to-evaluate-topic-models/

Both scores give us a way to quantify and compare the quality of our models.

Next up, is the Mallet based LDA model.

Topic Modeling 2.0 — Mallet LDA Model

According to the official Mallet website:

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

In our specific case, we’ll be leveraging the topic modeling portion of it.

Note: make sure Java is installed in whatever environment you are running this in, especially if it's a notebook. You might need to run a command like 'apt-get install openjdk-8-jdk' if you are on a Linux-based environment.

Once you have downloaded Mallet, set a reference to the path of the unzipped folder and then create the model.

The good news is that we can reuse the corpus we already created.

mallet_path = 'mallet-2.0.8/bin/mallet' 
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)

Now, we can inspect some of the topics.

# Show Topics
print(ldamallet.show_topics(formatted=False))

Instead of visually inspecting the topics this time, we can compare the coherence score to that of our previous model.

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

We can see some improvement, from 0.412 to 0.514, a sign of a better model.

You'll find a number of resources highlighting that Mallet typically gives better results. Our result confirms this, so we're going to stick with the Mallet-generated model.

Determining the Optimal Number of Topics

Both models expect you to pass in the number of expected topics as input. How do you figure out what this should be?

In our case, we selected 20 topics. But, was this a good choice?

Our coherence score actually gives us a good clue. We want to find the number of topics that gives us the highest coherence score.

So, we can run through a variety of topic numbers to observe what this does to the coherence score.

Starting with two topics, you can see what our score looks like. As you gradually increase this value, you can see how it impacts the coherence score.

Run through different ‘number of topics’
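
The code for that sweep isn't shown here, but a minimal sketch of the idea looks like this (the range of topic counts is just illustrative, and each Mallet run can take a while):

# Try several topic counts and record the coherence score for each
topic_counts = range(2, 26, 4)
coherence_scores = []
for k in topic_counts:
    model_k = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus,
                                               num_topics=k, id2word=id2word)
    cm = CoherenceModel(model=model_k, texts=data_lemmatized,
                        dictionary=id2word, coherence='c_v')
    coherence_scores.append(cm.get_coherence())

for k, score in zip(topic_counts, coherence_scores):
    print(k, round(score, 4))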

Plotting these values, we get a clearer view.
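
A quick Matplotlib sketch, reusing topic_counts and coherence_scores from the sweep above:

import matplotlib.pyplot as plt

plt.plot(list(topic_counts), coherence_scores)
plt.xlabel('Number of topics')
plt.ylabel('Coherence score (c_v)')
plt.show()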

Coherence vs Number of Topics

The magic number seems to be around 14 topics, so we’ll assume this number of topics moving forward.

Here is a glimpse of what our word topic grouping now looks like.

Remember, the weights are indicators of how important a token is within that topic

Dominant Topics and Relevant Keywords

Something interesting we can now do is to look at each document (email) to assess the dominant topic and related keywords.

This could be useful in a text summarization or topic labeling task.

The approach we are using here is to figure out which topic contributes the highest percentage to a given document.

Reminder: Refer to this python notebook for the code.
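
The gist of it, as a minimal sketch (not the notebook's exact code), using the Mallet model's per-document topic distributions:

# For each email, keep the topic with the largest share of the document
dominant_topics = []
for doc_topics in ldamallet[corpus]:
    topic_num, prop = max(doc_topics, key=lambda x: x[1])
    dominant_topics.append((topic_num, round(prop, 4)))

print(dominant_topics[:5])  # e.g. [(topic_id, proportion), ...]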

Note that this system is not smart enough to give each topic an actual descriptive phrase; a topic number/id is all that is assigned. As is, we have to explore the keywords in a topic to describe it.

Inspecting each email_body for its dominant topic and keywords within the said topic

We can inspect the second email and see the dominant topic’s keywords.

Best Example Document (Email) For a Given Topic

Now, imagine we pick a topic, say one about vacations; we might want to find out which email is most relevant to that topic.
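
One simple way to do that, as a rough sketch using the per-document topic shares from the Mallet model (topic_id here is just an example, not a vacation topic from our run):

# Find the email with the highest share of a chosen topic
topic_id = 0
best_doc, best_prop = None, 0.0
for i, doc_topics in enumerate(ldamallet[corpus]):
    for t, prop in doc_topics:
        if t == topic_id and prop > best_prop:
            best_doc, best_prop = i, prop

print(topic_id, best_doc, round(best_prop, 4))
print(email_df.body.iloc[best_doc][:300])  # preview of the best-matching email body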

Below, we peek into 5 topics while highlighting the keywords and the email best representative of that topic.

First 5 topics with Email body that best represents that topic
A snippet of email best representative of topic 0

You can take this further by trying to figure out the top n documents that best represent a given topic.

What Did We Learn?

A true scientist is never afraid to examine the results of his or her experiment. Here goes.

I wrongly assumed that we might see some emails pointing to bad practices that led to the fall of the company.

Then I realized some flaws in our approach.

1). To manage the size of data being processed at a time we had simply grabbed the first 10,000 emails.

email_subset = emails[:10000]

To add a little more variety you can switch this to selecting a random sample of about the same size.

email_subset = emails.sample(frac=0.02, random_state=1)

We set frac to 0.02 to take 2% of the whole dataset (roughly 10,000 emails) and random_state to 1 to get consistent results. This increases the coherence score from 0.514 to 0.5883.

Here are the keywords for the top 5 topics:

Topic 1:

[agreement, attach, doc, draft, comment, change, letter, ca, energy, document]

Guess: Energy-Related contract in California?

Topic 2:

[original_message, work, call, meeting, discuss, good, give, time, talk, meet]

Guess: Communication?

Topic 3:

[power, energy, state, California, utility, price, electricity, plant, market, electric]

Guess: Energy-Related Business in Cali?

Topic 4:

[david, william, mark, scott, michael, steve, robert, paul, chris, richard]

Guess: People in the email?

Topic 5:

[market, issue, ferc, order, cost, provide, rate, transmission, require, include]

Guess: Regulations (FERC — stands for Federal Energy Regulatory Commission)?

2). Another potential flaw in our approach is that the emails in our analysis did not focus on the key players. A better analysis would make sure their emails are represented in the dataset.

This highlights the value of domain knowledge in such an analysis.

Can you think of any other tweaks?

Summary

In this post, we have walked through some important aspects of exploring topics within a corpus. Here are a few takeaways.

  • There are amazing tools at your disposal to evaluate documents for themes or topics. LDA is one such algorithm.
  • Depending on where you are starting from, some cleaning and processing will be required to create your corpus. Removing stopwords, lemmatization, and N-Grams are just a few techniques.
  • Some tuning and refinement will be needed to get the best results from your model. This might entail figuring out the right number of topics or themes. It could also require you to get more data, with each document containing more words.
  • Find the right metrics to evaluate your model. Coherence is a great metric and your goal is to increase this score as you refine.

Your Move

You should now be able to apply the LDA learnings from this post to:

  • Determine the dominant topics and related keywords in a given corpus
  • Find the best example of documents that align within a given topic
  • Determine how similar two or more corpora are by comparing the dominant topics in them
  • Create a recommender system to recommend similar articles or documents after grouping them by topics

Props to the good folks at Machine Learning Plus, whose post and code served as a guide for most of this material.

Thanks for reading and all the best on your topic modeling journey.

Follow me on LinkedIn https://www.linkedin.com/in/shofola/

Best wishes!