Fraud Detection In the Enron Emails

Aly S · Published in Analytics Vidhya · 7 min read · Dec 13, 2019

So we all remember, or have read about, the massive risk management failure at Enron. It had all the classic red flags, but it blew up to such a massive scale because the Board of Directors somehow remained ignorant of what was really going on. They relied instead on Lay and then Skilling and their supposed 'expertise', as well as the opinions of the now-dismantled Arthur Andersen. Could this have been avoided? Certainly. As I mentioned, many of the red flags from other risk management failures were there. It's easy to ignore qualitative issues: they're explainable, and you can talk your way around ignoring them. But quantitative issues are a much harder reality to ignore. Numbers don't really lie the same way people do.

I set out to build a machine learning model that would allow senior managers or board members to see people who may warrant additional scrutiny. I used the email traffic to classify the text and, with fairly accurate results, could put people on an 'investigate further' list. Perhaps had the board been required to review such findings, the scale of the scandal would not have caused nearly 20 years of repercussions for both the energy and financial markets. Here, I will detail my process at a high level, though those interested can find the full code here. If you are following along in the code, please open the notebooks in order. The size of the dataset required me to break the work into a number of different notebooks, which saved me from keeping the massive dataset in memory the whole time.

Most of the real analysis is done with the help of NLTK and scikit-learn. Getting the data ready, however, took up most of the time spent on this model. Emails are very, very messy, and this dataset contains very, very many of them. Parsing the files into a dataframe was the first task, and I ended up using a library I found called Parser. It allowed me to loop over the directories and subdirectories to save each email section into its own list. Those lists were then concatenated and ready for use.
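
To give a feel for what that step involves, here is a minimal sketch using Python's standard-library email module rather than the library the notebooks actually use. The maildir/ path, the column choices, and the function name are my own assumptions for illustration, not the original code.

import os
import email
import pandas as pd

def parse_enron_emails(root_dir):
    """Walk the maildir tree and collect each message's headers and body."""
    records = []
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, 'r', encoding='latin-1') as f:
                msg = email.message_from_file(f)
            records.append({
                'date': msg.get('Date'),
                'from': msg.get('From'),
                'to': msg.get('To'),
                'subject': msg.get('Subject'),
                'body': msg.get_payload(),  # Enron corpus is plain text, no MIME parts
            })
    return pd.DataFrame(records)

df = parse_enron_emails('maildir/')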

I wanted to see if I could trace the dates of the scandal by plotting the number of emails sent on any one day. This took some serious scrubbing, but I was able to arrive at a plot of daily email volume. In the real world, where you have a full body of emails instead of only those pertaining to the scandal, one could imagine that even an escalating number of emails would be worth investigating, particularly if it broke with previous patterns.
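
A hedged sketch of that aggregation, assuming the df from the parsing step above; the raw RFC 2822 date headers are inconsistent, so they need coercion before resampling.

import pandas as pd
import matplotlib.pyplot as plt

# Coerce the messy date headers; unparseable rows become NaT and are dropped
df['date'] = pd.to_datetime(df['date'], errors='coerce', utc=True)

daily_counts = (df.dropna(subset=['date'])
                  .set_index('date')
                  .resample('D')
                  .size())

daily_counts.plot(figsize=(15, 5))
plt.ylabel('Emails sent')
plt.title('Email volume per day')
plt.show()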

I also wanted to see what a network map of the dataset would look like. I used nxviz, which is a visualization toolkit for NetworkX, and tried three different types of mapping to show the interplay between individuals. Certain people at the centers of these maps could be seen as high-risk given the scope of their interactions. I included the network map below; in my code you'll also find an arc plot and a circos plot, sketched after the snippet.

# Drawing the network map with a spring layout
import networkx as nx
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 20))
position = nx.spring_layout(G, k=.1)
nx.draw_networkx(G, position, node_size=25, node_color='red',
                 with_labels=False, edge_color='blue')
plt.show()
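
For reference, the arc and circos plots could look something like the sketch below. This assumes the pre-1.0 nxviz object API (ArcPlot and CircosPlot classes) that was current when this was written; newer nxviz releases replaced it with a different, functional API.

# Hypothetical sketch using the older nxviz object API (pre-1.0)
from nxviz import ArcPlot, CircosPlot
import matplotlib.pyplot as plt

a = ArcPlot(G)      # nodes on a horizontal line, edges drawn as arcs
a.draw()
plt.show()

c = CircosPlot(G)   # nodes arranged around a circle
c.draw()
plt.show()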

Obviously, the full spring-layout map is ridiculous, and that is what happens when you plot something this large. Here's a slice of the top 1,000 senders and their interactions, in the form of edges to recipients. These are sorted by edge weight; those with higher weights should be more 'important' to the investigation we are conducting.

# Sort edges by weight, highest first, and keep the top 1,000
top_edges = sorted(G.edges(data=True), key=lambda t: t[2].get('weight', 1), reverse=True)
top_edges = top_edges[:1000]

Given the size of the dataset, I started some additional exploring using emails from only one individual, Kay Mann. I wanted to look at word frequency, but tokenizing text is processor intensive, so I chose to test my method on a smaller dataset before using the whole thing. This let me use NLTK's frequency distribution to get the top words and their use counts. With such a simple metric, displaying the top words as a word cloud was a natural step. The WordCloud library makes this incredibly easy and allows us to use Matplotlib and seaborn for formatting and display options. A sketch of the tokenizing step is below, followed by the word cloud code.
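
A minimal sketch of how the frequency step might look. Here kay_mann_text stands in for the concatenated bodies of Kay Mann's emails; it is an assumption for illustration, not a variable from the original notebooks.

import nltk
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

stopwords_list = set(stopwords.words('english'))

# kay_mann_text is assumed to hold the concatenated email bodies
tokens = [w for w in word_tokenize(kay_mann_text) if w.isalpha()]
freq_dist = FreqDist(tokens)
top_words = dict(freq_dist.most_common(200))  # word -> count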

top_keys = list(top_words.keys())
most_frequent = {}
for key in top_keys:
    new = key.lower()                     # normalize case
    most_frequent[new] = top_words[key]

# Creating the WordCloud object
from wordcloud import WordCloud

# generate() parses raw text; generate_from_frequencies(most_frequent)
# would use the counts directly
comment_words2 = str(most_frequent)
wordcloud2 = WordCloud(width=800, height=800,
                       background_color='white',
                       stopwords=stopwords_list,
                       min_font_size=10).generate(comment_words2)

# Plot the WordCloud image
plt.figure(figsize=(15, 15), facecolor=None)
plt.imshow(wordcloud2)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

For those interested, some of the people included in this word cloud played very fascinating parts in the Enron scandal. I included notes on a few of them in the code descriptions available on my GitHub repo.

Now that we have some of the fun stuff down, including some statistics I chose not to include here, the next step in the analysis is, well, the analysis. We are going to take an unsupervised machine learning model, cluster the dataset, and use those clusters as the labels for a supervised classification model. This should give us a 'to-be-investigated-further' list, which, in terms of resources, lets fraud analysts spend their time on additional tasks or lets the business allocate those resources elsewhere.

To do what I want to do here, we need to take the outrageously large sparse matrix that is our current independent variable and slice it before we can make it dense. Matrices that large simply cannot be densified using the standard Python libraries on a personal machine; it's why we use sparse matrices in the first place. We can still use a very large matrix [40,000 x 40,000] for our model. The matrix consists of the vectorized TF-IDF features for our word corpus. I used TF-IDF to account for the fact that some words in English are just more common than others, regardless of the corpus they appear in.
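
A minimal sketch of the vectorizing and slicing step, assuming df['body'] holds the email text; the parameters shown are illustrative, not the original notebook's settings.

from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the email bodies into a sparse TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words='english', max_features=40000)
tfidf_sparse = vectorizer.fit_transform(df['body'])

# Slice down to 40,000 rows before any dense operations; the full
# corpus as a dense matrix would not fit in memory
X = tfidf_sparse[:40000]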

What model will we be using? I decided to implement a K-Means clustering algorithm, initialized with two clusters. Hopefully, this would output a cluster of important people and a cluster of important words. These labels will in turn be the input for a supervised model, but more on that later. Below is the output of the K-Means clustering. I actually used MiniBatchKMeans to save computing power, since it does not require the whole dataset to be held in memory as it moves along; a sketch of the call follows the figure caption.

Two-cluster K-Means analysis
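
A minimal sketch of the clustering step, assuming the sliced TF-IDF matrix X from above; the batch size and random state are illustrative choices, not the original settings.

from sklearn.cluster import MiniBatchKMeans

# Fit in mini-batches so the full matrix never has to be processed at once
kmeans = MiniBatchKMeans(n_clusters=2, batch_size=1000, random_state=42)
cluster_labels = kmeans.fit_predict(X)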

Now, with the two clusters each saved in its own dataframe, we can make the plot below: the weighted top 25 words for each of the two clusters. While the plot isn't perfectly segregated, one could expect the separation to improve if we were (somehow!) able to use the whole dataset. Regardless, I was satisfied enough to use these clusters as the 'true' labels for the supervised model. A sketch of one way to build such a plot follows.
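
The original plot was built from the per-cluster dataframes; the sketch below shows one way to get an equivalent view directly from the fitted centroids, assuming the vectorizer and kmeans objects from earlier. Note that get_feature_names_out assumes a recent scikit-learn; older versions used get_feature_names.

import numpy as np
import matplotlib.pyplot as plt

terms = np.array(vectorizer.get_feature_names_out())

fig, axes = plt.subplots(1, 2, figsize=(16, 8))
for cluster, ax in enumerate(axes):
    # Rank terms by their TF-IDF weight in this cluster's centroid
    centroid = kmeans.cluster_centers_[cluster]
    top_idx = centroid.argsort()[::-1][:25]
    ax.barh(terms[top_idx][::-1], centroid[top_idx][::-1])
    ax.set_title(f'Cluster {cluster}: top 25 weighted words')
plt.tight_layout()
plt.show()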

With labels available, attention turns to a supervised model. I chose k-nearest neighbors for its classification strength and lazy-learning properties. My hope was to create a model that would output the appropriate cluster, so an analyst would in turn know who might need additional scrutiny. I implemented the KNN model with the nearest 3 neighbors and came up with a greater than 90% accuracy score on the test data.
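
A minimal sketch of the classification step, assuming X and cluster_labels from the clustering above; the split ratio and random state are illustrative.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, cluster_labels, test_size=0.25, random_state=42)

# k = 3, as in the original model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(f'Test accuracy: {accuracy_score(y_test, knn.predict(X_test)):.3f}')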

That sounds pretty good to me.

If I had an infinite amount of time to continue plugging away at this, there are a few things I would add. First, I would find the resources needed to run the algorithms on the full dataset. Second, I would break the clustering and classifying into two steps: find the people of interest, then look at the language they used in their emails to see whether it could be a red flag for fraud in itself. Finally, I would try different numbers of clusters or neighbors to see if I could further improve model performance. Since my time is not infinite, these ideas will have to be shelved for a rainy day.

I know for a fact, having worked in corporate finance and risk management for over a decade, that these sorts of algorithms are on the cutting edge of fraud detection and financial forensics. Hopefully, this application carries some additional insight for the community looking at how to prevent these sorts of risk management failures in the future. Certainly, it's one way to learn from history so we don't repeat the mistakes of the past. Happy modeling!
