We’re now a month into the ChiPy Python mentorship, and I’m blown away by how versatile and concise Python is.
Last time I posted, I was looking for a new project idea, because the Bureau of Economic Analysis beat me to my last one. I’m a data consultant and already have experience with ETL (extract, transform, load — essentially large-scale, automated data cleanup), so I didn’t want a project that would require me to spend too much time gathering and cleaning data. I searched the Internet for datasets that were clean and ready to go, and one immediately caught my eye.
Side note: These two blog posts are great resources if you’re looking for data to work with for a project.
- 18 places to find datasets for data science projects
- 19 free public datasets for your first data science project
The dataset I chose is the Enron email dataset that was released by the Federal Energy Regulatory Commission (FERC) in 2002 following the exposure of Enron’s accounting fraud and the subsequent FBI investigation. The dataset contains over half a million emails sent by 160 of Enron’s top executives. The data was later purchased and cleaned up by UMass and re-released for research purposes. To date, 16 years later, the Enron emails are the largest set of real emails ever released to the public.
My interest in this data is twofold. The original purpose of acquiring the emails was obviously to investigate the Enron executives who were potentially involved in the large-scale fraud going on at the firm. In 2001, the tools available for data analysis were far less sophisticated than what we have today. I’d like to do an investigation of my own using a graph database for relationship mapping, and NLTK for processing the emails.
My second area of interest is in exploring these emails outside of their relationship to the Enron scandal. The vast majority of these emails have no connection to any criminal activity, and are just employees communicating about work, scheduling meetings, sending requests, and discussing their personal lives. This dataset has been used in research on topics ranging from spam detection to workplace gender studies. If doing my own Enron investigation isn’t enough work on its own, I’ll try to create robot secretary algorithms that can pick out to-dos or flag important messages.
My first step in this process is to load all of the emails into a database so that I can explore the data more easily. I often work with SQL databases, but have very limited experience using Python for data processing. My mentor recommended I use SQLAlchemy, and I’ve found it pretty user friendly so far. For the database, I chose Postgres, because I’m familiar with it and it’s open source. For my Postgres interface I use pgAdmin III.
What is SQLAlchemy?
SQLAlchemy is a SQL toolkit and Object Relational Mapper (ORM) for Python. It bridges the gap between Python and SQL so that you can interact with a SQL database more easily using Python code.
What is an ORM?
An ORM (Object-Relational Mapping) creates a virtual object representation of your database model so that you can interact with it using an object-oriented programming language. It “Maps” a “Relational” model (the database) to an “Object” model (the programming object). This allows you to interact with your object representation of the database in ways that are native to the programming language, rather than having to constantly embed and execute SQL statements and maintain an active database connection.
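To make that concrete, here’s a minimal, self-contained sketch of the difference (the `Email` model and the data are hypothetical, and I use an in-memory SQLite database so it runs anywhere — the idea is the same with Postgres):

```python
from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Email(Base):  # hypothetical one-table model, just for illustration
    __tablename__ = 'emails'
    message_id = Column(String, primary_key=True)
    subject = Column(String)

engine = create_engine('sqlite://')  # in-memory database for the demo
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# insert and query as plain Python objects -- no hand-written SQL,
# no cursor management; the ORM emits the SQL behind the scenes
session.add(Email(message_id='<1@example>', subject='Forecast meeting'))
session.commit()

match = session.query(Email).filter(Email.subject.like('%meeting%')).one()
```

Without the ORM, that last line would be something like `cursor.execute("SELECT * FROM emails WHERE subject LIKE '%meeting%'")` plus manual row handling.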
My data model
The emails unzip from a .tar file to thousands of .file files and are formatted like this:
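A typical raw message looks roughly like this (the addresses and contents here are invented for illustration; the real files also carry extra metadata headers like X-From, X-Folder, and X-Origin):

```
Message-ID: <13505866.1075863688222.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com, john.lavorato@enron.com
Subject: Forecast meeting
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-cc:
X-bcc:

Let's meet Friday to go over the gas forecast.
```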
I thought carefully about how to structure my database to be useful, and settled on a simple relational model with two tables, emails and addresses. The headers and content of each email are stored in emails, and addresses contains one row for each person the email was sent to, along with whether they were on the To, Cc, or Bcc line.
I chose this because it’s so common for the same email to be sent to many people at once. I didn’t want the same email’s contents duplicated over and over, because this would throw off analysis if I look for specific words or try to pick out emails with certain subject matter. The emails table allows me to look through the contents of the emails without filtering for duplicates.
However, one of my main goals with the data is to map the relationships between the senders. To do this, I need to know each sent-received connection. I decided to parse the To, Cc, and Bcc lines out to create one row per recipient in the Addresses table. That way I don’t need to search through a huge list of string arrays to find all the emails that were sent to a specific person.
Example of how an email would be stored:
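Using a made-up message sent to one person on the To line and one on the Cc line, the two tables would hold something like this (column names are my shorthand for the schema described above):

**emails**

| message_id | date | from | subject | body |
| --- | --- | --- | --- | --- |
| <42.JavaMail.evans@thyme> | 2001-05-14 | phillip.allen@enron.com | Forecast meeting | Let's meet Friday... |

**addresses**

| message_id | address | type |
| --- | --- | --- |
| <42.JavaMail.evans@thyme> | tim.belden@enron.com | To |
| <42.JavaMail.evans@thyme> | john.lavorato@enron.com | Cc |

One row in emails, one row per recipient in addresses — no duplicated content, and every sent-received connection is queryable.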
Moving on to code, in my example below, I have two tables in my Postgres database, emails and addresses. They are represented as Python classes Emails and Addresses. Each column in each table is an object in its class — for example, message_id is a Column object with the datatype String. The label_by_key dict is used for mapping our parsed emails to the table later.
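A sketch of those class definitions — the exact columns and the keys in label_by_key are my guesses at a reasonable schema, not a definitive listing:

```python
from sqlalchemy import Column, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Emails(Base):
    """One row per unique email: headers and content."""
    __tablename__ = 'emails'
    message_id = Column(String, primary_key=True)  # Column object, datatype String
    date = Column(String)
    sender = Column(String)
    subject = Column(String)
    body = Column(Text)

class Addresses(Base):
    """One row per recipient of each email."""
    __tablename__ = 'addresses'
    id = Column(Integer, primary_key=True)
    message_id = Column(String, ForeignKey('emails.message_id'))
    address = Column(String)
    label = Column(String)  # 'To', 'Cc', or 'Bcc'

# maps parsed header keys to recipient labels when loading addresses rows
label_by_key = {'To': 'To', 'X-cc': 'Cc', 'X-bcc': 'Bcc'}
```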
With these classes set up, you can use SQLAlchemy to create the tables in Postgres, which guarantees that the object model exactly matches the relational model in the database. You can do this in one line:
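That one line is `Base.metadata.create_all(engine)`. A self-contained sketch (trimmed to a single one-column class; the connection URL is illustrative — my real one points at Postgres, but SQLite in-memory lets the snippet run anywhere):

```python
from sqlalchemy import Column, String, create_engine, inspect
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Emails(Base):  # trimmed to one column for the sketch
    __tablename__ = 'emails'
    message_id = Column(String, primary_key=True)

# with Postgres this would be e.g. create_engine('postgresql://user:pass@localhost/enron')
engine = create_engine('sqlite://')

# the one line: emits CREATE TABLE for every class registered on Base
Base.metadata.create_all(engine)
```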
Tables are ready to go in Postgres.
I’m in the process now of parsing the emails with regular expressions and loading them to the database. This is what that looks like:
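Roughly, the parsing step looks like this — a simplified sketch, with an invented sample message, that ignores wrinkles like headers wrapped across multiple lines:

```python
import re
from pathlib import Path

sample = """Message-ID: <123.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com, john.lavorato@enron.com
Subject: Forecast meeting

Let's meet Friday to go over the gas forecast."""

def parse_email(raw):
    """Parse one raw email into a dict of headers plus the body."""
    headers, _, body = raw.partition('\n\n')  # blank line separates headers from body
    parsed = {'body': body}
    # each header line is "Key: value"; values may be blank
    for key, value in re.findall(r'^([\w-]+):\s*(.*)$', headers, re.MULTILINE):
        parsed[key] = value
    # recipient headers hold comma-separated address lists
    for key in ('To', 'Cc', 'Bcc'):
        if parsed.get(key):
            parsed[key] = [a.strip() for a in parsed[key].split(',')]
    return parsed

parsed = parse_email(sample)

# the directory-wide version is one list comprehension ('maildir' being
# wherever the extracted files live):
# emails = [parse_email(p.read_text(errors='ignore'))
#           for p in Path('maildir').rglob('*') if p.is_file()]
```

Each dict can then be mapped onto the Emails and Addresses tables and added to a SQLAlchemy session.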
More to come on this, but what really blew my mind was that a single list comprehension can parse every email in the directory and collect each parsed email into the list of dicts. I couldn’t believe how concisely and easily I was able to do that.
Shoutout to my mentor Jay for a couple deep dives into regular expressions! Could not have figured those out without him. I’m excited to get all my data loaded and start exploring.