Enron fraud detector

Udacity added a new person of interest (POI) feature to the public Enron dataset, flagging Enron employees who were indicted, settled without admitting guilt, or testified in exchange for immunity.

The data is financial (salary, loan advances, etc.) and email-related (from and to addresses).

Cleaning and initial analysis pulled out a few hidden errors planted by Udacity.

I used individual box plots to devise new features and SelectKBest to score them according to their relevance to the target variable (POI).
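The SelectKBest step looks roughly like the sketch below. The data here is synthetic (the real Enron values and feature names are not reproduced); only the scoring mechanics are illustrated.

```python
# Sketch of scoring candidate features with SelectKBest.
# Synthetic data: 20 "employees", 3 candidate features; the first
# feature is deliberately made informative about the POI label.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(42)
y = np.array([0] * 15 + [1] * 5)        # 5 "POIs" out of 20
X = rng.rand(20, 3)
X[:, 0] += y * 2.0                      # inject signal into feature 0

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
for name, score in zip(["feat_a", "feat_b", "feat_c"], selector.scores_):
    print(f"{name}: {score:.1f}")
```

Features with low scores get dropped; in the real project this is what ruled out most of the email-based candidates.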

I ended up with 6 original features (salary, bonus, total_stock_value, exercised_stock_options, deferral_payments, and director_fees) and 4 new features:

* odd_payments — a list of POI-only extreme values
* key_payments — salary + bonus + other
* retention_incentives — long_term_incentive + total_stock_value
* total_of_totals — total_payments + total_stock_value
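The combined features are straightforward sums. A minimal sketch, assuming the Udacity dict-of-dicts data format where missing values appear as the string 'NaN' (the record below is hypothetical, not a real employee):

```python
# Hypothetical employee record in the Udacity 'NaN'-as-missing format.
record = {"salary": 200000, "bonus": 100000, "other": "NaN",
          "long_term_incentive": 50000, "total_stock_value": 300000,
          "total_payments": 400000}

def num(value):
    """Treat the dataset's 'NaN' strings as zero."""
    return 0 if value == "NaN" else value

record["key_payments"] = (num(record["salary"]) + num(record["bonus"])
                          + num(record["other"]))
record["retention_incentives"] = (num(record["long_term_incentive"])
                                  + num(record["total_stock_value"]))
record["total_of_totals"] = (num(record["total_payments"])
                             + num(record["total_stock_value"]))
```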

I also tried various combinations with the email data, such as a count of any POI correspondence, but the SelectKBest scores weren’t high enough.

I ran the features through several scikit-learn algorithms (GaussianNB, LinearSVC, DecisionTree, LogisticRegression, RandomForest) and used a combination of GridSearchCV and intuition to tune the parameters.
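The GridSearchCV step can be sketched as follows. The data is synthetic and the parameter grid is illustrative, not the exact grid used in the project:

```python
# Sketch of tuning a decision tree with GridSearchCV on synthetic data.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # synthetic binary target

# Illustrative grid over the two parameters tuned in the final model.
param_grid = {"min_samples_split": [2, 5, 10],
              "splitter": ["best", "random"]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_)
```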

My final algorithm was a decision tree with min_samples_split = 5 and splitter = 'random'. The StratifiedShuffleSplit test results were:

Precision: 53% (of the people flagged as POIs, the share who actually are POIs)
Recall: 43% (of the true POIs in the dataset, the share correctly identified)
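The evaluation loop can be sketched like this: repeatedly resplit with StratifiedShuffleSplit (which preserves the rare-POI class balance in every split), accumulate predictions, and score precision and recall over the pooled results. The data here is synthetic, sized to mirror the real ~12.5% POI rate:

```python
# Sketch of StratifiedShuffleSplit evaluation with pooled precision/recall.
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X = rng.rand(144, 4)
y = np.zeros(144, dtype=int)
y[:18] = 1                      # ~12.5% positives, as in the dataset
X[y == 1, 0] += 1.0             # give the positives some signal

clf = DecisionTreeClassifier(min_samples_split=5, splitter="random",
                             random_state=1)
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=1)
preds, truths = [], []
for train_idx, test_idx in sss.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    preds.extend(clf.predict(X[test_idx]))
    truths.extend(y[test_idx])

print("precision:", precision_score(truths, preds))
print("recall:", recall_score(truths, preds))
```

Pooling predictions across many stratified splits is what makes the scores stable despite only ~18 POIs being available for testing.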

This is not a brilliant result, but it is not surprising given such a small dataset — 143 employees, 12.5% of whom are POIs. A possible next step is to look at the actual emails: links between addresses and key words within the content.

The test also assumes that the POI list is 100% accurate. Maybe some people got away with it?