My Journey of Creating, Extremely Munging, and Machine Learning Unstructured Data for Investing Decisions

I’m part of a team that is programmatically extracting meaning from unstructured data (words), with an end goal of helping people invest in line with their goals and values. The area is called sustainable investing or “ESG” (Environmental, Social, and Governance). The project is called Data Simply and you can read more about the overall project here.

We’ve now gotten it to a state where we have automated the gathering, processing, serving, and analyzing of the disclosure documents that companies file with the SEC. Our technology reads and understands the words and turns them into quantitative investment signals, with a way to drill back into the source data.

We do this at scale with low latency (near real time). We’ve generated 90+ million unique searchable data points and 1+ terabyte of data, mining the most important filings from every company at the SEC across all 3 major US exchanges, every hour of every day. Or rather, our technology does this for us.

Track your portfolio in the Data Simply dashboard

In this post I’m sharing some things we learned along the way about extreme data munging as well as some specific steps for machine learning data analysis.

Step 1: Data Wrangling

I have often heard that when working with data, the data wrangling or “data munging” is 80–90% of the work. I completely agree with this and have found it to be true over and over with my data projects.

Why does this take so much effort? You need to pay attention to all these things:

  • Data acquisition (finding good data sources, accurately gauging the quality and taxonomy of the data, acquiring and inferring labels). We developed our own data pull from the SEC. The data we get from there is pretty messy and needs a lot of attention. We tried many datasets and ended up generating most of our own data, apart from some fundamentals data.
  • Data pre-processing (missing data imputation, feature engineering, data augmentation, data normalization, cross validation split)
  • Data post-processing (making the outputs of the model usable, cleaning out artifacts, handling special cases and outliers)
  • (H/T to Emmanuel Ameisen for his description of these categories which are right on)
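
To make the pre-processing bullet concrete, here is a minimal sketch of imputation, normalization, and a train/test split with pandas and Scikit-Learn. The column names and the toy data are hypothetical stand-ins (our real features are not shown in this post), and the median imputation and standard scaling are just one reasonable choice, not necessarily what runs in our pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features; the real columns are not shown in this post.
rng = np.random.RandomState(42)
features = pd.DataFrame({
    "word_count": rng.randint(500, 50000, size=200).astype(float),
    "sentiment_score": rng.normal(0.0, 1.0, size=200),
})

# Knock out some values to simulate messy source data.
missing_rows = rng.choice(200, size=20, replace=False)
features.loc[missing_rows, "sentiment_score"] = np.nan

# Split first so the imputer and scaler only learn from the training rows.
train, test = train_test_split(features, test_size=0.25, random_state=42)

imputer = SimpleImputer(strategy="median")  # fill gaps with the column median
scaler = StandardScaler()                   # zero mean, unit variance per column

train_clean = scaler.fit_transform(imputer.fit_transform(train))
test_clean = scaler.transform(imputer.transform(test))

print(train_clean.shape, test_clean.shape)
print("missing values left:",
      int(np.isnan(train_clean).sum() + np.isnan(test_clean).sum()))
```

Fitting the imputer and scaler on the training split only, then reusing them on the test split, keeps information from the held-out rows from leaking into the pre-processing.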

Step 2: Tuning / Pipelining

We realized early on that we could not simply run one job at a time, wait until that finished, then start another. We also realized that processing power had to co-exist with resources to serve the online application to users, all the time. We also understood that data, once analyzed, needs to be checked and may need to be re-processed or re-analyzed to maintain our high quality standard — so we created jobs for this to run in the nighttime hours.

For example, we pass arguments to perform_async, which can be composed of simple JSON datatypes: string, integer, float, boolean, null, array and hash. We have the Sidekiq client API use JSON.dump to send the data to Redis. The Sidekiq server pulls that JSON data from Redis and uses JSON.load to convert the data back into Ruby types to pass to the perform method.

# Queue a background job only when the filing has no text yet
if ft.complete_html.blank? && ft.plain_text.blank?
  PopulateFilingTextJob.perform_async(ft.sec_filing_id)
end
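
Why stick to simple JSON datatypes for job arguments? Because anything richer gets changed on the way through the queue. The sketch below is a Python analogy using the standard json module, not Sidekiq itself (which is Ruby), and the argument values are made up for illustration. It shows that a plain ID survives a dump/load round trip unchanged, while a tuple and an integer key do not.

```python
import json

# Hypothetical job arguments: a plain integer ID, like the filing ID passed above.
simple_args = [12345]
assert json.loads(json.dumps(simple_args)) == simple_args  # survives unchanged

# Richer types do not survive the round trip intact.
rich_args = {"ids": (1, 2, 3), 7: "integer key"}
round_tripped = json.loads(json.dumps(rich_args))

print(round_tripped)
# → {'ids': [1, 2, 3], '7': 'integer key'}
```

The tuple comes back as a list and the integer key comes back as a string, which is why passing a simple ID and looking the record up inside the job is the safer pattern.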

I won’t go into much of this here as it deserves its own post. I’ll simply say the various jobs are automated, and we add some tuning and tweakage over time.

Step 3: Machine Learning on the Data

All that to get to data that we can use for machine learning analysis.

Now that we’re here (wipe brow) let’s explore the data …

In this case, I accessed the dataset we had created via our own S3 bucket on AWS. We created a dataset that only includes insights from the data and identification data and leaves the initial creation data behind. This works well.

I’m doing this with Scikit-Learn so I’ll start by setting up the libraries and modules, including what I’ll need for the outlier detection:

# Setup for Outlier Detection
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager

from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


rng = np.random.RandomState(42)

Then I’ll just set a pointer to look at my dataset in the AWS S3 bucket I mentioned:

# 2) Load the data from the remote url via an AWS S3 bucket

dataset_url = ''
data = pd.read_csv(dataset_url)

To start let’s make sure we can access this data. Let’s try to get the first 5 rows:

print(data.head())
Data head with the first 5 rows

Then let’s look at the shape of this data:

print(data.shape)

That gives us:

(107119, 6)

So we now know there are more than 107 thousand rows and 6 columns. That is a decent amount of data to start with.

Next let’s get some summary stats about this data:

print(data.describe())

count 107119.000000
mean 120881.355530
std 71456.683309
min 11.000000
25% 60048.500000
50% 116449.000000
75% 179097.500000
max 251076.000000
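
Beyond the summary stats, a few extra sanity checks are worth running before any modeling. Here is a small sketch using an in-memory CSV as a stand-in for the real dataset; the column names (filing_id, score) and the values are hypothetical, since the actual columns are not listed in this post.

```python
import io

import pandas as pd

# Stand-in CSV with hypothetical column names; the real dataset lives in S3.
csv_text = io.StringIO(
    "filing_id,score\n"
    "11,0.42\n"
    "12,\n"
    "13,0.87\n"
    "13,0.87\n"
)
sample = pd.read_csv(csv_text)

print(sample.dtypes)              # column types as pandas inferred them
print(sample.isna().sum())        # missing values per column
print(sample.duplicated().sum())  # fully duplicated rows
```

Checks like these catch empty cells and repeated rows early, before they quietly skew the outlier detection later on.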

Let’s display the data in a dataframe. This gives a visual check of what you are working with. It looks as expected:

Now after all of that things get more exciting (finally)!

I’ve started to explore this data more by looking at plots of the outliers.

This performs four different kinds of Novelty and Outlier detection:

* Based on a robust estimator of covariance, which is assuming that the data are Gaussian distributed and performs better than the One-Class SVM in that case.

* Using the One-Class SVM and its ability to capture the shape of the data set, hence performing better when the data is strongly non-Gaussian, i.e. with two well-separated clusters;

* Using the Isolation Forest algorithm, which is based on random forests and hence more adapted to large-dimensional settings, even if it performs quite well in the examples below.

* Using the Local Outlier Factor to measure the local deviation of a given data point with respect to its neighbors by comparing their local density.

Source: Scikit-Learn documentation
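
To show how the four detectors are used side by side, here is a minimal sketch on synthetic data. The data (one tight Gaussian cluster plus scattered points), the contamination fraction, and the hyperparameters (gamma, n_neighbors) are all assumptions chosen for illustration, loosely modeled on the Scikit-Learn example; none of them come from our real dataset.

```python
import numpy as np
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)

# Synthetic stand-in data: one Gaussian cluster plus uniformly scattered outliers.
n_inliers, n_outliers = 180, 20
outliers_fraction = n_outliers / (n_inliers + n_outliers)
X = np.vstack([
    0.3 * rng.randn(n_inliers, 2),
    rng.uniform(low=-6, high=6, size=(n_outliers, 2)),
])

detectors = {
    "Robust covariance": EllipticEnvelope(
        contamination=outliers_fraction, random_state=42),
    "One-Class SVM": svm.OneClassSVM(
        nu=outliers_fraction, kernel="rbf", gamma=0.1),
    "Isolation Forest": IsolationForest(
        contamination=outliers_fraction, random_state=42),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=35, contamination=outliers_fraction),
}

flagged = {}
for name, detector in detectors.items():
    labels = detector.fit_predict(X)  # +1 marks an inlier, -1 an outlier
    flagged[name] = int((labels == -1).sum())
    print(f"{name}: flagged {flagged[name]} of {len(X)} points")
```

Comparing how many points each detector flags, and which ones, is a quick way to see where the methods agree before reading too much into any single plot.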

Above you can see the plots for these four types of outlier detection: Local Outlier Factor, Isolation Forest, One-Class SVM, and Robust covariance.

Next I will be getting into the analysis of these. I’m excited that there are both central clusters of data points and an interesting number of outliers. I’m eager to explore what these outliers contain and what signals those may send for opportunistic investing decisions. Stay tuned …

Leader, entrepreneur, software engineer, data scientist. See my personal site at and on the Twitter @mbonat