The Great Hack — Approach Outline

Sep 1 · 5 min read


This is a follow-up, more detailed post on the first-ever alternative-data-in-finance hackathon, which the DataScrum community will be holding on 14 September 2019 at Microsoft Reactor in London, as described on our Hackathon Eventbrite page (see also our previous post on the topic).


Let’s suppose that you are a researcher or portfolio manager at a quantitative alternative data hedge fund, trying to model a trading strategy that relies on alternative data, such as web traffic of ‘pure-play’ publicly listed companies (online retailers, booking systems, and publishers), to beat the market consensus forecast of those companies’ earnings.

Below we will detail how to execute an end-to-end trading strategy: what data we would need, what the predictive model choices would look like, how we would pick the best model, and finally what tools we would use to achieve all this.

Sample Trading Strategies

To motivate our work, suppose that we are trying to tackle the bread-and-butter quarterly earnings surprise and disappointment strategy in equities that most alt-data quantitative hedge funds deploy today (see also our post Outline for Alternative Data Hedge Fund Strategy).

Here are some possible choices of companies for such a strategy:

  • Online retailer example — ASOS (vs social media, web traffic)
  • Online booking service example — (vs social media, web traffic)
  • Gaming example — Electronic Arts (vs public gaming stats API and Twitch streams)
  • On / offline retailer — Urban Outfitters (vs social media, web traffic to measure online activity, and hiring numbers and geolocation data to measure offline foot traffic)

Now that we have our trading ideas, let’s see what kind of data we need.


Data — Classical and Alternative

We have two types of input data to our predictive model(s) — financial data and non-financial or alternative data.

Non-Financial or Alternative Data

Some categories of alternative data could include (if you are new to the term, please see our post Data Science — What is Alt Data or Alternative Data? on the topic):

  • Web traffic, social and sentiment, and app usage
  • Credit/debit card transaction data
  • Web email and consumer receipts
  • Geo-location, satellite, and weather
  • Sensor / IoT data
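Alternative datasets such as web traffic typically arrive at a daily (or finer) granularity and need to be rolled up to the quarterly frequency of earnings reports before they can serve as features. A minimal sketch with Pandas, using an invented `visits` series (the column names and values are hypothetical):

```python
# Hypothetical sketch: aggregate daily web-traffic counts into
# quarterly features that can later be joined to earnings data.
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=180, freq="D"),
    "visits": range(1000, 1180),  # invented traffic counts
})

daily["quarter"] = daily["date"].dt.to_period("Q")
quarterly = daily.groupby("quarter")["visits"].agg(["sum", "mean"])
# Quarter-over-quarter growth is a natural candidate feature
quarterly["qoq_growth"] = quarterly["sum"].pct_change()
print(quarterly)
```

The same roll-up pattern applies to card transactions, app usage, or geolocation counts; only the aggregation functions change.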

Financial Data

The classical financial data we are interested in would most likely include:

  • End-of-day price and volume data — this would include end-of-day (adjusted for dividends and splits) prices and potentially the daily volume that exchanged hands
  • Fundamental income statement data — these are data that public companies publish periodically as required by law such as revenues and earnings
  • Analyst consensus estimate data — revenue and earnings estimates and their ranges and perhaps the number of analysts giving the rating as this could help us with prediction error
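The consensus estimates above give us the target variable for an earnings-surprise strategy: the gap between what a company reports and what analysts expected. A minimal sketch, with all figures invented for illustration:

```python
# Earnings surprise as the gap between reported EPS and the
# analyst consensus estimate. The numbers are made up.
import pandas as pd

df = pd.DataFrame({
    "quarter": ["2019Q1", "2019Q2", "2019Q3"],
    "reported_eps": [0.52, 0.47, 0.61],
    "consensus_eps": [0.50, 0.49, 0.55],
})

df["surprise"] = df["reported_eps"] - df["consensus_eps"]
df["surprise_pct"] = df["surprise"] / df["consensus_eps"].abs()
# A simple classification label: did the company beat consensus?
df["beat"] = df["surprise"] > 0
print(df)
```

Framing the problem as predicting `beat` (or the sign and size of `surprise_pct`) is what lets us treat it as a standard supervised-learning task in the modelling sections below.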


Now that we have our data for the model, let’s get started building our model. As in any data science project, we would need to go through the main steps of data preparation, model selection, and model testing or validation. There are many choices here in the vast field of data sciences, but below are just some options.


Data Preparation

  • Data cleansing, filtering and preparation
  • Data exploration and visualisation
  • Aggregating and encoding categorical data into features

Model Selection — Classical Models

  • Simple features and shallow ML using linear and logistic regression with L1 and L2 regularizations
  • Feature engineering and feature selection

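The shallow-ML route above can be sketched with Scikit-Learn: a logistic regression predicting an earnings beat from a couple of engineered features, fitted with both L1 and L2 penalties. All data here is synthetic; `X` stands in for features such as traffic growth and sentiment:

```python
# Logistic regression with L1 and L2 regularization on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # e.g. traffic growth, sentiment score
# Synthetic "beat consensus" label driven by both features plus noise
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200) > 0).astype(int)

l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
print("L2 accuracy:", l2.score(X, y))
print("L1 accuracy:", l1.score(X, y))
```

The L1 penalty additionally drives uninformative coefficients to zero, which makes it a cheap first pass at feature selection.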
Model Selection — Machine Learning Models

  • Supervised vs unsupervised learning approaches
  • Random forests and gradient boosted trees with Scikit-Learn and LightGBM
  • Convolutional Neural Networks (CNNs) and Reinforcement Learning (RL)
  • AutoML — open source solutions (Mindsdb, TPOT)
  • AutoML — vendor solutions (Azure ML, Google AutoML, AWS SageMaker, H2O)

Model Selection — Goodness of Fit and Learning Outcome Metrics

  • Backtesting — proving model performance via a historically simulated financial metric such as PnL (profit and loss) or risk-adjusted return ratio (aka Sharpe ratio)
  • Simulated trading — also known as paper trading, we may want to run our model in real time but without actually executing trades, to see if we make or lose money

Now that we have a plan, what types of tools do we have at our disposal to implement all this?



If we want to go with the popular Python ecosystem, below are some options:

  • Coding environment — we have a choice of offline IDEs and cloud-based solutions such as Jupyter, Azure Notebooks, and Google Colab
  • Visualisation — in the data exploration phase, visualisation is key, so one could try Matplotlib, Seaborn, or Plotnine
  • Python core libraries — we will most likely need Pandas, NumPy, and SciPy
  • Stats, AI, and machine-learning libraries — popular choices include Statsmodels, Scikit-Learn, Keras, TensorFlow, and PyTorch
  • Automated ML environments — again, many choices, but we might try Amazon SageMaker, Azure ML, or Google AutoML, or open-source tools such as TPOT and MindsDB

There are many ways of arriving at a solution, and the above is by no means an authoritative or exhaustive list of how to approach the problem, but we hope it is helpful to some contest participants.

Final Thoughts

To put everything into context, feel free to revisit our related Medium posts.
