Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500 — Part 1: data preparation and model

Wenchen Li
Jul 24, 2017 · 3 min read

Contributions of this paper:

  1. It is unique in deploying three state-of-the-art machine learning techniques, and a simple ensemble of them, on a large and liquid stock universe.
  2. It reveals that ensemble returns only partially load on systematic sources of risk, are robust to transaction costs, and deteriorate over time, presumably driven by the increasing popularization of machine learning and advances in computing power. Strong positive returns can still be observed in recent years, however, at times of high market turmoil.
  3. It focuses on a daily investment horizon instead of monthly frequencies, allowing for much more training data and for profitably exploiting short-term dependencies.

Data

S&P 500

  1. Obtain all month-end constituent lists for the S&P 500 from Thomson Reuters Datastream from December 1989 to September 2015. These lists are consolidated into one binary matrix indicating whether each stock is a constituent of the index in the subsequent month.
  2. For all stocks that have ever been a constituent of the index, download the daily total return indices from January 1990 until October 2015. Return indices reflect cum-dividend prices and account for all further corporate actions and stock splits, making them the most suitable metric for return calculations.

The training datasets are generated with a sliding window: each study period uses 750 days as the training set and the following 250 days as the development set, and consecutive study periods are shifted forward by 250 days.
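The study-period split above can be sketched as follows (the exact index layout is an assumption; the window lengths and stride come from the text):

```python
def sliding_windows(n_days, train_len=750, dev_len=250, stride=250):
    """Yield (train_range, dev_range) index pairs over n_days of data.

    Each study period covers train_len + dev_len trading days; consecutive
    periods start `stride` days apart, so the dev sets tile without overlap.
    """
    windows = []
    start = 0
    while start + train_len + dev_len <= n_days:
        train = range(start, start + train_len)
        dev = range(start + train_len, start + train_len + dev_len)
        windows.append((train, dev))
        start += stride
    return windows

windows = sliding_windows(n_days=2000)  # 5 study periods over 2000 days
```

With a 250-day stride and a 250-day development set, each trading day is covered by exactly one development window, so the out-of-sample periods concatenate into one continuous backtest.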

Method

input & output:

  1. input: the stock's simple returns over multiple lookback horizons (returns over the past m days for m from 1 to 20, then 40, 60, …, 240), giving 31 features per stock.
  2. output: a binary class indicating whether the stock outperforms the cross-sectional median return over the next day.

The cross-section of stock returns is explained here.
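A minimal sketch of the feature and label construction, assuming the horizon set of 1–20 plus 40, 60, …, 240 days and the cross-sectional-median label described above:

```python
import numpy as np

HORIZONS = list(range(1, 21)) + list(range(40, 241, 20))  # 31 lookbacks

def make_features_and_labels(prices):
    """prices: (T, N) array of total-return index levels for N stocks.

    Returns features (T, N, 31) of multi-period simple returns and binary
    labels (T, N) marking stocks above the cross-sectional median of the
    next-day return. Rows without enough history or a next day stay NaN.
    """
    T, N = prices.shape
    feats = np.full((T, N, len(HORIZONS)), np.nan)
    for k, m in enumerate(HORIZONS):
        feats[m:, :, k] = prices[m:] / prices[:-m] - 1.0
    one_day = np.full((T, N), np.nan)
    one_day[:-1] = prices[1:] / prices[:-1] - 1.0
    labels = (one_day > np.nanmedian(one_day, axis=1, keepdims=True)).astype(float)
    labels[np.isnan(one_day)] = np.nan
    return feats, labels
```

Because the label compares each stock against the same-day cross-sectional median, roughly half of the stocks fall in each class by construction, which keeps the classification problem balanced.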

DNN (deep neural networks)

A simple MLP is used (31–31–10–5–2), with maxout activation, dropout (hidden-layer dropout rate 0.5, input dropout rate 0.1), L1 regularization λ_L1 = 0.00001, and the ADADELTA optimizer.

GBT (gradient-boosted trees)

Gradient boosting is applied, deploying shallow decision trees as weak learners.

The number of trees (boosting iterations) is M_GBT = 100, the tree depth J_GBT = 3, the learning rate λ_GBT = 0.1, and the number of features considered at each split m_GBT = 15.

RAF (random forests)

For each of the B_RAF = 1000 trees in the random forest, we first draw a random subset from the original training data. Then we grow a decision tree on this sample, selecting m_RAF = ⌊√p⌋ features at random from the p features at every split. Trees are grown to a maximum depth of J_RAF = 20. The final model is an ensemble of B_RAF trees, so classification is performed by majority vote. All remaining hyperparameters use the H2O defaults for random forests.

ENS (ensemble of the above)

The ensemble is simply the average of the probability forecasts of the three methods above.
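A minimal sketch of the equal-weight ensemble, using toy probabilities for the "outperform" class (the numbers are illustrative, not from the paper):

```python
import numpy as np

p_dnn = np.array([0.70, 0.40])  # DNN forecasts for two stocks
p_gbt = np.array([0.60, 0.55])  # GBT forecasts
p_raf = np.array([0.50, 0.45])  # RAF forecasts

# Equal-weight average of the three probability forecasts.
p_ens = (p_dnn + p_gbt + p_raf) / 3.0  # -> [0.6, 0.46666667]
```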

Reference:

  1. Krauss, Christopher, Xuan Anh Do, and Nicolas Huck. “Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500.” European Journal of Operational Research 259.2 (2017): 689–702.
