Notes on the Numerai ML Competition
Last week I spent some time diving into the Numerai machine learning competition. Below are my notes on the competition: things I tried, what worked and what didn’t. First an introduction to Numerai and the competition…
Numerai is a hedge fund which uses the competition to source predictions for a large ensemble that they use internally to make trades. Another detail that makes the competition unique is that the provided data has been encrypted in a way that still allows it to be used for predictions. Each week, Numerai releases a new dataset and the competition resets. After briefly controlling 1st-2nd place in both score and originality, by the end of the week I was still “controlling capital” with a log loss of 0.68714. In all this earned about $8.17 USD worth of Bitcoin.
Here’s a sample of the training data:
My first step in the competition was to generate a validation set so that I could run models locally and get a sense for how the models would do on the leaderboard. Using a simple stratified split that maintains the target distribution turned out not to be representative of the leaderboard so I turned to “adversarial validation”. This clever idea was introduced by @fastml in a blog post here. Basically:
- Train a classifier to identify whether data comes from the train or test set.
- Sort the training data by it’s probability of being in the test set.
- Select the training data most similar to the test data as your validation set.
This was much more representative with a validation loss corresponding to within ~0.001 log loss on the public leaderboard. Interestingly, the only reason this works is that the test data is dissimilar from much of the training data which violates IID.
Now that I had a good validation set I wanted to get a baseline model trained, validated and uploaded. As a starting point I used logistic regression with default settings and no feature engineering. This gets about 0.69290 validation loss and 0.69162 on the public leaderboard. It’s not great but now I know what a simple model can do. For comparison, first place is currently 0.64669, so the baseline is only about 6.5% off. This means any improvements are going to be really small. We can push this a little further with L2 regularization at 1e-2 which gets to 0.69286 (-0.006% from baseline).
I took a quick divergence into neural networks before beginning feature engineering. Ideally, the networks would learn their own features with enough data, unfortunately none of the architectures I tried had much improvement over simple logistic regression. Additionally, deep neural networks can have far more learned parameters than logistic regression so I needed to regularize the parameters heavily with L2 and batch normalization (which can act as a regularizer per the paper). Dropout sometimes helped too depending on the architecture.
One interesting architecture that worked okay was using a single very wide hidden layer (2048 parameters) with very high dropout (0.9) and then leaving it’s initialized parameters fixed during training. This creates an ensemble of many random discriminators. While this worked pretty well (with a logloss around 0.689) the model hurt the final ensemble so it was removed. In the end neural networks did not yield enough improvement to continue their use here and would still rely on feature engineering which defeated my intentions.
Data Analysis & Feature Engineering
Now I need to dig into the data, starting with a simple plot of each of the feature distributions:
The distributions are pretty similar for each feature and target. How about correlations between features:
Okay, so many of the features are strongly correlated. We can make use of this in our model by including polynomial features (e.g. PolynomialFeatures(degree=2) from scikit-learn). Adding these brings our validation loss down to 0.69256 (-0.05% from baseline).
Now dimensionality reduction. I take the features and run principal component analysis (a linear method) to reduce the original features down to two dimensions for visualization:
This does not contain much useful information. How about with the polynomial features:
The polynomial PCA produces a slightly better result by pulling many of the target “1” values towards the edges and many of the target “0” values towards the center. Still not great so I opted to omit PCA for now.
Instead I’ll use a fancier dimensionality reduction method called t-SNE or “t-Distributed Stochastic Neighbor Embedding”. t-SNE is often used for visualization of high-dimensional data but it has a useful property not found in PCA: t-SNE is non-linear and works on the probability of two points being selected as neighbors.
Here t-SNE captured really good features for visualization (e.g. local clusters), and incidentally for classification too! I add in these 2D features to the model to get the best validation loss so far: 0.68947 (-0.5% from baseline). I suspect the reason this helps is that there are actually many local features that logistic regression cannot pull out but are useful in classifying for the target. By running an unsupervised method specifically designed to align the data by pairwise similarities the model is able to use that information.
Since t-SNE is stochastic, multiple runs will produce different embeddings. To exploit this I’ll run t-SNE 5 or 6 times at different perplexities and dimensions (2D and 3D) then incorporate these extra features. Now the validation loss is 0.68839 (-0.65% from baseline).
Note, some implementations of t-SNE do not work correctly in 3D. Plot them to make sure you’re seeing a blob, not a pyramid shape.
Since t-SNE worked so well, I implemented several other embedding methods including autoencoders, denoising autoencoders, and generative adversarial networks. The autoencoders learned excellent reconstructions with >95% accuracy, even with noise but their learned embeddings did not improve the model. The GAN, including semi-supervised variant, did not outperform logistic regression. I also briefly experimented with kernel PCA and isomaps (also non-linear dimensionality reduction methods). Both improved the validation loss slightly but took significantly longer to run, reducing my ability to iterate quickly, so they were ultimately discarded. I never tried LargeVis or parametric t-SNE but they might be worth exploring. Parametric t-SNE would be particularly interesting since it allows fitting on a test holdout, rather than learning an embedding of all of the samples at once.
One of the models that made it into the final ensemble was to explicitly model pairwise interactions. Basically, given features from two samples predict which of the two had a greater probability of being classified as “1”. This provides significantly more data since you’re modeling interactions between samples, rather than individual samples. It also hopefully learns useful features for classifying by the intended target. To make predictions for the target classification I take the average of each sample’s prediction against all other samples. (It’s probably worth exploring more sophisticated averaging techniques.) This performed similarly to logistic regression and produced different enough results to add to the ensemble.
Now that we have useful features and a few models that perform well I wanted to run a hyperparameter search and see if it could outperform the existing models. Since scikit-learn’s GridSearchCV and RandomSearchCV only explore hyperparameters, not entire architectures, I opted to use tpot which searches over both. This discovered that using randomized PCA would outperform PCA and that L1 regularization (sparsity) slightly outperformed L2 regularization (smoothing), especially when paired with random PCA. Unfortunately neither of the discovered interactions made it into the final ensemble: hand engineering won out.
The final ensemble consisted of 4 models: logistic regression, gradient boosted trees, factorization machines and the pairwise model described above. I used the same features for each model, consisting of the original 21 features and five runs of T-SNE in 2D at perplexities of 5.0, 10.0, 15.0, 30.0, and 50.0 and one run of T-SNE in 3D at a perplexity of 30 (I only included a single run because it takes significantly longer in 3D). These features were combined with polynomial interactions and run through the models to produce the final log loss of 0.68714 on the leaderboard.
Overall it was an interesting competition—very different from something like Kaggle. I especially enjoyed experimenting with the encrypted data which was a first for me. While the payouts and “originality” bonuses are interesting mechanics, it’s often better to look at the rewards as points, more than currency, as this made the competition overall more fun. On the other hand, now I have my first bitcoin… :)