One year participating in the Numerai Machine Learning competition: 7 lessons learned

Pere Ferrera Bertran
8 min read · Mar 16, 2022

Numerai is “The Last Hedge Fund”, one that aggregates stock market predictions from data scientists all over the world, and it seems to be doing quite well.

In this article I want to share some mistakes I made and insights I gathered from my experience participating in their “classic” tournament. The level of the material is beginner/intermediate, and it is suited to anyone interested in the competition and, to some extent, in Machine Learning in general.

If you are interested, you can get started with it quite easily!

1. Understand the data well

I am a big fan of fast prototyping: I wanted to test several hypotheses quickly, so I just took a fixed subset of the training dataset to make computations complete faster. However, the conclusions I drew from those experiments later turned out to be wrong, because the data I ran them on was not representative enough!

It was only upon closely examining and understanding the training data that I realized that each section (“era”) of it was valuable. The data had been carefully selected and post-processed, and it had a very low signal-to-noise ratio due to the nature of the stock market.

It was also only after exploring the dataset further that I realized that some of the post-processing already applied made several of my initial ideas unnecessary… So, in short:

  • There are no shortcuts (especially for hard problems with low signal).
  • Explore & study the data and problem at hand well before getting your hands dirty!

2. Research the field

Financial Machine Learning can feel quite frustrating. Even more so when the data has been heavily obfuscated, as is the case in this tournament.

Whenever I face a new and challenging problem I try to find out what the experts and top practitioners in the field say or have published. I think this principle applies more or less to any form of engineering or science.

As an example, Marcos López de Prado, an expert in the field, has published quite interesting material like the book “Advances in Financial Machine Learning”:

Some of the material covered there does not apply to the “classic tournament” because the data is already post-processed, but here is some of what I found useful:

  • Do not rely too much on back-testing (simulating a financial model on a fixed hold-out dataset). Instead, use strict K-fold cross-validation on the training dataset as a way to extract meaningful conclusions.
  • Feature filtering is important in financial data, especially when using tree-based methods, because of “substitution effects”: with many correlated features, the model cannot consistently choose the variables that matter. Techniques such as MDA (Mean Decrease Accuracy; see for example this article) help here (a minimal sketch follows this list).
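As a small illustration, here is a minimal sketch of MDA-style (permutation importance) feature filtering with scikit-learn; the DataFrame `df`, its “feature_”/“target” column names and the kept-feature threshold are assumptions for illustration, not the tournament’s actual schema:

```python
# A minimal sketch of MDA-style (permutation-importance) feature filtering,
# assuming a pandas DataFrame `df` with feature columns prefixed "feature_"
# and a "target" column; names and thresholds are illustrative only.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

feature_cols = [c for c in df.columns if c.startswith("feature_")]
X_train, X_val, y_train, y_val = train_test_split(
    df[feature_cols], df["target"], test_size=0.2, random_state=0
)

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Permute each feature on the held-out split and measure the score drop:
# features whose permutation barely hurts the score carry little unique signal.
result = permutation_importance(
    model, X_val, y_val, n_repeats=5, random_state=0, n_jobs=-1
)
importances = pd.Series(result.importances_mean, index=feature_cols)
kept_features = importances[importances > 0].sort_values(ascending=False)
print(kept_features.head(20))
```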

Numerai also has a forum where people share their experience. This is another great way to gather insights and, in the end, save time by leveraging other people’s work!

3. Employ an efficient methodology

The path is long; there are a lot of computations to be done. You don’t want to have to repeat experiments just because you forgot how/where you did them.

You also don’t want to open an old notebook of yours, not understand anything, and be unable to reproduce what was going on there… All of which happened to me (discipline is easily lost when you try to squeeze a few hours out of the weekend for a side project while also taking care of a baby…).

One of the main problems I ran into was having to repeat computations due to dead kernels or other kinds of interruptions. A simple approach that worked for me here is writing code that checkpoints each iteration of an experiment as a JSON file. As an example of a fail-safe, idempotent K-fold cross-validation:
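Something along these lines can serve as a starting point (a minimal sketch: `X` and `y` are assumed to be already-loaded NumPy arrays, and the model and its parameters are placeholders rather than my actual setup):

```python
# A sketch of a fail-safe, idempotent K-fold loop: each fold's result is
# checkpointed as JSON, so a dead kernel only costs the fold in progress.
import json
from pathlib import Path

from scipy.stats import spearmanr
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

EXPERIMENT = "exp_baseline"  # illustrative experiment name
PARAMS = {"n_estimators": 200, "max_depth": 5, "learning_rate": 0.05}

out_dir = Path("results") / EXPERIMENT
out_dir.mkdir(parents=True, exist_ok=True)

kf = KFold(n_splits=5, shuffle=False)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    checkpoint = out_dir / f"fold_{fold}.json"
    if checkpoint.exists():  # idempotent: folds already computed are skipped
        continue

    model = XGBRegressor(**PARAMS)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    corr = spearmanr(preds, y[val_idx]).correlation

    # Archive the parameters along with the fold result so other notebooks
    # can open these JSON files later and interpret the experiment.
    checkpoint.write_text(json.dumps({
        "experiment": EXPERIMENT,
        "fold": fold,
        "params": PARAMS,
        "spearman_corr": corr,
    }, indent=2))
```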

For archiving experiments, one can save all training and model parameters along with each fold result in a JSON file. Those JSON files can then be opened from other notebooks to interpret the results.

Other things that helped me in general:

  • Isolating common code into independent Python files that are later imported by notebooks. This “first-class” code can then be easily unit tested for better robustness and durability.
  • Keeping things organized (notebook titles / folders, etc).
  • Documenting every investigation step in text cells, such that one can quickly remember about something upon reading it again.
  • Moving some computations to the cloud through services like Weights & Biases and Google Colab.

4. The validation dataset is not a test dataset

Numerai provides you with a nice split for training and validation. It also provides nice “diagnostics” for a model based on the validation dataset. This is a bit like poisoned candy: you want to resist the temptation of running multiple experiments and selecting models based only on their performance on the validation dataset!

Instead, just pick the models that worked best in K-fold cross-validation in your experiments, and use the validation dataset to sanity-check that overfitting did not happen. Maybe check that the average correlation on validation is reasonable (≥ 0.02). Then submit as many models as possible to the tournament and see which ones perform better after a few months.
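For that sanity check, something as simple as this is enough (a sketch assuming a `validation_df` DataFrame with “era”, “target” and “prediction” columns, which is an assumption about how you lay out your data, not Numerai’s exact scoring code):

```python
# Per-era Spearman correlation between predictions and targets on validation,
# plus a quick check that the average is not suspiciously low.
from scipy.stats import spearmanr

per_era_corr = validation_df.groupby("era").apply(
    lambda era: spearmanr(era["prediction"], era["target"]).correlation
)
print(f"average validation correlation: {per_era_corr.mean():.4f}")
assert per_era_corr.mean() >= 0.02, "suspiciously low: the model may be overfit"
```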

Here are validation diagnostics of two of my models:

As of today, the one with worse diagnostics has outperformed the other one consistently for months.

5. Optimize a metric that works for you

I found it useful to visualize the K-fold Spearman correlation distribution as a box plot in order to get an overview of how each experiment performed. The box plot clearly shows the spread (related to the Sharpe ratio), the average and median correlation, and the extremes.
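One way to produce that overview (a minimal sketch; the numbers in `results` are made-up placeholders standing in for the fold correlations collected from the JSON checkpoints):

```python
# Box plot of per-fold Spearman correlations for each experiment.
import matplotlib.pyplot as plt

results = {
    "baseline": [0.018, 0.025, 0.031, 0.012, 0.027],   # illustrative values
    "more_trees": [0.022, 0.029, 0.024, 0.019, 0.033],  # illustrative values
}

fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(list(results.values()), labels=list(results.keys()), showmeans=True)
ax.axhline(0.0, color="grey", linewidth=0.5)
ax.set_ylabel("Spearman correlation per fold")
ax.set_title("K-fold correlation distribution per experiment")
plt.show()
```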

What is it that you want from the competition?

  • If you just want to win and showcase some medals, you might want to pick models that have extremely high max(correlation), at the cost of having bigger spread and maybe lower average correlation.

In the first rounds, I submitted a model that was only trained on some parts of the training dataset (excluding outliers that I called “black swans”). This model made it to the top 200 and won several medals. After a while, it declined quickly and its performance changed completely. It knew how to perform really well under certain financial regimes, but very poorly under others.

  • If you want to have a consistently good model over the years, I think you would just pick models with higher average correlation. These would maximize the long-term returns.
  • If you have a big stake and might need to withdraw all or some of it urgently, you probably want to favor models with less risk, i.e. lower spread / variance.

In my experience, it is very rare to encounter a model that has both very low variance and very high average correlation.

Update: after this post was written, Numerai introduced True Contribution, which provides yet another (and very interesting) way of weighting models post-submission.

6. Beware of feature neutralization

“Feature neutralization” is a popular technique in the Numerai competition that generates models with lower variance / less risk overall. It basically alters the predictions to reduce their correlation with certain features considered “risky”, so that the model is not so exposed if those features start to behave differently.
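To make the idea concrete, here is a sketch of how linear feature neutralization is commonly implemented in the community (the “era”/“prediction” column names, the `risky_features` list and the per-era application are assumptions for illustration, and this is not necessarily the exact scheme Numerai’s own tooling uses):

```python
# Linear feature neutralization: subtract from the predictions a proportion of
# their least-squares projection onto the "risky" features, then re-standardize.
import numpy as np
import pandas as pd

def neutralize(predictions: pd.Series, features: pd.DataFrame,
               proportion: float = 1.0) -> pd.Series:
    exposures = features.values
    # Projection of the predictions onto the space spanned by the features.
    projection = exposures @ (np.linalg.pinv(exposures) @ predictions.values)
    neutralized = predictions.values - proportion * projection
    return pd.Series(neutralized / np.std(neutralized), index=predictions.index)

# Usually applied era by era, e.g.:
# df["prediction"] = df.groupby("era", group_keys=False).apply(
#     lambda era: neutralize(era["prediction"], era[risky_features], proportion=0.5)
# )
```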

I feel like it might be too tempting to apply feature neutralization after the fact, without cross-validating the whole pipeline. In fact, I did exactly that more than once over the last year. It is just too easy to submit several models with different feature-neutralization schemes and test them live.

My conclusion is that if one wants to optimize average correlation (long-term returns), feature neutralization makes little sense. Here is the cumulative average correlation of a model with different degrees of feature neutralization: blue has no feature neutralization at all, pink has some, gray some more, and green a lot more.

A closer look at per-round correlations shows roughly the same effect (green is the most heavily neutralized model, blue the one with no neutralization at all):

In rounds 304 and 305 the neutralized model beat the other one, which scored negative correlations for almost a month. That advantage is diluted, however, by the fact that the cumulative returns of such a conservative model are much lower in the long term.

7. There is no secret sauce

I might spare you some headaches here. As someone who has been watching the community for over a year now, I can tell you that there is (probably) no secret recipe for a good model in the competition. Things I have read from people over the last year:

  • It’s all in the data! No fancy modeling needed.
  • It’s all in the modeling! No fancy data processing needed.
  • I created thousands of new features to find new signal, worked great.
  • I trimmed everything down to only 70 features, worked great.

For a problem as complex as the stock market, there is probably no one-size-fits-all solution, and one can probably find good models using different techniques. My conclusion: whatever you do, do it rigorously, and it will pay off. There are only a few general principles/ideas that seem to work well for most: non-linear models (especially tree-based), ensembles, and various forms of grouping per “era” (the time periods in the training dataset).
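As a small illustration of those last points, here is a sketch of era-grouped cross-validation feeding a simple tree ensemble (the DataFrame `df`, the “era”/“feature_”/“target” columns, the hypothetical `live_df` and the hyper-parameters are all assumptions, not my actual model):

```python
# Era-aware cross-validation: folds are grouped by era so whole time periods
# are held out together, then the per-fold models are averaged as an ensemble.
from sklearn.model_selection import GroupKFold
from xgboost import XGBRegressor

feature_cols = [c for c in df.columns if c.startswith("feature_")]
gkf = GroupKFold(n_splits=4)
models = []
for train_idx, val_idx in gkf.split(df[feature_cols], df["target"], groups=df["era"]):
    model = XGBRegressor(n_estimators=200, max_depth=5,
                         learning_rate=0.01, colsample_bytree=0.1)
    model.fit(df[feature_cols].iloc[train_idx], df["target"].iloc[train_idx])
    models.append(model)

# Ensemble the per-fold models by averaging their predictions on new data
# (`live_df` is a hypothetical DataFrame of live tournament features).
ensemble_pred = sum(m.predict(live_df[feature_cols]) for m in models) / len(models)
```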

For me, participating in this challenge has so far been a fun way of learning new concepts and strengthening my Machine Learning skills. As of now, my best model, a rather boring ensemble of XGBs (the all-in-the-modeling approach), ranks better than 95% of all models. The overall return during the year has been around 100%.

How much time should one put into this? For me, one hour per week has been enough. There have also been months where I haven’t done anything new. It is a very slow game because it takes months to see how a model behaves live.

So, my last advice: take it easy, and enjoy!
