AI in Finance: advanced idea research and evaluation beyond backtests

Feature importance: a financial research driver that won’t let you down

Alex Honchar
The Startup
11 min read · Feb 8, 2020


https://blog.darwinex.com/wp-content/uploads/2018/11/backtest-darwins-en-1030x426.png

How do we usually research trading ideas? We get some data hoping it has predictive value, prepare it, extract potential alphas, train forecasting models to predict prices based on those alphas, and then run a long-short strategy backtest based on the trained model. If it looks good, we are happy to invest our money there. Looks pretty legit, right? Yet not everyone agrees with it.

More and more practitioners in financial machine learning are diving deeper than checking the backtest Sharpe ratio and maximum drawdown alongside other popular metrics.

The backtesting process itself is extremely tricky: you can check it yourself in Dr. De Prado's book or his lecture slides, and you most probably already know it from your own experience. But there is a way to check ideas before actually doing any backtesting: by analyzing the ML models behind them and their feature importance. Why can this be beneficial?

  • Investment management researchers often state their hypotheses as "a strategy based on features X and model Y will achieve Sharpe ratio Z". However, research is about forming hypotheses on the cause-effect relationships between financial variables. Backtesting is just one of the tools for validating the business value of a confirmed hypothesis.
  • Since classical backtesting is performed on historical data, we tend to explain random patterns and correlations (which don't imply causation) in the simulated past trades to justify our assumptions. Later, based on these "findings", we try to "fix" our models and strategies several times, which leads to the next problem.
  • Backtesting is prone to multiple hypothesis testing: the more you optimize a strategy based on the backtest of a single security, the more likely you are diving deep into a false discovery. Hence, it is very easy to "overfit" predictive models to perform well on the backtest, which is exactly what we don't want to put our capital on.

In this blog post, we will see what methods exist to evaluate trading ideas without backtests, using several data sources as examples. We will see that this is not just a tool for machine learning practitioners, but a strong framework for checking business hypotheses apart from the "trading" or portfolio management metrics. For practitioners, as always, the source code of all the experiments is on my GitHub.

Practical benefits

Apart from the benefits related to research and development (such as the mathematical validation and correctness of ideas), there are several side outcomes that might be handy for financial institutions. While working with our customers at Neurons Lab, we have experienced the following scenarios where feature importance analysis brought a direct benefit:

  • If you're constantly evaluating new data sources and doing market and alternative data research for internal use or for sale, correct feature importance is crucial to identify which data has real predictive value;
  • In live trading, selecting only the important features for making trades is one way to reduce the risks related to the performance degradation of machine learning models;
  • By tracking how feature importance evolves over time, we can explain the dynamics of economic factors better than conventional econometric models can; this also serves risk management and early trend detection.

Feature Importance

In machine learning and mathematical modeling, no one likes "black box" models. Even for complex deep learning models in NLP and computer vision, there are ways to visualize their performance and flaws:

Interpreting deep neural networks in computer vision and NLP is a common thing, so why don't we do it in finance yet?

By the way, the p-value, well known in statistics, is not just an imperfect way to check feature importance; it is not even recommended by the ASA. So what do we have for finance?

  • Single Feature Importance (SFI): works for any classifier, OOS (out-of-sample), and analyzes every variable separately, which can be a problem if some features only work together. It also takes far too long with a large number of features;
  • Mean Decrease Impurity (MDI): an IS (in-sample) method designed specifically for tree-based algorithms (decision trees, random forests), based on the impurity reduction at the inner tree splits;
  • Mean Decrease Accuracy (MDA): an interesting OOS method that works with any algorithm and, importantly, directly measures how much accuracy drops after permuting the column with the feature of interest (a minimal sketch follows this list).
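To make the last idea concrete, here is a minimal sketch of MDA as permutation importance, assuming a fitted classifier clf and an out-of-sample DataFrame X_test with labels y_test (all names are illustrative, not from the original code):

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

def mean_decrease_accuracy(clf, X_test, y_test, n_repeats=10, random_state=42):
    rng = np.random.RandomState(random_state)
    baseline = accuracy_score(y_test, clf.predict(X_test))
    importances = {}
    for col in X_test.columns:
        scores = []
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            # shuffle a single column, breaking its link with the labels
            X_perm[col] = rng.permutation(X_perm[col].values)
            scores.append(accuracy_score(y_test, clf.predict(X_perm)))
        # importance = average drop in accuracy after the permutation
        importances[col] = baseline - np.mean(scores)
    return pd.Series(importances).sort_values(ascending=False)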

You can read more about the above methods in De Prado’s book. Also, there are a bit more advanced ways to do it:

  • Clustered Feature Importance (CFI): when two features share information, shuffling only one of them in MDA may not reduce performance much, because the other one compensates for it; that's why it makes sense to cluster features that "go together" and shuffle whole clusters instead, read more here;
  • SHAP: a method to explain individual predictions. SHAP is based on the game-theoretically optimal Shapley values and is very widely used in modern machine learning practice, read more here (a minimal usage sketch is below).
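For reference, a minimal SHAP usage sketch for a tree-based model (assuming the shap package is installed and model is a fitted random forest or gradient boosting classifier) looks like this:

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # per-sample, per-feature attributions
shap.summary_plot(shap_values, X)        # global feature importance overview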

Last but not least, we would like a way to double-check the adequacy of the important features we found. We can do it with the help of PCA, which is unsupervised and hence cannot overfit to the labels. We can calculate the "correlation" (weighted Kendall's tau) between the PCA "feature importance", i.e. the eigenvalue magnitudes, and any other feature importance method. We expect this measure of similarity to be at least somewhat positive.
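One possible way to turn this cross-check into code is sketched below: per-feature PCA scores are built from the squared loadings weighted by the explained variance, and then compared to an importance ranking with scipy's weighted Kendall's tau (the exact construction of the PCA scores is my assumption here, not necessarily the one used in the original experiments):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import weightedtau

def pca_importance(X):
    Z = StandardScaler().fit_transform(X)
    pca = PCA().fit(Z)
    # weight each feature's squared loadings by the explained variance of each component
    scores = (pca.components_ ** 2).T @ pca.explained_variance_
    return pd.Series(scores, index=X.columns)

def pca_similarity(importances, X):
    # importances: pd.Series indexed by the same feature names as X.columns
    pca_scores = pca_importance(X).reindex(importances.index)
    tau, _ = weightedtau(importances.values, pca_scores.values)
    return tau  # expected to be at least somewhat positive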

A note on cross-validation

Since most feature importance methods require training models and evaluating their accuracy with respect to the features, we need a way to test this accuracy correctly. Of course, it should be done on out-of-sample data, but in practice we don't just take one "piece of data in the future" and evaluate the performance there. Typical ways to do it are K-Fold cross-validation and Time Series cross-validation; you can read more about them here. In finance, for various reasons such as non-IID observations and feature leakage, we have to separate the train and test sets more strictly: purging training observations whose labels overlap with the test set and adding an embargo not just between the train and test sets, but between the different folds as well. You can read more about Purged K-Fold cross-validation here, but intuitively it looks like below (a minimal code sketch follows the illustration):

https://www.slideshare.net/TaiLiLuo/machine-learning-time-series-analysis-finlab-cto
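A minimal sketch of such a purged split with an embargo is shown below. It assumes the observations are ordered in time and that t1 is a pandas Series mapping each observation's start time to the time its label ends (for example, the vertical barrier); this is an illustration, not the exact implementation from the book:

import numpy as np
import pandas as pd

def purged_kfold_indices(t1, n_splits=5, embargo_pct=0.01):
    indices = np.arange(len(t1))
    embargo = int(len(t1) * embargo_pct)
    for test_idx in np.array_split(indices, n_splits):
        test_start, test_end = test_idx[0], test_idx[-1]
        test_t0 = t1.index[test_start]   # first start time inside the test fold
        train_idx = []
        for i in indices:
            if test_start <= i <= test_end:
                continue   # the test fold itself
            if i < test_start and t1.iloc[i] >= test_t0:
                continue   # purge: the label overlaps the test period
            if test_end < i <= test_end + embargo:
                continue   # embargo right after the test fold
            train_idx.append(i)
        yield np.array(train_idx), test_idx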

Simulated data experiment

First, to evaluate the whole pipeline, we can experiment with simulated data for binary classification, where some of the features are predictive, some are redundant (i.e. just combinations of the predictive ones), and some are completely random. Our feature importance methods have to show us which features are not important, so we can rely only on real "alphas" in our "trading". To make it a bit closer to a trading scenario, we will make a dataset with a very low signal-to-noise ratio: 3 informative features, 2 redundant features, and 15 random features, denoted "I", "R" and "N" respectively.
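A minimal sketch of how such a dataset can be generated with scikit-learn is below (the sample size is an arbitrary choice for illustration):

import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10_000,
    n_features=20,       # 3 informative + 2 redundant + 15 noise
    n_informative=3,
    n_redundant=2,
    n_repeated=0,
    shuffle=False,       # keep the informative/redundant/noise columns in order
    random_state=42,
)
cols = [f"I_{i}" for i in range(3)] + [f"R_{i}" for i in range(2)] + [f"N_{i}" for i in range(15)]
X = pd.DataFrame(X, columns=cols)

So, let's see what the feature importance algorithms showed us: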

MDI, MDA and SFI feature importances on simulated data
CFI, SHAP and PCA feature importances on simulated data

Visually, all methods but SFI performed well, identifying the "I" and "R" features in their top 5. Also, if we calculate the weighted Kendall's tau between the PCA eigenvalues and the CFI importances, it is 0.44, which shows a high correspondence.

Market data experiment

Now, let's take a real financial time series and repeat the experiment. It could be AAPL from 2000 to 2020. First, we make the time series stationary with fractional differentiation (a minimal sketch of this step follows the feature list below) and extract the following feature set:

  • 5 statistical features: mean, standard deviation, skewness, kurtosis, and autocorrelation of the fractionally differentiated close prices;
  • 5 trading indicators: RSI, OBV, ATR, Hilbert Transform period and phase;
  • 5 random "features": draws from normal, uniform, binomial, Poisson and logistic distributions that represent non-informative features from different sources.
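Here is a minimal sketch of fixed-width-window fractional differentiation, assuming a pandas Series of close prices; the differentiation order d and the weight cutoff are illustrative choices, not the exact values used in the experiments:

import numpy as np
import pandas as pd

def frac_diff_weights(d, threshold=1e-4):
    # w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k, truncated once the weights become negligible
    w = [1.0]
    k = 1
    while abs(w[-1]) > threshold:
        w.append(-w[-1] * (d - k + 1) / k)
        k += 1
    return np.array(w[::-1])

def frac_diff(series, d=0.4, threshold=1e-4):
    w = frac_diff_weights(d, threshold)
    width = len(w)
    values = series.values
    out = [np.dot(w, values[i - width + 1:i + 1]) for i in range(width - 1, len(values))]
    return pd.Series(out, index=series.index[width - 1:])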

Next, binary classification labels are created with triple-barrier labeling (sketched below). The rolling window for the features was 14 days, the prediction horizon 7 days. You can read more about differentiation and the triple barrier in my previous blog post, and check the source code of the current preprocessing on my GitHub. How will feature importance behave in this setting? We will check the features on the data from 2000 to 2010 and then test the "out-of-sample" performance with different feature sets on the data from 2010 to 2020.
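For reference, a minimal sketch of triple-barrier labeling: for each entry we set symmetric profit-taking and stop-loss barriers scaled by recent volatility, plus a vertical (time) barrier at the prediction horizon. It assumes close is a pandas Series of prices; the window sizes mirror the 14/7-day setup above, while the barrier width is an illustrative choice:

import numpy as np
import pandas as pd

def triple_barrier_labels(close, horizon=7, vol_window=14, barrier_mult=1.0):
    vol = close.pct_change().rolling(vol_window).std()
    labels = pd.Series(index=close.index, dtype=float)
    for i in range(len(close) - horizon):
        entry = close.iloc[i]
        path = close.iloc[i + 1:i + 1 + horizon] / entry - 1.0
        up, down = barrier_mult * vol.iloc[i], -barrier_mult * vol.iloc[i]
        hit_up = path[path >= up].index.min()
        hit_down = path[path <= down].index.min()
        if pd.notna(hit_up) and (pd.isna(hit_down) or hit_up <= hit_down):
            labels.iloc[i] = 1   # profit-taking barrier touched first
        elif pd.notna(hit_down):
            labels.iloc[i] = 0   # stop-loss barrier touched first
        else:
            labels.iloc[i] = int(path.iloc[-1] > 0)   # vertical barrier: sign of the return
    return labels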

MDI, MDA and SFI feature importances on AAPL dataset from 2000 to 2010
CFI, SHAP and PCA feature importances on AAPL dataset from 2000 to 2010

We can clearly see that all of the approaches still assign very high importance to at least some of the random features.

Well, no one said the performance would be perfect :) But at least now we can cut off most of the trash features and reduce our risks in the future. It would also be smart to check feature importance on the same data period but for other similar assets, for example MSFT and IBM; this could also give us ideas about the real market drivers. Let's visualize MDI, MDA, and SHAP:

MDI, MDA, and SHAP feature importances on MSFT dataset from 2000 to 2010
MDI, MDA, and SHAP feature importances on IBM dataset from 2000 to 2010

Based on the top-5 SHAP values from these three assets, we can select the following as the most important features:

CUSTOM_IMPORTANT_FEATURES = [
    'feat_mean_frac_close',
    'feat_OBV_volume',
    'feat_kurt_frac_close',
    'feat_ATR_close',
    'feat_std_frac_close',
    'feat_HT_DCPERIOD_close',
]

Now we run Purged K-Fold cross-validation on the OOS data only and calculate several metrics. Since RandomForest uses bootstrapping at its core (an issue I would like to address in another blog post) and the results may vary from run to run, we run 100 experiments and average them to compute metrics such as the F1 score, Matthews Correlation Coefficient, and ROC-AUC score (a minimal sketch of this loop is below):
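A minimal sketch of this averaging loop, reusing the purged_kfold_indices sketch from above (X_oos, y_oos, t1_oos and the feature lists are assumed to be prepared as described; the hyperparameters are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score

def evaluate(features, X_oos, y_oos, t1_oos, n_runs=100):
    scores = {"f1": [], "mcc": [], "auc": []}
    for seed in range(n_runs):
        for train_idx, test_idx in purged_kfold_indices(t1_oos, n_splits=5):
            clf = RandomForestClassifier(n_estimators=200, random_state=seed)
            clf.fit(X_oos.iloc[train_idx][features], y_oos.iloc[train_idx])
            pred = clf.predict(X_oos.iloc[test_idx][features])
            proba = clf.predict_proba(X_oos.iloc[test_idx][features])[:, 1]
            scores["f1"].append(f1_score(y_oos.iloc[test_idx], pred))
            scores["mcc"].append(matthews_corrcoef(y_oos.iloc[test_idx], pred))
            scores["auc"].append(roc_auc_score(y_oos.iloc[test_idx], proba))
    return {metric: np.mean(values) for metric, values in scores.items()}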

Out-of-sample (AAPL 2010–2020) Purged K-Fold cross-validation results based on different metrics

As we can see, on average the models trained in-sample on the carefully chosen set of features perform better out-of-sample!

Numerai data experiment

Numerai is a decentralized financial forecasting challenge where users get obfuscated and anonymized financial data and create and submit models that are later combined into a meta-model trading on the real market. Based on their contributions to this meta-model and its performance in live trading, users get paid in crypto. The problem is that we have no idea about the data and the labels: the names are encrypted, the labels are binned, the features are shuffled within each time period (a so-called era), and we can't apply any of our financial knowledge to get the most out of it. It looks like the only thing we can do is iterate over the hyperparameters of XGBoost, but let's see if feature importance analysis can get us somewhere :)

A sample from Numerai dataset: obfuscated, shuffled and binned features to predict from. A real challenge!

What if the eras are actually mixed too? Let's rather do a normal group K-Fold cross-validation with the eras corresponding to the groups (to speed up the process, I've grouped the eras into 10 "buckets"; a minimal sketch is below). Also, to speed up the calculations, we will use a linear regression model here and use only MDA as the feature importance analyzer on the training data.
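A minimal sketch of this grouped split, assuming a Numerai training DataFrame df with an era column, a list of feature columns feature_cols and a target column (the exact column names are assumptions based on the public Numerai format):

import pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LinearRegression
from scipy.stats import spearmanr

era_num = df["era"].str.extract(r"(\d+)").astype(int)[0]
groups = pd.qcut(era_num, q=10, labels=False)   # 10 era "buckets" used as CV groups

for train_idx, test_idx in GroupKFold(n_splits=5).split(df[feature_cols], df["target"], groups):
    model = LinearRegression().fit(df[feature_cols].iloc[train_idx], df["target"].iloc[train_idx])
    preds = model.predict(df[feature_cols].iloc[test_idx])
    corr, _ = spearmanr(preds, df["target"].iloc[test_idx])   # rank correlation, Numerai-style score
    print(f"fold rank correlation: {corr:.4f}")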

Click the image to see all MDA feature importances for the Numerai dataset (if you really feel like it)

Let's select the most important features now. Since the features in the Numerai dataset are grouped into 6 categories ('feature_charisma', 'feature_constitution', 'feature_dexterity', 'feature_intelligence', 'feature_strength', 'feature_wisdom'), let's select the top 75% from each of them based on the MDA feature importances to preserve consistency (a minimal selection sketch is below), and run cross-validation experiments on the out-of-sample data. The histograms represent different eras of the validation set. The ML model in all cases was a simple linear regression.
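The per-group selection can be sketched as follows, assuming mda_imp is a pandas Series of MDA importances indexed by feature name (computed as above):

GROUPS = ["charisma", "constitution", "dexterity", "intelligence", "strength", "wisdom"]

selected = []
for group in GROUPS:
    group_feats = [f for f in mda_imp.index if f.startswith(f"feature_{group}")]
    ranked = mda_imp.loc[group_feats].sort_values(ascending=False)
    selected += list(ranked.index[: int(0.75 * len(ranked))])   # keep the top 75% of each family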

Correlation score, Numerai score (ranked correlation) and MSE on out-of-sample Numerai dataset

We can see, again, that the model trained on the "cleaner" dataset performed better than the models trained on the full dataset, even though we didn't know what each feature means and relied purely on the MDA values. We still don't know how it will perform live, but it is already a good starting point.

Conclusions

As we can see, to form our hypotheses about financial markets and evaluate them, we indeed need machine learning and feature importance. Backtesting can come later, when we define the actual trading and risk management rules; but when we are thinking about new variables, new factors, or new alphas that will help us beat the market, backtesting is a premature step at that stage. We saw that the feature importance of machine learning models alone can help us identify true market drivers and later build profitable strategies on top of them.

You might also consider that correlations and dependencies in the markets are constantly changing, so you may want to track not just the average feature importance in the past, but how it changes over time! At Neurons Lab this has helped us build robust trend spotting and risk management systems for our clients. Ping us if you need a hand implementing such solutions for yourself. In case you're an experienced practitioner, check the code for the experiments in this article in my repository.

I hope you enjoyed the article. In the next one we will focus more on how to trust the backtesting process itself, which is still necessary if we're talking about building profitable and robust trading strategies, and which is so often done wrong today by many practitioners and institutions.

P.S.
You can also connect with me on my Facebook blog or LinkedIn, where I regularly post AI articles or news that are too short for Medium, and on Instagram for some more personal content :)
