SIMULATIONS, RISKS, AND METRICS
AI in Finance: how to finally start to believe your backtests [1/3]
On the dangers of walk-forward backtesting, how to measure them, and how to actually be right instead of just feeling right
Quantitative research is a process with many intermediate steps, each of which has to be carefully and thoroughly validated. Asset selection, data collection, feature extraction, modeling: all these phases take time and are delivered and tested by different teams. But what does the investor want to see at the end? A “flawless” backtest on historical data with a high Sharpe ratio, alpha with respect to the market, and maybe some fund-related metrics such as capacity, leverage, average AUM, etc. However, such an approach tells us nothing about the future performance of the strategy; such backtests are easily overfitted and become downright dangerous when they are used as a measure of “success” of a trading idea (see more details in my article on financial idea discovery).
There is nothing easier than producing a backtest of a strategy that performed well in the past, but would you risk your capital by betting on it? Most investors do.
In the next couple of articles, I would like to thoroughly re-examine the very idea of testing a trading strategy.
- First, I will explain the dangers of the standard approach and show some alternative metrics that give more insight into the strategy’s performance.
- Then, I would like to show how we can get rid of the single walk-forward backtest in favor of simulations, scenarios, and probabilistic interpretations of well-known metrics. This will help us to switch from an analysis of the past to the estimation of future performance.
- Last but not least, I would like to enrich the notion of risk: investors and portfolio managers normally calculate and account for the risk of the portfolio, but not of the strategy itself, which makes it hard to control and understand the strategy in live trading.
Like most of my recent articles, this one is inspired by the books of Dr. López de Prado, and I recommend them if you want to dive deeper into the topic. As always, you can find the source code on my GitHub.
Dangers of Backtesting
In 2014 a team of practitioners at Deutsche Bank, led by Yin Luo, released a report called “Seven Sins of Quantitative Investing”. It covers things you have most probably already heard about: survivorship bias, look-ahead bias, storytelling, data snooping, transaction costs, outliers, and short-selling issues. I assume that we already know how to obtain the right data and process it correctly, how not to look into the future, and how to perform feature importance analysis instead of explaining random patterns found in the backtest.
However, to that crucially important list I would like to add a few more points that cast doubt on the very idea of the backtest:
- It is a single path of a stochastic process. Markets can be described by an extremely complex stochastic process with millions of variables. We could imagine sampling many different scenarios and outcomes of such a process, but we observe only the one that actually happened. Relying on a single realization, out of infinitely many, of a tremendously complex system to test a trading algorithm is insane by itself.
- It doesn’t explain the financial discovery. A backtest tells us what would have happened in the past, but it doesn’t explain why. Relying on Sharpe or Sortino ratios alone means staying blindfolded in a world of uncertainty. Alternative approaches such as feature importance analysis allow us to build financial theories first and only then test them with a backtest or another validation scheme.
- It doesn’t allow us to forecast performance and risks. Your strategy may have survived the crisis of 2008, but does that mean it would have passed the current COVID one just as easily? How do you know that you didn’t overfit to one particular scenario, or to a particular set of factors? How do you estimate the probabilities of such outcomes? The classical backtest answers none of these questions.
To summarize: even if we have avoided the deadly sins listed by the Deutsche Bank team, our backtest only tells us how some set of rules would have performed in the past, nothing else. That’s why it is crucially important to have a richer strategy evaluation toolkit that allows a probabilistic, scenario-based, and risk-focused approach. This is the approach we employ at Neurons Lab and the one we are going to develop and test over these couple of articles.
AI-based strategy overview
Let’s define a typical forecasting-based strategy: there is a signal source telling us to go long or short on some instrument, and we act accordingly. For simplicity, we will perform a typical time series forecasting exercise: take the N previous days, extract factors from them, predict the price change for the next day, and trade on it. This is a very primitive setup that doesn’t take into account transaction costs, slippage, shorting constraints, or take-profit and stop-loss barriers, but let us focus on discovering the principles of market understanding first and on execution later. We will always have time to discard a strategy because it doesn’t fit the market or exchange rules.
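To make the setup concrete, here is a minimal sketch (in Python with pandas, not the exact code from the repository) of how next-day direction signals can be turned into strategy returns; the variable names are illustrative assumptions.

```python
import pandas as pd

def strategy_returns(close: pd.Series, signals: pd.Series) -> pd.Series:
    """Turn +1/-1 next-day direction signals into daily strategy returns.

    `signals[t]` is the position taken at the close of day t, so it earns
    the return realized between day t and day t+1. Transaction costs and
    slippage are deliberately ignored in this sketch.
    """
    next_day_returns = close.pct_change().shift(-1)   # return realized over the next day
    return (signals * next_day_returns).dropna()

# Hypothetical usage: `close` is a price series, `signals` comes from the model
# strat_ret = strategy_returns(close, signals)
# benchmark_ret = close.pct_change().dropna()         # buy-and-hold benchmark
```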
Underlying data
Let’s focus on the banking sector and learn signals for speculation on tickers such as C, DB, BAC, WFC, and other well-known banks. Because of their crash during the 2008 crisis, it will be interesting to see how our ML models deal with this regime change. We will rely mainly on statistical factors, grouped into 3 categories:
- Statistical features: min, max, autocorrelation, and statistical moments
- Technical indicators: classical features popular among traders
- “AFML features”: for more details check the celebrated book :)
We will extract these factors in a rolling-window fashion (the statistical ones on the fractionally differentiated time series) and sample inputs and outputs with the i.i.d. assumption in mind. The input is the rolling features “today”; the prediction target is the fixed-horizon close price change for the next day.
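As an illustration, a minimal sketch of rolling statistical feature extraction could look like the following (the function name and window size are assumptions, and fractional differentiation is replaced by a simple log-return transform for brevity):

```python
import numpy as np
import pandas as pd

def rolling_features(close: pd.Series, window: int = 14) -> pd.DataFrame:
    """Compute simple rolling statistical features on a stationarized series."""
    # The article computes statistical features on the fractionally differentiated
    # series; log returns are used here as a simpler stand-in.
    x = np.log(close).diff()
    feats = pd.DataFrame(index=close.index)
    feats["min"] = x.rolling(window).min()
    feats["max"] = x.rolling(window).max()
    feats["mean"] = x.rolling(window).mean()
    feats["std"] = x.rolling(window).std()
    feats["skew"] = x.rolling(window).skew()
    feats["kurt"] = x.rolling(window).kurt()
    feats["autocorr"] = x.rolling(window).apply(lambda w: pd.Series(w).autocorr(), raw=False)
    return feats.dropna()

# Illustrative target: sign of the next-day close-to-close change
# y = np.sign(close.pct_change().shift(-1)).loc[feats.index]
```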
Forecasting model
Following the advice of Dr. López de Prado, we will use bagging of decision trees, with a correction towards i.i.d. sampling. We expect that ensembling will help deal with overfitting and that a tree-based model will help with the subsequent feature importance analysis. In general, we want to keep the forecasting model as simple as possible and focus on the quality of our factor insights.
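A minimal sketch of such a model with scikit-learn could look like this (the hyperparameter values below are illustrative assumptions, not necessarily the ones used in the repository):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging of shallow decision trees; subsampling (max_samples < 1.0) reduces
# the impact of overlapping, non-i.i.d. training samples to some extent.
model = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5, class_weight="balanced"),
    n_estimators=100,
    max_samples=0.5,
    max_features=1.0,
    bootstrap=True,
    n_jobs=-1,
    random_state=42,
)
# model.fit(X_train, y_train)
# proba = model.predict_proba(X_test)   # used later as "model certainty"
```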
Hyperparameters
We will train the model on 5 years of data, then trade with it for the next 3 years, and repeat the process until the data ends (from 2000 to 2020). This way we should be able to adapt to constantly changing regimes at least to some extent.
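The walk-forward scheme described above can be sketched roughly as follows (a simplified illustration with assumed names; the real pipeline also re-extracts features and retrains the model per window):

```python
import pandas as pd

def walk_forward_windows(index: pd.DatetimeIndex, train_years: int = 5, trade_years: int = 3):
    """Yield (train_mask, trade_mask) boolean masks for a walk-forward scheme."""
    start = index.min()
    while True:
        train_end = start + pd.DateOffset(years=train_years)
        trade_end = train_end + pd.DateOffset(years=trade_years)
        train_mask = (index >= start) & (index < train_end)
        trade_mask = (index >= train_end) & (index < trade_end)
        if trade_mask.sum() == 0:
            break
        yield train_mask, trade_mask
        start = start + pd.DateOffset(years=trade_years)  # roll the window forward

# for train_mask, trade_mask in walk_forward_windows(X.index):
#     model.fit(X[train_mask], y[train_mask])
#     signals[trade_mask] = model.predict(X[trade_mask])
```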
“Normal” backtesting and analysis
Let’s start with the Deutsche Bank (DB) ticker: run the above-mentioned strategy and calculate the returns, the strategy performance, and the Sharpe ratio of the benchmark (buy-and-hold) compared to the ML strategy.
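For reference, a minimal sketch of the annualized Sharpe ratio used for this comparison (assuming daily returns and a zero risk-free rate) is:

```python
import numpy as np
import pandas as pd

def annualized_sharpe(returns: pd.Series, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of per-period returns, assuming a zero risk-free rate."""
    return np.sqrt(periods_per_year) * returns.mean() / returns.std()

# sharpe_benchmark = annualized_sharpe(benchmark_ret)
# sharpe_strategy = annualized_sharpe(strat_ret)
```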
What about the numbers?
- Average returns: -0.00026 of benchmark, 0.0008 of ML strategy
- Sharpe ratio: -0.14 of benchmark, 0.74 of ML strategy
Looks pretty good, but what actual insights do we get from this backtest? Literally none! Maybe our model has simply overfitted to outputting the “short” signal almost all the time on a bearish market and will fail horribly in the future :) Let’s review some more interesting metrics that can help us understand the model’s performance in more detail!
Alternative metrics
Data statistics
- Feature group importance: by showing which data source was important at which moment (market, fundamental, sentiment, alternative), we can detect regimes and partially explain patterns
- Feature exposure: if a model relies too much on one particular feature, it can become unreliable in the future. We can track such inconsistencies and treat them as a source of risk (see the sketch after this list)
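One possible way (not necessarily the one used in the repository) to compute feature importances of the bagged trees and aggregate them by group is sketched below; `feature_groups` is a hypothetical mapping from feature name to its group.

```python
import numpy as np
import pandas as pd

def feature_importances(bagging_model, feature_names) -> pd.Series:
    """Average impurity-based importances over the trees of a fitted BaggingClassifier.

    Assumes max_features=1.0 so that every tree sees all features.
    """
    imps = np.mean([tree.feature_importances_ for tree in bagging_model.estimators_], axis=0)
    return pd.Series(imps, index=feature_names).sort_values(ascending=False)

def group_importance(importances: pd.Series, feature_groups: dict) -> pd.Series:
    """Sum per-feature importances within each group (e.g. 'statistical', 'technical', 'afml')."""
    return importances.groupby(lambda f: feature_groups.get(f, "other")).sum()

# groups = {"min": "statistical", "rsi_14": "technical"}   # hypothetical mapping
# group_importance(feature_importances(model, X.columns), groups)
```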
Model statistics
- “Accuracy proxy”: the Matthews correlation coefficient (MCC) is a fairly general measure of the “accuracy” of a predictive model that tells us how much we can rely on it. It can also be helpful for detecting regimes where our model is unreliable.
- Model certainty: the model returns probabilities, so we can track how confident its predictions are and study how confidence affects risk and returns (a sketch of both statistics follows this list)
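Both model statistics can be sketched with scikit-learn as follows (variable names such as `y_true` and `X_test` are illustrative):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# "Accuracy proxy": MCC between the realized and the predicted direction
# mcc = matthews_corrcoef(y_true, model.predict(X_test))

def model_certainty(proba: np.ndarray) -> np.ndarray:
    """Certainty of each prediction: the probability assigned to the chosen class."""
    return proba.max(axis=1)

# proba = model.predict_proba(X_test)
# certainty = model_certainty(proba)    # e.g. track its average per regime
```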
Efficiency statistics
- Sharpe ratios (annualized, probabilistic, deflated): basic performance metrics that, however, have to be corrected for fat-tailed and skewed return distributions and for multiple testing (a sketch of the probabilistic Sharpe ratio follows this list).
- “Smart” Sharpe ratios: we want our strategy to lack autocorrelation, be memoryless, and hence avoid long burn periods. Penalizing the Sharpe ratio with autocorrelation helps us select such strategies via optimization.
- Information ratio: this metric helps us compare our strategy to the underlying instrument or the benchmark beyond the “alpha”. It is the annualized ratio between the average excess return and the tracking error.
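As an example, the probabilistic Sharpe ratio can be sketched as below. It estimates the probability that the true Sharpe ratio exceeds a benchmark value SR*, correcting for skewness, kurtosis, and track-record length; this is my reading of the formula published by López de Prado, not the repository code.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm, skew, kurtosis

def probabilistic_sharpe_ratio(returns: pd.Series, sr_benchmark: float = 0.0) -> float:
    """P[true SR > sr_benchmark], adjusting the observed (non-annualized) Sharpe ratio
    for skewness, kurtosis, and the number of observations."""
    sr = returns.mean() / returns.std()
    n = len(returns)
    g3 = skew(returns)                       # skewness
    g4 = kurtosis(returns, fisher=False)     # raw kurtosis (3 for a normal distribution)
    denom = np.sqrt(1 - g3 * sr + (g4 - 1) / 4 * sr ** 2)
    return norm.cdf((sr - sr_benchmark) * np.sqrt(n - 1) / denom)

# psr = probabilistic_sharpe_ratio(strat_ret, sr_benchmark=0.0)
```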
Runs statistics
- Minimum Required Track Record Length (MinTRL): answers the question “How long should a track record be in order to have statistical confidence that its Sharpe ratio is above a given threshold?” If a track record is shorter than that, we do not have enough confidence that the observed Sharpe ratio is above the designated threshold.
- Drawdown & Time Under Water: investors are interested not only in the risk of the maximum drawdown but also in how long they are going to stay in it. Combined with the data and model statistics, this can give insights into why and when the strategy will not perform well (see the sketch after this list).
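A minimal sketch of computing the drawdown series and the time under water from strategy returns (variable names are assumptions) could look like this:

```python
import pandas as pd

def drawdown_series(returns: pd.Series) -> pd.Series:
    """Drawdown at each point: distance of the equity curve from its running maximum."""
    equity = (1 + returns).cumprod()
    running_max = equity.cummax()
    return equity / running_max - 1.0

def max_time_under_water(returns: pd.Series) -> int:
    """Longest streak (in periods) spent below a previous equity high."""
    dd = drawdown_series(returns)
    longest = current = 0
    for value in dd:
        current = current + 1 if value < 0 else 0
        longest = max(longest, current)
    return longest

# dd = drawdown_series(strat_ret)
# print(dd.min(), max_time_under_water(strat_ret))
```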
Generalization statistics
- Market generalization: since we aim to operate in a specific sector, we look for factors that are consistent across different instruments, so that we know we didn’t “overfit” our discovery to one single security. There is no standardized way to measure such generalization, so we will improvise :)
New metrics and insights
First, let’s calculate all the above-mentioned metrics both for the benchmark and the strategy on DB ticker:
We can see that, in terms of the main performance and information value metrics, our ML-based strategy indeed outperforms the baseline! What about overfitting? The probabilistic SR equals 1.0, which suggests the discovery is genuine, and the deflated SR (calculated from multiple runs of the bagging model) also equals 1.0, which confirms that our model is adequate. The average MCC is pretty high, and even divided by the standard deviation of the MCCs (a sort of Sharpe ratio for the accuracy score) it keeps a high value. What about feature importance? We can see that some features sit stably at the bottom of the importance ranking while others rank very high. Empirically, volume-based factors perform well on this ticker.
Since we already calculate feature importance, why don’t we use it to improve the results? For example, we could remove the bottom-3 least important features (assuming they carry no information), as sketched below. Will it change the performance?
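A minimal sketch of this pruning step, assuming a fitted `model` and feature matrices `X_train` / `X_test` as in the earlier sketches (all names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.base import clone

# Average impurity-based importances over the bagged trees (assumes max_features=1.0).
imps = np.mean([tree.feature_importances_ for tree in model.estimators_], axis=0)
imps = pd.Series(imps, index=X_train.columns).sort_values()

bottom_features = imps.index[:3]                        # bottom-3 least important features
X_train_pruned = X_train.drop(columns=bottom_features)
X_test_pruned = X_test.drop(columns=bottom_features)

model_pruned = clone(model)                             # unfitted copy, same hyperparameters
model_pruned.fit(X_train_pruned, y_train)
```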
Visually we can confirm that our updated strategy with the updated feature set (it is re-evaluated every 5 years) beats our baseline ML strategy that uses all the features. From the metrics point of view we also see an improvement, but note the MRTL (Minimum Required Track Record Length), which now tells us that we need many more days of observing both strategies to really confirm the boost (and this makes sense, since they are extremely correlated). Also note the drop in the average certainty: removing some features can actually hurt the model.
Now we need to check whether our strategy, and the hypothesis about the non-linear dependency on the given features, holds for other banking tickers.
Can we generalize to other tickers?
BAC ticker (Bank of America)
Well… it doesn’t look very good. If we examine the metrics in detail, we can see that this is clearly the fault of the ML models. The “important” features also differ a lot from the Deutsche Bank example. Maybe it’s just one outlier; let’s try the next one!
WFC ticker (Wells Fargo & Co)
The next attempt is not profitable at all either! However, here the situation is even trickier: the MCC metrics are not negative (though very close to zero), but the strategy performance is horrible. Let’s check more examples, but for now it’s worth considering the idea that the Deutsche Bank example was just a lucky one.
C ticker (Citigroup)
Check the graphs! Do you see it? Finally, we got it right! Wait, let me check that deflated SR. Zero? A sign of overfitting… and what about the MCCs? Again close to zero! Do we really think that a model that is almost random could beat our benchmark that convincingly? No way we can trust this result!
CS (Credit Suisse Group AG)
Alright, if it worked so well on DB, maybe it’s about being in Europe? Let’s check Credit Suisse stock then! And… we fail again :)
HSBC (HSBC Holdings plc)
Just out of curiosity, I tried the strategy on another European bank, HSBC, and visually it looked very positive. But again, the devil lies in the model performance and the MCCs: we just got lucky here once more with ML models that didn’t actually learn anything useful.
ING (ING Groep NV)
I wanted to finish the experiments on a positive note; however, miracles, if they happen at all, don’t happen in quantitative finance, especially if we measure everything correctly ;) A strategy that looks good in terms of performance and MCC unfortunately has a deflated SR of zero, which means a big chance of not performing well out-of-sample. It also performs worse with the bottom features removed, which is odd in itself.
Conclusions
The main outcome of this work is straightforward: we have learned how to evaluate a backtest of an ML-based strategy along many more dimensions than we are used to. We need to fight overfitting, multiple comparisons, and the “black-boxness” of our models, and to fix these issues we first need to spot them. The newly proposed statistics do the following:
- explain factors in the data even with non-linear models
- explain the connection between model and strategy performance
- enrich the notion of “beating the benchmark”
- give a hint about generalization and overfitting
The results themselves were discouraging, but I hope no one expected that we could actually beat the market and produce new economic findings with a couple of formulas from a book :) Checking the important features, we could see that there is no consensus between the different tickers: different factors end up at the top every time, which doesn’t allow us to build a strong theory about the statistical factors that drive banking stocks. What could we do to actually make it work?
- Find new features that capture the differences between banks (fundamentals, macro, alternative data); in 2020 it’s not very smart to rely on statistical factors alone ;)
- Change the hyperparameters: who said that 5 years for training, 3 years for trading, a 14-day averaging window, and so on are the correct numbers? We actually need to run a hyperparameter search to find the optimal values. We will cover this in detail in the next post; meanwhile, check this earlier article of mine (it has mistakes here and there, but the search part is correct).
- Work more on feature importance: we see that in almost every case removing the bottom features helps the ML strategy a bit. Maybe we need to remove more? Or apply a better algorithm for feature importance? Or, since we observe overfitting, maybe we should remove not only the bottom but also the top features? ;)
In the next blog post, we will dive much deeper into the notions introduced today. Instead of calculating those numbers on a single historical dataset, we will develop schemes for simulations and data augmentation and come up not with point estimates, but with probabilistic interpretations. Who knows, maybe our models performed badly on one realization of the time series, but at scale they are pretty reliable? Stay tuned and don’t forget to check out the source code :)
P.S.
If you found this content useful and insightful, you can support me on Bitclout. You can also connect with me on my Facebook blog or LinkedIn, where I regularly post AI articles or news that are too short for Medium, and on Instagram for more personal content :)