Over the past week, we focused on the work of Marcos Lopez de Prado, which consists of his main book (Advances in Financial Machine Learning) and various papers covering similar ideas (such as “The 10 Reasons Most Machine Learning Funds Fail,” published in The Journal of Portfolio Management).
We found his ideas and suggestions incredibly insightful, confirming many of our own findings and providing us with new, interesting perspectives (such as fractional differencing). Throughout his work, Lopez de Prado puts significant emphasis on the 10 problems he identifies as the major failure points of machine learning-focused funds.
In this post, we would like to look into these 10 challenges and provide summaries, while also including some of our perspectives on each point. We agree with most of Lopez de Prado’s points; however, there are a few additions from our experience that we think could be useful to consider alongside the original solutions.
The 10 challenges are:
- Working in Silos
- Research Through Backtesting
- Chronological Sampling
- Integer Differentiation
- Fixed-Time Horizon Labeling
- Learning Side and Size Simultaneously
- Weighting of Non-Independent Identically Distributed Samples
- Cross-Validation Leakage
- Walk-Forward (or Historical) Backtesting
- Backtest Overfitting
Working in Silos
The idea of every portfolio manager working independently may be useful in discretionary environments; however, it certainly doesn’t work for quantitative funds. Strategies in quant finance are intricate and require knowledge spanning various domains (data curation, processing, infrastructure, software development, feature analysis, execution simulation, backtesting, etc.). Expecting quants to produce strategies in silos is not a challenge but likely an impossible task.

The solution is to set up a research factory — to divide the goal of producing a profitable quant strategy into subroles (divided horizontally by domain). With this approach, quants can specialize in parts of the process with individual KPIs per role, instead of tackling the whole quest by themselves.
From our perspective, distributing roles strategically is definitely a good approach. However, there might be a large drop-off in performance after a certain degree of distribution, since processes become unmanageable and likely require additional talent to oversee. A high degree of role-distribution also likely affects the speed of execution and agility.
Research Through Backtesting
Applying backtests over and over on a strategy leads to overfitting and unrealistic outcomes in production. Whether walk-forward or out-of-sample, re-running backtests after seeing results and tweaking parameters introduces bias and produces false positives.
The solution is feature importance analysis, rather than backtest analysis. After identifying the most important features used by the classifier, it is simple to conduct further specific research. This way, research is done at the data/input level and not at the result/output level.
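As an illustration of input-level research, here is a minimal sketch of permutation feature importance (one of several feature-importance methods discussed in the book). `model_score` is a hypothetical stand-in for any fixed scoring call; the toy model and data are illustrative assumptions.

```python
import random

# Shuffle one feature column at a time and measure how much a fixed
# model's score degrades; a large drop means the feature mattered.

def permutation_importance(model_score, X, y, seed=0):
    rng = random.Random(seed)
    base = model_score(X, y)
    importances = []
    for j in range(len(X[0])):
        shuffled = [row[:] for row in X]          # copy rows
        column = [row[j] for row in shuffled]
        rng.shuffle(column)                        # destroy column j's information
        for row, v in zip(shuffled, column):
            row[j] = v
        importances.append(base - model_score(shuffled, y))
    return importances

def score(X, y):
    """Toy fixed 'model': predict the sign of feature 0."""
    preds = [1 if row[0] > 0 else -1 for row in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

X = [[1, 5], [-1, 2], [1, 7], [-1, 1]]
y = [1, -1, 1, -1]
imp = permutation_importance(score, X, y)
```

Since the toy model ignores feature 1, shuffling that column never changes the score, so its importance is exactly zero.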
The more features, and the more advanced the classifier, the harder it becomes to find a global maximum of good feature sets. Training the model twice on the same features might produce completely different feature importances (especially when using drop-out regularization or limited leaf depth).
From our experience, applying input research and output research in alternating cycles produces the best outcomes. Research should certainly begin at the input level. Then, however, by running multiple backtests per cycle (using a static parameter set) and aggregating their outcomes, the averaged results can be used to optimize for a better fit between inputs (feature selection) and outputs (parameter combinations).
Chronological Sampling
One way to store financial data in a regularized format is aggregating it by time. This, however, is not ideal, since markets do not process information at a constant time interval. Time bars oversample information during low-activity periods and undersample information during high-activity periods. Data aggregated by time exhibits poor statistical properties, such as serial correlation, heteroscedasticity/outliers, and non-normality.

The solution is to aggregate bars by properties, such as ticks, volume, or — even more effective — dollar volume. Denominating volume in dollars cancels out the price dimension and retains only the quantity dimension (which can be further adjusted, for example, by market capitalization or outstanding issued debt).
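To make the aggregation concrete, here is a minimal sketch of dollar-bar sampling from raw ticks. The tick format and the fixed threshold are illustrative assumptions, not the book's implementation.

```python
# A new bar closes once the accumulated dollar volume (price * size)
# crosses a fixed threshold, so bars form faster when activity is high.

def dollar_bars(ticks, threshold):
    """ticks: iterable of (price, size) tuples; threshold: dollars per bar."""
    bars, accum, prices = [], 0.0, []
    for price, size in ticks:
        accum += price * size
        prices.append(price)
        if accum >= threshold:
            bars.append({
                "open": prices[0],
                "high": max(prices),
                "low": min(prices),
                "close": prices[-1],
                "dollar_volume": accum,
            })
            accum, prices = 0.0, []
    return bars

ticks = [(100.0, 50), (100.5, 40), (99.8, 80), (100.2, 30), (101.0, 60)]
bars = dollar_bars(ticks, threshold=10_000)
```

In practice the threshold is often set dynamically (e.g., as a fraction of the recent average daily dollar volume) rather than as a constant.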
According to our extensive experiments and tests (which confirm the findings of this analysis: https://towardsdatascience.com/ai-for-algorithmic-trading-rethinking-bars-labeling-and-stationarity-90a7b626f3e1), time bars do not perform significantly better or worse than tick, volume, or dollar bars.
+------------+--------------------+-------------+---------+
|            | Serial correlation | Jarque-Bera | Shapiro |
+------------+--------------------+-------------+---------+
| Time bar   | -0.079             | 78693       | 0.7924  |
| Tick bar   | -0.0329            | 17138       | 0.8961  |
| Volume bar | -0.0226            | 161134      | 0.8318  |
| Dollar bar | -0.0377            | 22318       | 0.8763  |
+------------+--------------------+-------------+---------+

When very few features are used, dollar and tick bars perform better than time and volume bars. However, simply by combining features and including volume, tick, and dollar data in the model, we achieve much higher performance than with any individual sampling method. From our perspective, it isn’t even necessary to aggregate one time series (such as price) by another (such as time, volume, ticks, or dollars). As our experiments show, it might be just as good (and perhaps even better) to derive features from all time series independently and combine them horizontally.
Integer Differentiation
Time series in finance are generally non-stationary. A common approach is to induce stationarity by first-order differencing (changes or log changes). This, however, comes at the expense of removing memory from the original series. Removing information through first-order integer differentiation reduces the predictive power of the data.
The solution is to differentiate partially, which allows for retaining memory while making the series sufficiently stationary. This can be achieved using fractional differencing, as explained in detail with code examples in the book/papers.
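The weights of the fractional difference operator can be generated recursively; a minimal sketch (the order `d` is a free parameter between 0 and 1 in the fractional case):

```python
# Weights w_k of (1 - B)^d, where B is the backshift operator:
# w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k.
# The differenced series is x~_t = sum_k w_k * x_{t-k}.

def fracdiff_weights(d, n):
    """First n weights of the fractional difference operator of order d."""
    weights = [1.0]
    for k in range(1, n):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights

# With d = 1 this collapses to plain first differencing: [1, -1, 0, ...].
w = fracdiff_weights(1.0, 4)
```

For fractional `d` the weights decay slowly rather than cutting off, which is exactly how the series retains long memory.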
Like the sampling methods above, fractional differencing appears extremely useful in theory compared to traditional first-order differencing or raw time series. In practice, however, our findings show that applying fractional differencing is computationally expensive (especially when trading hundreds of models in real time and being required to generate high-frequency signals) and can be replaced by simply applying diffs (or log diffs) of different lengths and combining them as standalone features. This not only improves efficiency by multiple orders of magnitude but also improves performance in practice.
Additionally, we would like to propose a rolling window quantile/rank method, which retains memory and is stationary, since values are within 0 and 1 over any given window period.
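The rolling rank idea can be sketched in a few lines (the window length is an illustrative choice; a pure-Python version for clarity rather than speed):

```python
# Each value is replaced by its rank within the trailing window,
# scaled to [0, 1] -- bounded (stationary) yet memory-retaining.

def rolling_rank(series, window):
    out = []
    for i in range(window - 1, len(series)):
        chunk = series[i - window + 1 : i + 1]
        # rank of the current value among the window's values (0-based)
        rank = sum(1 for v in chunk if v <= series[i]) - 1
        out.append(rank / (window - 1))
    return out

prices = [10, 12, 11, 15, 14, 13]
ranks = rolling_rank(prices, window=3)
```

A value of 1.0 means the current observation is the highest in its window, 0.0 the lowest, regardless of the price level.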
Fixed-Time Horizon Labeling
A common approach in ML papers is to use fixed-time horizon binary labels (-1 for returns below a threshold, 0 for returns within a threshold, and +1 for returns above a threshold). The challenge with this approach is that 1) time bars don’t exhibit good statistical properties, 2) the usage of a constant threshold despite changing volatility is not optimal, and 3) the actual return might be different since positions could have been stopped-out or exited (take profit).
The solution is using the Triple-Barrier method. This allows for setting event-specific labels after a certain time interval (-1 for stopped out, +1 for exited via take profit, 0 for no event/time exit, or alternatively the sign of return after the time exit). Stop/exit targets are set dynamically (based on volatility), rather than being pre-defined as a static threshold.
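A simplified sketch of Triple-Barrier labeling for a single event (the book's implementation is event-based and more general; the volatility input, barrier multipliers, and horizon here are illustrative assumptions):

```python
# Upper barrier = profit-take, lower barrier = stop-loss, both scaled by
# a volatility estimate; `horizon` is the vertical (time) barrier in bars.

def triple_barrier_label(prices, start, vol, horizon, pt_mult=1.0, sl_mult=1.0):
    """Return +1 (profit-take hit), -1 (stop-loss hit), or 0 (time exit)."""
    entry = prices[start]
    upper = entry * (1 + pt_mult * vol)
    lower = entry * (1 - sl_mult * vol)
    for price in prices[start + 1 : start + 1 + horizon]:
        if price >= upper:
            return 1
        if price <= lower:
            return -1
    return 0  # vertical barrier reached without touching a horizontal one

prices = [100, 101, 99.5, 103, 102]
label = triple_barrier_label(prices, start=0, vol=0.02, horizon=4)
```

In the full method, `vol` would itself be a rolling estimate (e.g., an exponentially weighted standard deviation of returns) rather than a constant.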

Critically, this method assumes that stops are necessary (in our experience, trading smaller bet sizes in less correlated markets works out better). Furthermore, we think it is generally not a good idea to introduce execution properties (entry, exit, stop) at the machine learning step. By encoding volatility into labels (AND time horizon, AND execution properties), the labels increase in degrees of complexity and become harder for the model to learn.
The idea in itself is solid; however, we would like to propose splitting up prediction of returns (or the prediction of side and then accuracy of side in a second model, as proposed by Lopez de Prado in his following point) and prediction of volatility, which can be used to approximate execution parameters, such as exit (and stop, if necessary).
It should be pointed out that execution parameters should not be defined through complex (machine-learned) functions, nor should they be pre-set (in the Triple-Barrier method, they are pre-set based on a dynamic volatility multiplier). Setting exit/stop levels dynamically based on volatility is certainly better than setting them statically, from our experience. However, these parts are much better determined empirically.
Learning Side and Size Simultaneously
Learning side and size overcomplicates models and is a logically wrong approach. Side is a strictly fundamental decision (based on quantitative data), while size is a risk management decision (based on the individual situation).
The solution is to build a side model first (achieving high recall with low precision) and then a secondary accuracy model correcting for the low precision of the first model. The secondary model “confirms” the first model — to what degree/probability it is true or false.
Applying meta-labeling 1) allows a black-box model to be turned into a separated and logically understandable model, 2) avoids overfitting and classifies better, since the two models have independent functions, 3) enables the creation of more sophisticated strategies (e.g., using different side models for different market regimes), and 4) avoids high accuracy on small moves and low accuracy on large moves (which have a big impact on the portfolio).
From our perspective, there is a simple way to combine side and size into one model (which works incredibly well and avoids the duplicate computation of running two models). It might be as simple as choosing regression over classification.
By transforming labels between -1 and 1 (or 0 and 1, depending on the classifier), the predictions of the regression-model can be used as probability for each side. Depending on what the execution strategy aim is (focus on low-frequency big returns vs. high-frequency small returns), it might be useful to apply additional winsorizing/inner-cutoffs before transforming the labels to improve the intended evaluation metric of the model.
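A minimal sketch of this label transformation and the decoding of a regression output into side and size (the cutoff and minimum-size threshold are illustrative assumptions):

```python
# Forward returns are winsorized and rescaled to [-1, 1]; a regression
# prediction then encodes side (its sign) and size (its magnitude).

def make_labels(returns, cutoff_abs):
    """Clip returns to +/- cutoff_abs, then scale to [-1, 1]."""
    clipped = [max(-cutoff_abs, min(cutoff_abs, r)) for r in returns]
    return [c / cutoff_abs for c in clipped]

def side_and_size(prediction, min_size=0.1):
    """Decode a regression output into a trade side and bet size."""
    size = abs(prediction)
    if size < min_size:
        return 0, 0.0          # signal too weak: no trade
    return (1 if prediction > 0 else -1), size

labels = make_labels([0.01, -0.03, 0.002, 0.05, -0.01], cutoff_abs=0.03)
side, size = side_and_size(0.6)
```

Tightening or widening `cutoff_abs` is one way to bias the model toward high-frequency small moves or low-frequency large moves, per the execution aim described above.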
Weighting of Non-Independent Identically Distributed Samples
In financial ML, labels are decided by outcomes, and outcomes are decided by multiple observations. Since labels are not independent and overlap in time, it is not possible to map cause (feature) and effect (label) distinctively.
Overlapping labels make random bootstrapping inefficient, since in-bag samples are very similar to out-of-bag samples, due to high serial correlation in financial data.
The solution is to form a probability array for each observation indicating the uniqueness of its label. Rather than drawing all samples simultaneously, samples can then be drawn sequentially with controlled behavior: by adjusting for the probabilities of overlap, observations with low uniqueness relative to those already selected become less likely to be drawn.
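A minimal sketch of this sequential bootstrap, simplified from the book's indicator-matrix formulation (the label spans below are toy data):

```python
import random

# Each label covers a span of bar indices; draw probabilities are
# proportional to a label's average uniqueness given what is in-bag.

def avg_uniqueness(span, counts):
    """Mean of 1/(1 + overlap count) over the bars the label spans."""
    return sum(1.0 / (1 + counts[t]) for t in span) / len(span)

def sequential_bootstrap(spans, n_draws, seed=0):
    rng = random.Random(seed)
    n_bars = max(t for span in spans for t in span) + 1
    counts = [0] * n_bars                 # how often each bar is already in-bag
    drawn = []
    for _ in range(n_draws):
        weights = [avg_uniqueness(span, counts) for span in spans]
        pick = rng.choices(range(len(spans)), weights=weights)[0]
        drawn.append(pick)
        for t in spans[pick]:
            counts[t] += 1                # overlapping labels now less likely
    return drawn

# Three labels: the first two overlap heavily, the third is disjoint.
spans = [range(0, 5), range(2, 7), range(10, 15)]
sample = sequential_bootstrap(spans, n_draws=3)
```

After the first draw, any label overlapping it has its weight reduced, so the in-bag set is more unique than a plain bootstrap would produce.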
From our perspective, this is an incredibly important point for financial time series. It is especially useful to manually control the bootstrapping behavior when the available data is limited. This challenge is negligible, however, when working with higher-frequency data where labels span minute-intervals, rather than days or months. It also should be noted that existing (more sophisticated) ML frameworks offer in-built solutions for this challenge.
Cross-Validation Leakage
K-fold cross-validation fails with financial time series, since observations are not drawn from an independent distribution. Traditional splitting of data into training and testing sets doesn’t work either, since financial data is serially correlated, causing leakage that leads to false discoveries.

The solution is purging and embargoing. Purging eliminates overlapping observations between the training and testing set. Embargoing enforces a window/distance of certain length between the split of training and testing data to avoid any possible information leakage.
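A minimal sketch of a purged, embargoed train/test split (the label-span length and embargo width are illustrative assumptions):

```python
# Drop training observations whose label spans overlap the test window
# (purge), then drop an extra buffer just after it (embargo).

def purged_train_indices(n, test_start, test_end, label_span, embargo):
    """Indices usable for training given a test block [test_start, test_end)."""
    train = []
    for i in range(n):
        if test_start <= i < test_end:
            continue                                  # test observation
        if i < test_start and i + label_span > test_start:
            continue                                  # purge: label leaks into test
        if test_end <= i < test_end + embargo:
            continue                                  # embargo buffer after test
        train.append(i)
    return train

train = purged_train_indices(n=20, test_start=8, test_end=12,
                             label_span=3, embargo=2)
```

With a label span of 3 bars, observations 6 and 7 are purged (their labels reach into the test block), and 12 and 13 fall inside the embargo.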
Similar to the previous point, this challenge may be insignificant, depending on the size of the largest interval (susceptible to leakage) compared to the total data size. Assuming 5 years of daily-bar data is available, while the features include a 30-day moving average, the necessity for purging/embargoing is quite high (using ML in this kind of low-frequency and overlapping data is probably a bad idea in general). Considering another case, where 5 years of minute-bar data is available while the longest window allowing for leakage is only 30 minutes, this challenge becomes rather irrelevant.
Walk-Forward (or Historical) Backtesting
Walk-forward backtesting tests a single historical path and can therefore easily overfit. Furthermore, it is not representative of future performance, since results are based on one particular sequence of data points. Lastly, walk-forward typically uses a large part of the data for training, leaving WF backtests with only a smaller portion for testing.
The solution is combinatorial purged cross-validation (CPCV). This means simulating a large number of scenarios, where each scenario provides an individual backtest path (and thus also allows all parts of the data to be used for backtesting uniformly). Data is partitioned into a number of possible train/test splits based on different windows of the original data. Testing many (hundreds or thousands of) backtest paths makes it possible to obtain a distribution of Sharpe ratios (as opposed to a single one, which is likely overfit).
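The combinatorics behind the path count can be sketched in a few lines (the group counts below are illustrative; the formula follows the CPCV construction):

```python
from math import comb

# With N groups and k of them held out per split, there are C(N, k)
# splits; each group appears in C(N, k) * k / N test sets, which is
# also the number of full backtest paths that can be stitched together.

def cpcv_counts(n_groups, k_test):
    n_splits = comb(n_groups, k_test)
    n_paths = n_splits * k_test // n_groups
    return n_splits, n_paths

splits, paths = cpcv_counts(n_groups=6, k_test=2)
```

So even a modest partition of 6 groups with 2 test groups per split already yields 15 train/test splits and 5 distinct backtest paths instead of one.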
In our view, walk-backward backtesting is generally not a good idea, since it assumes markets don’t evolve and all market participants act independently of previous participants (which is certainly not the case: if actor A changes his strategy, actors B and C will be required to adapt). Therefore, combinatorial cross-validation using more recent training than testing intervals seems just as questionable as using a walk-backward backtest.
We believe a single-path walk-forward backtest is the most reliable solution, as long as backtests are not used for research (pointed out by Lopez de Prado in his second challenge) and the testing period consists of multiple cycles, depending on the chosen data frequency. Walk-forward on daily data is biased and easily overfit, since cycles on daily timeframes span multiple years. In such cases, aggregated CPCV results are likely better than individual WF/WB backtests, in our opinion. However, the more robust solution would be to simply use higher-frequency approaches.
Backtest Overfitting

Backtests are easier to overfit than one might anticipate. Assuming a true Sharpe ratio of zero, it takes only 1,000 independent backtest iterations to discover a false positive with a Sharpe ratio of 3.26! The reason for this is that a large portion of decisions is based on a small portion of the dataset.
The solution is to control the number of backtest trials/iterations to be able to offset the discovered Sharpe against the false discovery rate. It is recommended to incorporate the number of trials into the Sharpe ratio, using the proposed deflated Sharpe ratio, which adjusts the estimated Sharpe ratio by the probability that SR (estimated) is greater than SR (hypothetical) after a certain number of trials.
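The expected-maximum-Sharpe benchmark and the deflated Sharpe ratio can be sketched as follows. The formulas follow Bailey and Lopez de Prado's derivation; the trial variance of 1.0, the observation count, and the moment inputs are illustrative assumptions.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def expected_max_sharpe(n_trials, var_trials):
    """Expected maximum Sharpe across n_trials strategies with true SR = 0."""
    ndist = NormalDist()
    return sqrt(var_trials) * (
        (1 - EULER_GAMMA) * ndist.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * ndist.inv_cdf(1 - 1 / (n_trials * e))
    )

def deflated_sharpe(sr, sr0, n_obs, skew=0.0, kurt=3.0):
    """Probability that the observed SR truly exceeds the trials benchmark sr0."""
    z = (sr - sr0) * sqrt(n_obs - 1) / sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf(z)

sr0 = expected_max_sharpe(n_trials=1000, var_trials=1.0)
dsr = deflated_sharpe(sr=3.26, sr0=sr0, n_obs=1250)
```

Note that with 1,000 trials and unit trial variance, the benchmark `sr0` lands near 3.26 — the false-positive figure quoted above — so an observed Sharpe of 3.26 deflates to roughly a coin flip.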
We find the proposed solution extremely useful. We would like to add another interesting way to avoid backtest overfitting: replacing individual backtests with aggregates of multiple backtests (based on static parameter sets), thereby drawing insight from the distribution of metrics (e.g., the distribution of Sharpe ratios across all parameter sets) and accounting for the robustness of the model under slightly different parameters/market circumstances.
Resources
- https://github.com/hudson-and-thames/presentations/blob/master/Quantcon%202018%20NYC.pdf
- http://www.smallake.kr/wp-content/uploads/2018/07/SSRN-id3104816.pdf
- https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3197726
- https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3120557
- https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3193702
- https://www.wiley.com/en-us/Advances+in+Financial+Machine+Learning-p-9781119482086
- https://www.oreilly.com/library/view/advances-in-financial/9781119482086/

