From Data to Dollars: Advanced Techniques in Trading Strategy Backtesting (I/II)

Alexdemachev
13 min read · Nov 26, 2023


Intro

When it comes to backtesting quantitative trading strategies, there’s a fine line between a rigorous scientific approach and blindly trusting that every strategy with a nice equity curve will yield a Ferrari with a blonde in the passenger seat. Now, imagine if Newton had “backtested” gravity by repeatedly dropping an apple on his head. Sure, he would have a bump to show for it, but would he really understand gravity better? Probably not, unless the apple’s descent was paired with meticulous observation and analysis, much like a trader scrutinizing a backtest.

So, let’s embark on a journey through the land of backtesting, where we’ll avoid the siren calls of overfitting and curve-fitting — those devious creatures that can make a strategy look as promising as a diet plan before Thanksgiving. We’ll learn to distinguish a truly robust strategy from a fluke, and we’ll do it with the rigor of a scientist and the skepticism of a cat offered a new brand of food. Remember, in the world of quant strategy backtesting, it’s not just about whether the apple falls; it’s about understanding why it didn’t go up instead.

Overfitting in Financial Backtesting

  1. Definition: Overfitting occurs when a statistical model or algorithm captures the noise of the data rather than the underlying pattern. In the context of financial backtesting, this happens when a trading strategy is excessively tailored to fit the historical data, making it perform exceptionally well in backtests but poorly in real-world trading.
  2. Causes:
  • Too Many Variables: Incorporating an excessive number of indicators or variables can lead to a model that is too complex for the actual data pattern.
  • Data Mining: Repeatedly testing various combinations of parameters and selecting the one with the best backtest performance.
  • Short Data Samples: Using a very limited period for backtesting, which may not be representative of different market conditions.
  3. Consequences:
  • False Confidence: Overfit models create an illusion of high performance and reliability, leading to misplaced confidence in the strategy.
  • Poor Future Performance: Strategies that are overfit to historical data often fail to adapt to new or changing market conditions.
Illustration from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3544431

Now that we understand the concept of overfitting, let’s move on to a concrete example and the backtest metrics that help spot the issue.

Strategy Setup for Financial Backtesting

Objective: The primary focus of this backtesting exercise is to evaluate a momentum-based trading strategy specifically designed for the BTCUSDT pair, utilizing 1-hour candlestick data. The strategy is fundamentally anchored in identifying volatility breakouts as indicated by Bollinger Bands, with predefined levels for taking profits.
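The exact rules aren’t reproduced in full here, but a minimal sketch of the entry logic just described, assuming a pandas Series of hourly closes and purely illustrative parameter names and defaults, could look like this:

```python
import pandas as pd

def bollinger_breakout_signal(close: pd.Series,
                              ma_period: int = 20,
                              ma_type: str = "sma",
                              entry_sigma: float = 2.0) -> pd.Series:
    """Return +1 on a breakout above the upper Bollinger Band, -1 below the
    lower band, 0 otherwise. Parameter names and defaults are illustrative;
    take-profit handling and intra-hour fills are omitted for brevity."""
    if ma_type == "ema":
        ma = close.ewm(span=ma_period, adjust=False).mean()
    else:
        ma = close.rolling(ma_period).mean()
    std = close.rolling(ma_period).std()
    upper, lower = ma + entry_sigma * std, ma - entry_sigma * std
    signal = pd.Series(0, index=close.index)
    signal[close > upper] = 1   # upside volatility breakout
    signal[close < lower] = -1  # downside volatility breakout
    return signal
```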

Data Utilization: For this analysis, the dataset comprises minute-level trading data sourced from the Binance spot market, spanning from the dataset’s inception in 2017 through November 2023. Although the primary analysis is conducted on hourly candles, minute-level data are essential for accurate trade simulation. This granularity is crucial for capturing intra-hour price movements, particularly the high and low prices, which are integral to the strategy’s efficacy.

Dataset Division: The dataset is split into two distinct sets for robust evaluation:

  • Training Set: Data ranging from August 2017 to January 2022, which will be used to develop and optimize the trading strategy.
  • Testing Set: Data from January 2022 to October 2023, reserved for validating the strategy’s performance and its adaptability to different market conditions.

Transaction Costs: A realistic approach is adopted by factoring in transaction costs and slippage. For each trade, a commission of 5 basis points (bps) is applied, equating to a total of 10 bps for a complete round-trip transaction. This consideration is critical for assessing the net profitability of the strategy in a real-world trading environment.

Parameter Optimization

The optimization process for this strategy involved fine-tuning five key parameters: the period and type of the moving average, two distinct take-profit levels, and the Bollinger Bands thresholds, which serve as our entry points.

To navigate the vast parameter space efficiently, we employed the Optuna optimization framework, conducting 100 individual trials to pinpoint the most effective combination. The results of this systematic search were encouraging, demonstrating the potential of the strategy under optimized conditions.
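A minimal Optuna setup for such a search might look like the following; `backtest_sharpe` and `train_data` are hypothetical stand-ins for the actual backtest and training data, and the search ranges are illustrative:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    params = {
        "ma_period": trial.suggest_int("ma_period", 10, 200),
        "ma_type": trial.suggest_categorical("ma_type", ["sma", "ema"]),
        "entry_sigma": trial.suggest_float("entry_sigma", 1.0, 3.0),
        "tp1": trial.suggest_float("tp1", 0.005, 0.05),
        "tp2": trial.suggest_float("tp2", 0.02, 0.10),
    }
    # backtest_sharpe is a placeholder for running the strategy on the
    # training set and returning its Sharpe ratio
    return backtest_sharpe(train_data, **params)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params, study.best_value)
```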

In-sample strategy performance

Monte Carlo Simulation in Financial Forecasting

The Essence of Monte Carlo Simulations: The Monte Carlo simulation method is a powerful statistical tool used to understand the behavior of systems affected by uncertainty. By creating a multitude of virtual scenarios, we can forecast the likelihood of various outcomes. This method is not just confined to the opulence of Monte Carlo’s gambling scene but extends to practical applications like financial forecasting.

Financial Application: In the realm of finance, Monte Carlo simulations are employed to generate potential future asset price trajectories. Each path we observe in the market is essentially one possible outcome from a plethora of probabilistic events. By assuming a known distribution for these outcomes, based on observed data, we can simulate the future paths an asset’s price might take.

Brownian Motion — The Mathematical Underpinning: Borrowing from the world of physics, financial theorists have adopted the concept of Brownian Motion to describe the random walk of asset prices. Although delving deep into the mathematical intricacies of Brownian Motion is beyond this article’s scope, it suffices to say that it provides the foundation for modeling asset prices with the stochastic differential equation

dS_t = μ·S_t·dt + σ·S_t·dW_t

Here, S represents the asset price, μ is the expected return (mean of returns), σ is the volatility (standard deviation of returns), and W is the Brownian Motion, the random component that introduces the stochastic nature to the equation.

Approach to Realism: While it’s convenient to consider μ and σ as constants, a more nuanced approach acknowledges their variability. In this analysis, we’ll adopt a more realistic model where μ and σ are time-varying. These parameters will be estimated on a rolling basis, using sets of 100 observations. This allows us to capture their evolution over time and create a series of N=100 simulated paths for the BTCUSDT price, incorporating the randomness introduced by the stochastic component W.
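A minimal sketch of this simulation, assuming a pandas Series of hourly closes; the window size and path count follow the text, while everything else is illustrative:

```python
import numpy as np
import pandas as pd

def simulate_paths(close: pd.Series, window: int = 100,
                   n_paths: int = 100, seed: int = 42) -> np.ndarray:
    """Simulate GBM-style paths with time-varying mu and sigma estimated
    on a rolling window of log returns (one step = one bar, so dt = 1)."""
    rng = np.random.default_rng(seed)
    log_ret = np.log(close).diff().dropna()
    mu = log_ret.rolling(window).mean().dropna().to_numpy()
    sigma = log_ret.rolling(window).std().dropna().to_numpy()
    # one standard normal shock per step and per path
    z = rng.standard_normal((len(mu), n_paths))
    # discretized GBM in log space: d(log S) = (mu - sigma^2 / 2) dt + sigma dW
    increments = (mu - 0.5 * sigma**2)[:, None] + sigma[:, None] * z
    return float(close.iloc[window]) * np.exp(np.cumsum(increments, axis=0))
```

Each column of the returned array is one simulated price path, ready to be fed through the backtest.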

Applying our estimated strategy parameters to these paths yields 100 strategy results:

Simulated price paths
Strategy results for simulated price paths
Sharpe distribution for simulated price paths

Strategic Application: By applying the optimized strategy parameters to these simulated paths, we can obtain a range of outcomes for the strategy’s performance. The results indicate an average Sharpe ratio of 1.9 across the simulations. This is notably lower than the in-sample (IS) Sharpe ratio of 3.1, with the 5th percentile standing at merely 1.3. Such a discrepancy suggests that the strategy, while robust under historical conditions, may not hold the same level of effectiveness under varied market scenarios, underscoring the importance of out-of-sample (OOS) testing in strategy validation.

Time Series Cross-Validation: Ensuring Model Integrity

Challenges of Time Series Data: Traditional K-fold cross-validation, with its random partitioning, is ill-suited for time series data because of the inherent temporal ordering of observations. Disrupting this order can lead to an evaluation that doesn’t faithfully represent a model’s predictive capabilities.

Adapting to Time Series:

  1. Sequential Partitioning: This method honors the time series nature of data by dividing it into contiguous sections, thereby preserving the chronological order and ensuring the validity of temporal patterns.
  2. Forward Chaining: This approach, akin to a rolling forecast, incrementally expands the training dataset fold by fold, aligning closely with how models are used in real-world settings, where predictions draw on all available past data (see the sketch after this list).
  3. Preventing Data Leakage: By preventing the incorporation of future data into the training process, time series cross-validation circumvents the risk of information leakage that could unrealistically bolster model performance.
  4. Holistic Evaluation: Performance metrics are computed for each fold and then synthesized to gauge the model’s effectiveness across the entire time period, offering a comprehensive view of its predictive power.
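A minimal sketch of the forward-chaining evaluation using scikit-learn’s TimeSeriesSplit; `returns`, `optimize_on`, and `evaluate_sharpe` are hypothetical placeholders for the data and the steps described above:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# returns: 1-D array of bar returns; optimize_on and evaluate_sharpe are
# placeholders for the Optuna search and the backtest, respectively
tscv = TimeSeriesSplit(n_splits=5)  # each fold trains on everything before it
oos_scores = []
for train_idx, test_idx in tscv.split(returns):
    params = optimize_on(returns[train_idx])
    oos_scores.append(evaluate_sharpe(returns[test_idx], params))
print(np.mean(oos_scores), np.std(oos_scores))
```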

Practical Application: Initially, the data from 2018 to 2020 served as the training set, with the subsequent years 2021 to 2022 designated for testing. The in-sample results for 2018–2020 show strong performance metrics. However, the out-of-sample results exhibit a reduction in the Sharpe ratio by 30%, hinting at the possibility of overfitting or perhaps just an unfavorable period for the strategy’s mechanics.

IS result
OOS result

Monte Carlo Analysis: A Monte Carlo simulation, conducted for the 2020–2021 period using the optimized parameters, reveals that while the mean Sharpe ratio aligns with the actual data, the lower percentile Sharpe approaches zero, signaling a risk of poor performance in adverse market conditions.

OOS Monte Carlo performance
OOS Monte Carlo Sharpe distribution

Expanding the Analysis: By extending the training period to include 2021 and using 2022 as the test set, we observed an even steeper decline in the Sharpe ratio out-of-sample, reinforcing concerns of overfitting. The corresponding Monte Carlo simulation mirrors this result, with the mean Sharpe ratio hovering around 2.2 and the lower percentile perilously close to zero.

IS performance
OOS performance
OOS Monte Carlo performance
OOS Monte Carlo Sharpe distribution

Conclusion: The walk-forward cross-validation approach coupled with Monte Carlo simulations paints a cautionary picture. The consistent decline in the out-of-sample Sharpe ratio, across both test sets, and the precarious lower percentiles observed in the simulations suggest an instability in the strategy’s performance when confronted with unseen data, underscoring the critical need for rigorous testing methodologies in strategy development.

Assessing Overfitting Risk: The Role of PBO

Understanding PBO: The Probability of Backtest Overfitting (PBO) offers a quantitative measure to gauge the risk that a trading strategy’s historical performance is inflated by overfitting. Developed by industry experts like David H. Bailey and highlighted in the work of Marcos López de Prado, PBO provides a statistical framework to assess the authenticity of a strategy’s predictive capabilities.

Methodology for PBO Calculation:

  1. Matrix Formation: The process begins by aggregating the performance data from N strategy trials into a matrix M, encapsulating a time series of outcomes for various model configurations.
  2. Matrix Partitioning: The matrix M is then divided into S submatrices with uniform row dimensions, ensuring that each submatrix contains a fragment of the performance data.
  3. Submatrix Analysis:
  • Submatrices are grouped into distinct combinations for analysis.
  • These combinations are then used to form training and testing datasets.
  • Performance metrics are calculated for each dataset.
  • The training set’s optimal strategy is identified.
  • The out-of-sample rank for this optimal strategy is determined.
  • A logit function is applied to quantify the consistency of performance between in-sample and out-of-sample datasets.
  4. PBO Estimation: The PBO value is derived by analyzing the distribution of out-of-sample ranks and determining the likelihood that strategies excelling in-sample fail to maintain their performance out-of-sample (a compact sketch follows this list).
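For the curious, here is a compact sketch of the combinatorially symmetric cross-validation (CSCV) procedure behind the PBO, following the structure above; `M` is assumed to be a NumPy matrix of per-period returns with one column per optimization trial, and S = 16 follows the original paper’s typical choice:

```python
import numpy as np
from itertools import combinations

def _sharpe(R: np.ndarray) -> np.ndarray:
    # per-configuration Sharpe; annualization omitted since ranks are scale-invariant
    return R.mean(axis=0) / R.std(axis=0)

def pbo(M: np.ndarray, S: int = 16) -> float:
    """Probability of Backtest Overfitting via CSCV (Bailey et al.).
    M: (T, N) matrix of per-period returns, one column per configuration.
    S: number of equal row blocks (must be even)."""
    T, N = M.shape
    blocks = np.array_split(np.arange(T), S)
    logits = []
    for train in combinations(range(S), S // 2):
        train_rows = np.concatenate([blocks[b] for b in train])
        test_rows = np.concatenate([blocks[b] for b in range(S) if b not in train])
        is_sr, oos_sr = _sharpe(M[train_rows]), _sharpe(M[test_rows])
        best = int(np.argmax(is_sr))                      # in-sample winner
        omega = np.sum(oos_sr <= oos_sr[best]) / (N + 1)  # its OOS relative rank
        logits.append(np.log(omega / (1 - omega)))
    # PBO: share of splits where the IS winner falls at or below the OOS median
    return float(np.mean(np.array(logits) <= 0))
```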

Interpreting the PBO for Our Strategy: With a calculated PBO of 0.18, there is an 18% likelihood that the impressive backtest results of our strategy are influenced by overfitting. This number presents a conundrum: while not indicative of a high risk, it is not negligible either. In scenarios where tolerance for risk is lower, even an 18% probability might be unacceptable, necessitating a more conservative approach to strategy validation.

Regression Analysis of Sharpe Ratios: Unveiling Overfitting

Approach: To further validate the trading strategy and probe for overfitting, a regression analysis of Sharpe ratios was conducted. This analysis utilizes the in-sample (IS) Sharpe ratios, derived from each parameter combination within the PBO framework, and their corresponding out-of-sample (OOS) Sharpe ratios.

Methodology:

Data Preparation: We aggregate the IS and OOS Sharpe ratios obtained during the PBO assessment process.

Statistical Regression: A linear regression is performed with IS Sharpe ratios as the independent variable and OOS Sharpe ratios as the dependent variable. This establishes a quantitative relationship between the two sets of data.
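For illustration, the regression step might look like this, assuming `is_sharpes` and `oos_sharpes` arrays collected during the PBO assessment:

```python
from scipy.stats import linregress

# is_sharpes / oos_sharpes: one pair per parameter combination, collected
# during the PBO procedure (names are illustrative)
result = linregress(is_sharpes, oos_sharpes)
print(f"slope={result.slope:.2f}, R^2={result.rvalue**2:.2f}, p={result.pvalue:.3f}")
# A negative slope means higher IS Sharpe tends to pair with lower OOS Sharpe,
# a classic symptom of overfitting.
```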

The regression results reveal a negative correlation between IS and OOS Sharpe ratios. This indicates that higher IS Sharpe ratios tend to be associated with lower OOS Sharpe ratios, suggesting a propensity for the strategy to overfit to historical data.

Interestingly, running the Sharpe regression of 2018–2020 IS on 2021 OOS from our optimization step yields the following encouraging results:

For the 2018–2021 IS and 2022 OOS the result is a bit worse:

Conclusion: Despite the mixed results from different periods, the initial regression demonstrates a general trend of overfitting. This is corroborated by the negative slope observed in the scatter plot, emphasizing the need for caution when interpreting high IS Sharpe ratios. Given the robustness of the PBO approach and the regression findings, the decision is to prioritize the initial Sharpe regression outcome as a more reliable indicator of the strategy’s predictive validity.

Beyond the Sharpe Ratio: Embracing Probabilistic Measures

The Traditional Sharpe Ratio’s Limitation: The Sharpe Ratio, while a widely used measure of risk-adjusted return, relies on the Central Limit Theorem and assumes a normal distribution of returns, which may not always hold, especially for hedge funds and alternative investments. As a point estimate, the Sharpe Ratio does not account for the shape of the return distribution, potentially masking non-normal characteristics such as skewness and excess kurtosis.

Inadequacy for Non-Normal Distributions: Returns in financial markets often exhibit skewed distributions, as depicted in the comparison between a typical Commodity Trading Advisor (CTA) trend-following program with positive skew and a typical options seller program with negative skew. Despite identical Sharpe Ratios, their risk profiles are markedly different due to the direction and degree of skewness — a factor the traditional Sharpe Ratio does not capture.

Reference: Red Rock Capital LLC

Incorporating Skewness and Kurtosis: The Probabilistic Sharpe Ratio (PSR), an innovation by Bailey and López de Prado, addresses this shortcoming. By integrating skewness and kurtosis, the PSR provides a probability estimate that the true Sharpe Ratio of a strategy exceeds a chosen benchmark, offering a nuanced view of the strategy’s performance.

Essentially, we are asking: “Given the distribution of our returns, how sure can we be that our true Sharpe lies above a certain value?”
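A minimal implementation of the PSR along these lines, following the published formula and assuming a 1-D array of per-period returns (note that an annualized benchmark such as 2 must first be rescaled to the same per-period units):

```python
import numpy as np
from scipy.stats import norm, skew, kurtosis

def probabilistic_sharpe(returns: np.ndarray, sr_benchmark: float) -> float:
    """PSR: probability that the true Sharpe exceeds sr_benchmark, adjusted
    for skewness and kurtosis. Both the estimated Sharpe and the benchmark
    are per-period here."""
    n = len(returns)
    sr = returns.mean() / returns.std(ddof=1)
    g3 = skew(returns)
    g4 = kurtosis(returns, fisher=False)  # normal distribution => 3
    denom = np.sqrt(1 - g3 * sr + (g4 - 1) / 4 * sr**2)
    return float(norm.cdf((sr - sr_benchmark) * np.sqrt(n - 1) / denom))
```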

Practical Application of PSR: In our analysis, by setting a benchmark Sharpe Ratio of 2, we applied PSR to the top five in-sample (IS) parameter specifications derived from the 2018–2022 training set. The PSR values ranged from 91% to 97%, indicating a high probability (at least 90%) that the observed IS Sharpe Ratios are indeed greater than 2. This is a heartening indicator of robustness, suggesting that the strategy’s risk-adjusted returns are not only statistically significant but also practically significant.

Conclusion: The Probabilistic Sharpe Ratio elevates the assessment of investment strategies by accounting for the shape of return distributions. This advanced metric assures us that the attractive Sharpe Ratios are not a mirage but have a high probability of indicating true superior performance, thus affirming the strategy’s promise.

The Deflated Sharpe Ratio: A Reality Check for Multiple Testing

The Allure of High Sharpe Ratios: Achieving a high Sharpe Ratio in backtesting can often be perceived as a triumph, a sign that the strategy may deliver superior risk-adjusted returns. However, this initial success may be concealing a statistical trap known as the Multiple Testing Problem.

Multiple Testing Problem Explained: The issue arises when a vast number of strategies are tested, increasing the odds of stumbling upon impressive results by sheer luck. This is akin to flipping a coin multiple times and eventually getting a run of heads, which, while notable, doesn’t imply a winning strategy but rather the play of probabilities.

The Deflated Sharpe Ratio (DSR): To mitigate this deceptive optimism, López de Prado and Bailey introduced the Deflated Sharpe Ratio. The DSR adjusts the Sharpe Ratio by accounting for the inflationary effect of multiple testing, the non-normality of returns, and the potential brevity of the data sample. It is designed to assess the statistical significance of a strategy’s Sharpe Ratio, offering a more sobering and accurate performance appraisal.
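Concretely, the DSR is the PSR evaluated at a deflated benchmark: the Sharpe ratio one would expect from the best of N unskilled, independent trials. A sketch, assuming `trial_sharpes` holds the Sharpe ratio of every optimization trial and reusing the `probabilistic_sharpe` function above:

```python
import numpy as np
from scipy.stats import norm

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def expected_max_sharpe(trial_sharpes: np.ndarray) -> float:
    """Sharpe one would expect from the best of N unskilled trials, given the
    number and dispersion of trials; used as the DSR's deflated benchmark."""
    n = len(trial_sharpes)
    sr_std = trial_sharpes.std(ddof=1)
    return sr_std * ((1 - EULER_GAMMA) * norm.ppf(1 - 1 / n)
                     + EULER_GAMMA * norm.ppf(1 - 1 / (n * np.e)))

# DSR = PSR evaluated at the deflated benchmark instead of a fixed target:
# dsr = probabilistic_sharpe(returns, expected_max_sharpe(trial_sharpes))
```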

Degradation of the DSR for a given Sharpe ratio as the number of independent trials grows

Empirical Findings Using DSR: When the DSR was applied to the top five performing parameter sets with a benchmark Sharpe Ratio of 2, the results were startlingly different from the Probabilistic Sharpe Ratio (PSR). The DSR indicated only a 2% to 5% probability that these strategies genuinely outperform the benchmark, a stark contrast to the 91% to 97% confidence levels suggested by the PSR.

Interpretation and Conclusion: The stark discrepancy between the PSR and DSR outcomes suggests that our strategy may indeed be a victim of the Multiple Testing Problem, with the initially high Sharpe Ratios potentially inflated by the sheer volume of optimizations rather than reflecting true predictive power. This underscores the critical importance of accounting for multiple testing when evaluating the performance of backtested strategies to avoid being misled by spurious results.

Conclusion: Synthesizing the Evidence of Overfitting

Evaluating Overfitting Risks: In the quest to develop a robust trading strategy, we’ve employed a battery of analytical techniques, each contributing a unique perspective on the strategy’s potential overfitting. Below is a summary table with an overfitting score assigned to each technique, where a score of 10 suggests a high probability of overfitting, and a score of 1 suggests a negligible risk:

Synthesis of Results: The collective insights from these methods point towards a notable risk of overfitting in our strategy. While the PSR initially suggested a low risk, the high scores associated with other methods, particularly the DSR, underscore the importance of applying multiple validation techniques to capture a comprehensive risk profile.

What’s Next: The critical query now is: what will the Out-Of-Sample (OOS) performance unveil about our strategy’s true efficacy? This will be the focus of the second part of our exploration, where we’ll dissect the OOS results and delve into strategies to mitigate overfitting risks.

A Call for Feedback: I hope you’ve found the first part of this analysis insightful. As this is my inaugural article, I warmly invite any feedback or suggestions. Feel free to connect and share your thoughts on LinkedIn at Alexander Demachev or in comments below.
