Deep Dive (Part 2) — Analyzing your Trading Performance with Statistics
This is part 2 in a multi-part series on evaluating trading strategies. For part 1, click here.
Part 2 will discuss a couple of common mistakes that traders make when backtesting, which metrics are more “noise” than “signal,” and how to use them properly.
Note: none of the information in this blog should be considered financial advice. Do not invest more than you’re comfortably willing to lose.
Metrics and their Role
In order to measure our trading performance we need to choose from any number of available metrics, but which ones are most telling? In a previous blog series we discussed Risk Management and how to determine whether or not a trade was worth taking, and therein we discussed several metrics for evaluating risk. It’s definitely worth taking a look if you haven’t already.
But when it comes to backtesting and evaluating performance, we need more than just risk metrics — we need performance metrics.
These metrics need to tell us more than just the risk we’re assuming — we need to know how our model performs over time, what the best and worst case scenarios were during testing, whether or not it outperforms the “buy and hold” method, how volatile the strategy is, and much more.
As discussed in Part 1, it is important to backtest across numerous timeframes and windows to ensure our model isn’t inherently biased by the long-term upward trend of the markets. It is tempting to begin with the average performance of your model across numerous windows, but there’s a major drawback to using the average.
Average vs. Median Performance
Many of you will know this already, but the mean (average) of a series is strongly influenced by outliers. For example, the average income of a population is generally considered a poor measure of what a “normal” individual earns in a given year, because some people are unemployed and earn $0 annually, while others are fortunate enough to earn millions, tens of millions, or even hundreds of millions of dollars a year.
These outliers strongly influence the average, but to get an idea of what the “middle” actually is, we want to use the median instead. The median is exactly what it sounds like: given a number of backtesting results, if you order them from worst to best performance, the median is the result that sits exactly in the middle (with an even number of results, it is the average of the two middle ones).
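To make the difference concrete, here is a minimal Python sketch. The returns in window_returns are made-up numbers for illustration only, not results from our backtests: most windows lose a little, and two windows win big.

```python
from statistics import mean, median

# Hypothetical ROI (%) for nine backtesting windows; illustrative values only.
window_returns = [-4.0, -3.5, -3.0, -3.0, -3.0, -2.5, -2.0, 30.0, 36.0]

# The mean is pulled upward by the two large winning windows.
average_roi = mean(window_returns)

# The median is the middle value once the results are sorted.
median_roi = median(window_returns)

print(f"average ROI: {average_roi:.2f}%")
print(f"median ROI:  {median_roi:.2f}%")
```

With this invented data the average is positive even though seven of the nine windows lost money, while the median lands on a typical losing window.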
In Part 1 we investigated the average performance of our strategy:
However, that average was taken across all backtesting windows, and it was heavily influenced by the few windows in which our strategy made quite a bit of profit. Even by eye, we can see that there are many windows in which our strategy returned a loss.
What happens when we highlight the median performance versus the average performance?
We can see that if you were to line up all the results from lowest to highest, the middle result is roughly -3% ROI. In other words, in half of the windows the model performed worse than -3%, and in the other half it performed better.
What does this mean? For starters, it means that the “impressive” 5% ROI we talked about in Part 1 is actually far less impressive once we start to investigate further.
“But what about the median BTC performance?” I can hear you asking. Let’s take a look:
When looking at the median BTC performance, buy and hold is still a better option than our strategy. Is there any redemption for our trading strategy, then?
Volatility as a Measure
Let’s assume that one day you wake up, drink the perfect amount of coffee, and build an excellent strategy whose median performance outperforms the index by a wide margin — is it time to deploy?
Well, depending on your time frame and how long you plan on keeping your bot live, the volatility of your strategy will have a major influence on performance.
If you find that your median performance is highly positive but your strategy swings wildly between profit and loss, you might not want to deploy just yet. The volatility of a trading strategy’s returns is usually measured by their standard deviation.
For those who are unfamiliar, the standard deviation is a measure of the amount of variation within a set of values. To calculate the standard deviation of a set of backtests, we first determine the average return over the series of backtests.
Once we have the average performance, we subtract the average return from each window’s result, and then square each difference.
We then add up these squared differences and divide by the total number of backtesting windows (i.e. take their mean), and finally take the square root of that mean, which gives us the standard deviation. What is nice about this metric is that the standard deviation is expressed in the same units as the contributing data, so if you’re measuring the SD for strategy performance in percentage ROI, your SD will also be measured in percentage ROI.
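The steps above can be sketched in a few lines of Python. This is only an illustration: the function name standard_deviation and the sample returns are our own inventions, not output from any backtest.

```python
import math
from statistics import pstdev

def standard_deviation(returns):
    """Standard deviation computed with the steps described in the text."""
    # Step 1: average return across all backtesting windows.
    average = sum(returns) / len(returns)
    # Step 2: subtract the average from each result and square the difference.
    squared_diffs = [(r - average) ** 2 for r in returns]
    # Step 3: take the mean of the squared differences (the variance).
    variance = sum(squared_diffs) / len(squared_diffs)
    # Step 4: the square root of the variance is the standard deviation.
    return math.sqrt(variance)

# Hypothetical ROI (%) per backtesting window; illustrative values only.
returns = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sd = standard_deviation(returns)
print(f"SD: {sd:.2f}% ROI")

# Cross-check against the standard library's population standard deviation.
print(f"pstdev: {pstdev(returns):.2f}% ROI")
```

Because the sum is divided by the number of windows, this is the population standard deviation, which matches statistics.pstdev. The closely related statistics.stdev divides by one less than the number of windows and will give a slightly larger value.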
If your strategy has a low standard deviation, it means that its performance is relatively stable, whereas a high standard deviation means that the performance is all over the place. Let’s take a look at the backtesting results of two strategies that we haven’t seen before. They are fundamentally different from the ones we’ve looked at so far, and are much more stable in their performance.
We can see that one strategy has most of its results clumped in a specific range, whereas the other is quite volatile. These two strategies are fundamentally different in how they operate, yet one seems to not only perform better but also be less volatile.
Strategy 1’s backtesting performance across 25 windows resulted in an average performance of 0.90% and a standard deviation (SD) of 1.50%, while Strategy 2 had an average performance of 0.62% and an SD of 1.95%.
But what does this mean exactly? For Strategy 1, if you assume the backtest results follow a bell curve (a normal distribution), roughly 68% of them (and therefore possibly of future results) will fall within plus or minus 1.50% of the average, giving a window of -0.60% to +2.40% ROI per month.
For Strategy 2, that window is wider, from -1.33% all the way up to 2.57%, which could indicate that Strategy 1 will have more consistently positive returns than Strategy 2.
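The arithmetic behind these windows is simple enough to check in code. The sketch below uses the averages and SDs quoted in the text; the helper name one_sd_band is our own, and the bell-curve (normal distribution) assumption is exactly that: an assumption that real backtest results may not satisfy.

```python
import math

def one_sd_band(average, sd):
    """Return (low, high): the range one standard deviation either side of the average."""
    return round(average - sd, 2), round(average + sd, 2)

# Average ROI (%) and SD (%) as quoted in the text for each strategy.
strategy_1 = one_sd_band(0.90, 1.50)
strategy_2 = one_sd_band(0.62, 1.95)

# Share of a normal distribution that lies within one SD of its mean.
share_within_one_sd = math.erf(1 / math.sqrt(2))

print(f"Strategy 1: {strategy_1[0]:+.2f}% to {strategy_1[1]:+.2f}%")
print(f"Strategy 2: {strategy_2[0]:+.2f}% to {strategy_2[1]:+.2f}%")
print(f"Share within one SD (normal assumption): {share_within_one_sd:.1%}")
```

Under the normal assumption, about 68.3% of results fall inside each band. With only 25 windows per strategy, it is worth treating that share as a rough guide rather than a guarantee.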
In this case, which should be deployed? That is something you must determine yourself, as you must determine what your appetite for risk is (click here to learn more about risk management) and what your time frame is.
In the next post we will start diving deeper into proper backtesting window selection, whether you should backtest with or without replacement, and how this will lead to one of our most powerful tools: the binomial test, and estimating future performance.
In the meantime, be sure to check out ArcTaurus — an automated no-code solution for building cryptocurrency trading bots. We allow you to build and deploy custom strategies without writing a single line of code! Check out our website and our Linktree for more information, and get started on building powerful trading strategies today!