Backtesting bias and how to avoid it

Weekly Stocktip
8 min read · Jan 21, 2020

Backtesting is the process of calculating how an investment or trading strategy would have fared historically. It is widely used to test and reject models and strategies.

But too often, backtested strategies fail in live use. This can be due to any number of reasons: a flawed strategy, dependence upon correlations that cease to exist, low-quality data, market or liquidity issues, or bias in the backtesting process.

In the following sections, we will attempt to shed some light on the most common problems that can occur when doing historical research, and how we prevent such problems from occurring at Weekly Stocktip.

Survivorship Bias

One of the most common backtesting biases is survivorship bias. Many of the low-quality, cheaper data sets commonly used for backtesting contain only the companies that survived over time; backtests run on them therefore report above-average results. Our system explicitly includes all companies in the period that were later liquidated, sold, or merged, entered Chapter 11, became micro-caps, and so on. No data survivorship bias of any form takes place.
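To make the effect concrete, here is a minimal sketch (with entirely hypothetical tickers and returns) of how excluding delisted companies inflates a backtest's average return:

```python
import pandas as pd

# Hypothetical universe: annual returns, including companies that were
# later delisted (liquidated, merged, entered Chapter 11, etc.).
universe = pd.DataFrame({
    "ticker":   ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "ret":      [0.12, 0.08, -0.65, 0.15, -0.90],
    "delisted": [False, False, True, False, True],
})

# A survivorship-biased backtest only ever sees the companies that survived...
biased_mean = universe.loc[~universe["delisted"], "ret"].mean()

# ...while an unbiased backtest includes the full point-in-time universe.
unbiased_mean = universe["ret"].mean()

print(f"biased mean return:   {biased_mean:+.2%}")    # +11.67%
print(f"unbiased mean return: {unbiased_mean:+.2%}")  # -24.00%
```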

Insufficient Sample Bias

Many researchers observe a single relationship and draw a conclusion from it. This is statistically unacceptable, of course. In order to form a conclusion about an observation, a large number of observations needs to be present.
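A quick simulation shows why. The sketch below (plain NumPy, synthetic data) draws pairs of genuinely unrelated random series and shows how often small samples produce impressive-looking correlations purely by chance:

```python
import numpy as np

rng = np.random.default_rng(0)

# For two genuinely unrelated series, small samples routinely produce
# large correlations by pure chance; large samples do not.
for n in (10, 100, 10_000):
    corrs = [np.corrcoef(rng.standard_normal(n),
                         rng.standard_normal(n))[0, 1]
             for _ in range(1_000)]
    print(f"n={n:>6}: typical |corr| = {np.mean(np.abs(corrs)):.3f}, "
          f"worst |corr| = {np.max(np.abs(corrs)):.3f}")
# n=10 will show spurious correlations above 0.8; n=10,000 stays near zero.
```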

Look-Ahead Bias

Look-ahead bias arises when one calculates a result that is biased by some degree of knowledge, whether minuscule or significant, of events that would later come to pass. This is a common problem in historical studies, rendering some of them more or less useless.

In order to avoid look-ahead bias, the human element needs to disappear from the equation. A human being cannot disregard that which he already knows.

We accomplish this by performing asset selection using a cognitive computing process.

Cognitive computing means a computing system capable of making “human-like” decisions: decisions that appear similar to those a human would make, except that no human makes them.

By having a computer select assets — in a double-blind, randomized fashion, similar to the test methodologies used to evaluate pharmaceutical products where lives are at stake — we ensure that no look ahead bias is possible.

You might guess that this places our system at an information disadvantage relative to any human being making decisions at the same time. That is partly true: as humans, we can access information from many more sources in our environment than a machine can, but we are not very good at processing the information we receive. In practice, the computer and its programming make successful, objectively correct decisions far more often than humans do.

Of course, we have engineered our calculation environment so that it only knows what was publicly disclosed at the date of any decision rendered. In this, we observe US S.E.C. laws and regulations, and base any decision only upon data actually filed with the S.E.C. at the decision point.

Therefore, it is absolutely impossible for our calculation environment to know of events that had not occurred at the time of the decision: the data is simply not there, human error is impossible, and human intervention is impossible, due to a strict set of rules that prevent all human interaction in the analysis. Consequently, no look-ahead bias can take place.
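The core mechanism is point-in-time filtering on filing dates. Here is a minimal sketch of the idea, with a hypothetical schema and dates (not our actual data model):

```python
import pandas as pd

# Hypothetical fundamentals table keyed by the date each figure was
# actually filed with the S.E.C. (not the fiscal period it covers).
fundamentals = pd.DataFrame({
    "ticker":      ["XYZ", "XYZ", "XYZ"],
    "fiscal_qtr":  ["2019Q2", "2019Q3", "2019Q4"],
    "filing_date": pd.to_datetime(["2019-08-05", "2019-11-04", "2020-02-10"]),
    "eps":         [1.10, 1.25, 0.95],
})

def point_in_time_view(df: pd.DataFrame, decision_date: str) -> pd.DataFrame:
    """Return only the rows filed on or before decision_date."""
    return df[df["filing_date"] <= pd.Timestamp(decision_date)]

# A decision made on 2020-01-15 cannot see the 2019Q4 figure, because it
# was not filed until 2020-02-10: the data is simply not there.
print(point_in_time_view(fundamentals, "2020-01-15"))
```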

Timelessness and Temporal Bias

Most models used in the financial world today exhibit what we call temporal bias. By this we are referring to the fact that most data points in the financial world have some form of temporal element associated with them.

Consider the most common model: a correlation between data from asset A and data from asset B. A “researcher” finds a correlation between the two and believes that the link between A and B can be used as part of a model. Of course, it cannot. In doing so, the researcher exhibits data-mining bias, temporal bias, and insufficient sample bias all at once, and what he builds will fail spectacularly in the future.

Here we are interested in what we call temporal bias only. A correlation between A and B may exist. But it will be of a temporary nature only, generally speaking at least.

Using a correlation in an investment model, and backtesting the model, is not what we consider science. This goes not only for correlations, but for many other forms of relationships and data points — approx. 95% of financial data. If a relationship changes over time (let’s say we define time as 1,000 years), then it simply cannot be used in a model which attempts to predict what happens in the future, for the obvious reason that markets do change. Relationships change. Correlations change. Opinions change. Most things are subject to ebb and flow.
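To see how a once-real correlation decays, consider this synthetic sketch: two return series share a common driver for the first half of the sample, after which the link simply stops (all data here is simulated):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic return series driven by a common factor in the first
# half of the sample; in the second half the relationship disappears.
n = 2_000
common = rng.standard_normal(n)
a = 0.8 * common + 0.2 * rng.standard_normal(n)
b = 0.8 * common + 0.2 * rng.standard_normal(n)
b[n // 2:] = rng.standard_normal(n // 2)   # the link simply stops existing

print(f"corr, first half:  {np.corrcoef(a[:n // 2], b[:n // 2])[0, 1]:+.2f}")  # ~ +0.94
print(f"corr, second half: {np.corrcoef(a[n // 2:], b[n // 2:])[0, 1]:+.2f}")  # ~  0.00
# A model fitted on the first half inherits a relationship that no
# longer exists: temporal bias in action.
```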

Our approach is instead to restrict our models so they never observe things which cannot be considered timeless.

This, of course, makes the process of building a model about 1,000 times more complex, as only relatively few concepts are timeless. But it also makes the ensuing economic model 1,000 times more durable and reliable.

A few examples of things that are timeless: reversion to the mean as a concept, overshooting, physics, mathematics, fear, and greed.

Some concepts are everlasting, and we submit that these are the only concepts that can be used in a model.

Data Release Timing Bias

In addition to preventing look-ahead bias, we have conducted tests that deliberately delay information relative to the time it was released historically. Information is delayed by arbitrary amounts (days, months, quarters), and the results confirm that overall performance is not sensitive to data that was very recent at the time of any decision. In other words, performance is only mildly affected, and stays ahead of the curve, even with “not-exactly-fresh” data.
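Such a delay test can be layered on top of the point-in-time filter shown earlier. A minimal sketch, again with hypothetical data:

```python
import pandas as pd

# Hypothetical point-in-time fundamentals (see the look-ahead sketch above).
fundamentals = pd.DataFrame({
    "ticker":      ["XYZ", "XYZ"],
    "filing_date": pd.to_datetime(["2019-08-05", "2019-11-04"]),
    "eps":         [1.10, 1.25],
})

def delayed_view(df: pd.DataFrame, decision_date: str,
                 delay_days: int) -> pd.DataFrame:
    """Pretend every filing arrived delay_days later than it really did,
    then apply the usual point-in-time cutoff."""
    shifted = df.assign(filing_date=df["filing_date"]
                        + pd.Timedelta(days=delay_days))
    return shifted[shifted["filing_date"] <= pd.Timestamp(decision_date)]

# Re-running the same backtest at several delays and comparing results
# tests how sensitive performance is to data freshness.
for delay in (0, 30, 90, 180):
    visible = delayed_view(fundamentals, "2019-12-01", delay)
    print(f"delay={delay:>3}d: {len(visible)} filings visible")
```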

Data-Mining Bias

When developing backtests, it is possible to mold or sculpt a method to “fit” the actual occurrences of financial history as we know it.

In doing so, one will likely arrive at a wrong conclusion. For instance, testing for negative correlations between gold and the US dollar over a short period of time could lead to the strongly erroneous conclusion that the price of the dollar sets the price of gold.

The principal way to avoid data-mining is to not search for what works.

Proper science forms a thesis that would logically work, and then ONLY tests whether or not the thesis holds.

Using this approach limits the risk of data-mining bias, as no “mining” takes place. Rather, the scientific method is used.

Another approach is to always use long periods and insist on many samples; we use 50 years. Doing so makes it less likely that any pattern will emerge by chance: a pseudo-random or temporary pattern is far less likely to fit well throughout a very long period than throughout a short one.

Also, sample size matters a great deal.
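The danger of searching for what works is easy to demonstrate. The sketch below mines 1,000 random coin-flip strategies (all data simulated) and shows that the best in-sample performer reverts to noise on fresh data:

```python
import numpy as np

rng = np.random.default_rng(7)

# "Mine" 1,000 random coin-flip strategies against random market returns.
n_strategies, n_days = 1_000, 250          # one year of daily data
market = rng.standard_normal(n_days) * 0.01
signals = rng.choice([-1, 1], size=(n_strategies, n_days))

in_sample = (signals * market).sum(axis=1)
best = in_sample.argmax()
print(f"best in-sample return:    {in_sample[best]:+.1%}")  # looks brilliant

# The same "winning" strategy on fresh data reverts to noise.
new_market = rng.standard_normal(n_days) * 0.01
out_of_sample = (signals[best] * new_market).sum()
print(f"its out-of-sample return: {out_of_sample:+.1%}")    # ~ zero on average
```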

Finally, our methods are based on sound business acumen and 20 years of experience from the business world; they are not based on arbitrary data. They work not only in practice but also in theory, and they are derived from good business practices, not from investment or trading practices, as is usually the case.

Simply put, avoiding data mining avoids data-mining bias.

Issues with Market Capitalization, Liquidity, or Moving the Market

Any investment method has to contend with the fact that market participants move the market.

Each company has a finite number of shares, and sufficiently large investments would change the prices at which these trade and, hence, invalidate the historical data. (History would have appeared different if more participants had performed certain actions, contrary to what they did in actual history.)

Numerous methods can be used to avoid this. One of these is to ensure that we only use candidates with reasonable liquidity, that is, trading volume in the period in question. This is because thinly traded issues tend to be more easily influenced by market demand.

Another method is to restrict the market capitalization, thus removing micro-cap shares from consideration, based on the fact that larger capitalization stocks are influenced less by increased demand.

Third, and most importantly, trading is restricted to a small percentage of the volume traded historically, in order to minimize market impact.
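Together, these three constraints can be expressed as simple filters. Here is a minimal sketch; the tickers, numbers, and thresholds are illustrative assumptions, not our production parameters:

```python
import pandas as pd

# Hypothetical candidate list with market cap and average daily dollar volume.
candidates = pd.DataFrame({
    "ticker":     ["AAA", "BBB", "CCC"],
    "market_cap": [5e9, 8e7, 2e10],     # USD
    "adv_usd":    [4e7, 2e5, 1.5e8],    # average daily dollar volume, USD
})

MIN_MARKET_CAP    = 3e8    # drop micro-caps (illustrative threshold)
MIN_ADV           = 1e6    # require reasonable liquidity
MAX_PARTICIPATION = 0.01   # trade at most 1% of historical daily volume

liquid = candidates[(candidates["market_cap"] >= MIN_MARKET_CAP)
                    & (candidates["adv_usd"] >= MIN_ADV)]

# Cap the order size so the backtest never assumes trades large enough
# to have moved the historical prices themselves.
liquid = liquid.assign(max_order_usd=liquid["adv_usd"] * MAX_PARTICIPATION)
print(liquid)
```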

Optimal Period Bias

We have seen several studies which were made within a time frame at which they excelled, and conveniently did not include time spans when they performed poorly.

We avoid such behavior and emphasize the maximum possible length of our studies, even though yearly performance would obviously be significantly higher if we selected sub-periods in which our methods excel.

Also, some studies have used 10- or 20-year periods, which seem too short for statistically sound evaluation. We believe 50 years is sufficient.
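How flattering a cherry-picked window can be is easy to quantify. This sketch (simulated annual returns) compares a full 50-year CAGR against the best 10-year sub-window a biased study might choose:

```python
import numpy as np

rng = np.random.default_rng(3)

# 50 years of hypothetical annual strategy returns.
annual = rng.normal(loc=0.10, scale=0.18, size=50)

def cagr(rets: np.ndarray) -> float:
    """Compound annual growth rate over a series of yearly returns."""
    return np.prod(1 + rets) ** (1 / len(rets)) - 1

full = cagr(annual)
# Cherry-pick the best 10-year window, as a biased study might.
best10 = max(cagr(annual[i:i + 10]) for i in range(len(annual) - 9))

print(f"full 50-year CAGR:        {full:.1%}")
print(f"best 10-year window CAGR: {best10:.1%}")  # flattering, and misleading
```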

Friction/Trading Costs

An argument posed against a number of studies with seemingly sound results is friction costs, that is, trading-related costs, which would in the end have a negative effect on performance.

This is no longer the issue it used to be, with equity trading costs as low as USD 0.005 per share through direct-access brokers, and some brokers even offering “free” trading.

However, we still put emphasis on it, ensuring that our approach only trades once a week; in general, the turnover in our portfolios can be described as extremely low. Hence, none of our methods are meaningfully hurt by friction costs.
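Back-of-the-envelope arithmetic shows why low turnover neutralizes friction. All figures below are illustrative assumptions, not our actual cost structure:

```python
# Rough effect of per-share commissions on a low-turnover weekly strategy.
commission_per_share = 0.005   # USD, direct-access broker rate
share_price          = 50.0    # USD, assumed typical holding
annual_turnover      = 1.0     # portfolio traded over roughly once per year

# Cost per dollar traded, doubled to cover both the buy and the sell.
cost_per_dollar = 2 * commission_per_share / share_price
annual_drag = cost_per_dollar * annual_turnover
print(f"annual friction drag: {annual_drag:.4%}")  # 0.0200% of the portfolio
```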

Conclusion

Backtesting as a concept may in some quarters have a less than stellar reputation.

This is entirely undeserved. As a concept, backtesting is an intrinsic part of the scientific method: a way to measure a thesis accurately and properly, if used correctly.

It is the people who misuse data and models, in both academia and practice, who are to blame for why some backtested models fail to operate sufficiently well in real time. Those models were either built or tested improperly.

In 2008 we completed a paper on our investment methods at Weekly Stocktip, in which we predicted that a future performance of approx. 36% return p.a. would be possible.

In 2009, 2010, and 2011 we conducted a pilot project in which we invested on the basis of our research. Our results were 42.75% p.a. The results have been audited by a state-licensed auditor.

Our most recent models have also performed on par with this 2008 prediction, as the average results since 2009 have been a CAGR of 40.10%. Recent performance can be observed in the graph below.

Weekly Stocktip performance vs. Dow Jones Industrial Average

It can be observed that, if properly built, an investment model can perform in the future exactly as it has done in the past.

Access more of our investment research or subscribe to our investment models at https://weeklystocktip.com
