Algo trading backtest shenanigans part 1

Aug 15, 2022

Algorithmic science is one of the world’s most accurate and fact-based subjects. Yet in algo-trading and quant trading, one of the biggest lies you will encounter sits front and center whenever you are shown a model’s theoretical performance. It is called backtesting.

Backtesting, at least in theory, should be the assessment and evaluation of a given model’s theoretical performance. The reasoning goes: “If I had traded stocks based on my model’s predictions from year X to year Y, I would have had a yearly return of Z percent. So, trading on the model’s current predictions, I should get the same yearly return.” If this logic works reliably, we’ll be able to know which models are worth using for investing.

The backtest methodology separates the good from the bad in everything related to model selection and training methodologies. It usually ranges from fraudulent and self-delusional to meticulous and exhausting.

Make no mistake — doing it the “right” way is very cumbersome.

I stopped counting how many hours we have spent at FinityX making sure we do this the right way!

Let’s say you had a new idea you wanted to test out. Can you really hold off from trading on a model with an excellent return graph fresh out of the ML oven? In this article, we will go through how to properly assess the performance of such a model and of the methodology used to train it.

The natural first step is to look at the model’s performance on the training set. Graph 1 shows our new model’s cumulative profit on the training dataset. This model takes certain features and creates an investment strategy for daily trading on the S&P 500 ETF. You’re probably wondering what features were used to get this result. We’ll talk about that later.

Graph 1: train dataset

Would you invest in this?

In this series, I want to shine a light on the most common mistakes people make when evaluating the theoretical performance of their trained models. By the end, you will not be tricked by a great graph or, even worse, believe you have found the golden goose and then watch your savings disappear.

Part 1: The Classic Train/Validation/Test split

First, it’s important to get a good intuition for what overfitting really is. Imagine an intelligent yet lazy student, capable of learning complex ideas but also of memorizing answers. If you test the student on problems he has already seen in the textbook, he will likely achieve a perfect score. Does that mean he genuinely understands the subject? Of course not. He probably just memorized the answers and wrote them down.

On the other hand, if you test the student on problems he has never encountered and he still does well, you have good reason to believe that he truly understands the subject. This is the idea behind a “train-test split”.
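A train-test split can be sketched in a few lines. The sketch below assumes the data is ordered in time and keeps the most recent slice as the test set, which is a common choice for market data but is our assumption here, not something the split itself requires. The function name `chronological_split` and the 1,000-day example are hypothetical.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split time-ordered rows into (train, test) without shuffling.

    The test part is always the most recent slice, so the model is
    never evaluated on data that precedes what it trained on.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    n_test = int(len(rows) * test_fraction)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


# Hypothetical data: 1,000 trading days, identified by their index.
days = list(range(1000))
train_days, test_days = chronological_split(days, test_fraction=0.2)
print(len(train_days), len(test_days))
```

The student analogy maps directly: `train_days` is the textbook, and `test_days` is the exam with questions he has never seen.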

So what’s the deal with the validation set?

Ok, I lied. The computer is not like a student: depending on the problem, and given enough time, it will often reach the point of “memorizing”. Unlike the student, it doesn’t distinguish between memorizing and understanding, and it does not consider memorizing to be “cheating”.

So what should we do? We introduce a new partition, called validation, which monitors the level of memorizing vs. understanding. The validation set acts as a proxy for the test set. By tracking performance on the validation set, we can stop training when validation performance starts getting worse while training performance keeps improving. At that point, the model has stopped learning general, global principles and has started memorizing details unique to the training set.

As training continues, we want to save the version of the ML model that understood the problem before it “went too far” and started memorizing.

After we have finished training the model and kept the version that performed best on the validation set, we get to test it on … wait for it … the test part of the split. Now you have a reliable, reproducible result.
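The “keep the best version, stop when validation gets worse” logic is often called early stopping. Here is a minimal sketch of it, assuming you already have one validation loss per training epoch. The `patience` parameter (how many non-improving epochs to tolerate) and the loss numbers are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    best_epoch is the epoch whose model version we would keep.
    stop_epoch is the epoch at which we would give up training.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation improved: this is the version worth saving.
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Illustrative (made-up) curves: training loss keeps falling,
# validation loss turns around after epoch 3.
train_losses = [1.00, 0.80, 0.65, 0.55, 0.48, 0.42, 0.37, 0.33]
val_losses = [1.05, 0.90, 0.80, 0.78, 0.81, 0.85, 0.90, 0.96]

best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)
```

In this toy example the saved model is the one from epoch 3, even though the training loss is still improving at epoch 7. That gap is the memorizing the validation set is there to catch.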

Judging a model by its performance on the training data rewards exactly this memorization, which is what we call “overfitting”. Trusting training-set results is the most basic pitfall when assessing a given model’s performance.

Part 2: Backtest or Backtrain?

The performance in Graph 1 looks too good to be true because it is. What can be mistaken for a backtest is actually what we should call a “backtrain”.

As we discussed, it is well established that you need a training/validation/test split. Still, the devil is in the details, so there are several questions we need to answer to make sure we get reliable results and can properly assess performance when presented with a new investment opportunity.

1. Train/Validation/Test split — Does everybody use it?

Absolutely not. The field of algo-trading is plagued with con men selling snake oil with nice graphs. Some are very well aware of it, and some are not. How do you tell them apart? It all comes down to asking the right questions about the backtesting methodology and examining it closely. So, let’s take a look at the model’s performance on the validation set:

Graph 2: validation

Many would really consider this a backtest

Performance on the validation set — is it a proper backtest?

Many researchers would consider the performance in Graph 2 to be a proper backtest for the strategy. As we’ll see, it can be very misleading. But what’s wrong? The validation set holds data unseen by the model during training. While that is true, our chosen model was selected because of how it scored on the validation set, which means the training process as a whole has already taken validation performance into account.

This means we can’t look at the validation set’s performance as a theoretical real-time performance. Since this set is intrinsic to the whole training process, it’s possible that the model we chose only did well in this specific period, which is why it was selected in the first place.

What about k-fold or nested cross-validation?

Assuming you don’t expect your model to perform at its best forever, and that you will need to retrain it periodically to adjust to new market behavior, K-fold cross-validation is recommended.

Classic K-fold cross-validation partitions the data into K folds, each with a different allocation of training and validation data, and estimates the model’s performance as the mean performance across all K validation sets.
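To make the K-fold idea concrete, here is a rough sketch in plain Python. It assumes contiguous folds and uses a deliberately trivial “model” (predict the mean of the training targets) so the example stays self-contained. The names `k_fold_indices`, `cross_val_mean`, and `mean_baseline_error`, as well as the target values, are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_val_mean(n_samples, k, fit_and_score):
    """Average a score over all k validation folds."""
    scores = [fit_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores), scores


# Made-up targets, standing in for whatever the model predicts.
targets = [0.5, -0.2, 0.1, 0.4, -0.3, 0.0, 0.2, -0.1, 0.3, -0.4]


def mean_baseline_error(train_idx, val_idx):
    """'Train' by averaging the training targets, score by mean absolute error."""
    prediction = sum(targets[i] for i in train_idx) / len(train_idx)
    return sum(abs(targets[i] - prediction) for i in val_idx) / len(val_idx)


mean_error, fold_errors = cross_val_mean(len(targets), 3, mean_baseline_error)
print(mean_error, fold_errors)
```

Note that in classic K-fold some training folds lie after the validation fold in time, which is exactly why time-series data calls for a different arrangement.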

When evaluating models based on time-series data, like the stock market, it would be wise to use a K-fold strategy called “rolling backtest”. Doing a rolling backtest means training a model up to a specific historical date, leaving (for example) the next year as a validation set and the rest of the data as a test set. You then save the best model under this specific partition and restart the training from scratch, but every data set moves one year forward in time.

This method should give you a “simulation” of what would have happened in real-time if you had trained the model from scratch every year, leaving one year for validation. In addition, this technique should give you a sense of the robustness of the model. You can see how retraining it for an additional year at a time affects the performance.
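The partitions of a rolling backtest can be sketched as follows. This is one possible reading of the scheme: we assume the training window always starts at the first year and grows by one year per step (an expanding window), with one validation year after it and everything later used as the test set. The function name and the 2010 to 2019 range are hypothetical.

```python
def rolling_backtest_windows(years, min_train_years=3, val_years=1):
    """Return a list of (train, validation, test) year lists.

    Each step moves the validation and test boundaries one year forward.
    Every window keeps at least one year for testing.
    """
    if min_train_years < 1 or val_years < 1:
        raise ValueError("min_train_years and val_years must be at least 1")
    windows = []
    for cut in range(min_train_years, len(years) - val_years):
        train = years[:cut]
        validation = years[cut:cut + val_years]
        test = years[cut + val_years:]
        windows.append((train, validation, test))
    return windows


# Hypothetical history: ten years of data.
years = list(range(2010, 2020))
windows = rolling_backtest_windows(years, min_train_years=3, val_years=1)
for train, validation, test in windows:
    print(f"train {train[0]}-{train[-1]} | val {validation} | test {test[0]}-{test[-1]}")
```

For each window you would train from scratch on `train`, pick the best version using `validation`, and only then score it on `test`. A fixed-length training window that slides forward is an equally reasonable variant.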

Is it enough?

Generally NO.

Here is where we get to the main point. Many algorithm developers and data scientists already know the basic principles we have discussed so far. And there is no problem with this training scheme as such: it has a solid theoretical basis, and in theory a model with excellent performance under these conditions should perform very well when deployed.

The real problem is YOU.

Let’s say you have a new idea that you want to train and test. You use the rolling backtest scheme to ensure the model’s performance is reliable, and you get a bad result. Will you start over from scratch? Probably not. You would probably want to hand-tweak a few things and try again. I believe that most people who try ML to find a successful model for algo-trading will not stop at the first failure and will try to tweak things with many iterations for the same overall idea.

If you do it enough times, you are essentially cheating by picking and choosing the best result. “But what’s the problem?” you’ll probably ask. The rolling backtest is a robust method. It gives you the performance on data not seen by the model. How can simply tweaking different aspects create great, consistent results on a rolling backtest? Well, I would argue that even a randomly generated model can get good backtest results with enough “trial and error”. If you keep tweaking minor things until you succeed, you will get an excellent backtest, but it won’t work in the future. In fact, the model we examined throughout this article, which traded on the S&P 500 ETF, used randomly generated features.

The thing is, people often mistake consistency for reliability. They are sure that if their backtest shows consistent results, it must be because the model understands the market and has found a solid strategy. Graph 3 shows such a model’s performance on the test set, after 50 different tries and taking the best one. After enough rounds of trial and error in which you get to see the test set, you can get amazingly consistent results that make you believe you have actually found a winning strategy.

Still, trading using this strategy for the next few months will yield random results since the model didn’t really learn anything. You can see the results in Graph 4. This is how you can waste precious time and money testing a model with no grip on reality.

Graph 3: best test from 50 tries

If at First, You Don’t Succeed, Try, Try Again?

Graph 4: paper trading

In the end, this is what you will get.
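You can reproduce the spirit of Graphs 3 and 4 with a toy simulation. The sketch below is not the model from this article: it uses synthetic daily returns (not real S&P 500 data) and “strategies” that are nothing more than a random long or short position each day. All names and parameters are hypothetical.

```python
import random


def simulate_best_of_n(n_tries=50, test_days=250, paper_days=60, seed=0):
    """Pick the best of n skill-less strategies on the test period,
    then report how that same strategy does on a later paper period."""
    rng = random.Random(seed)
    # Synthetic daily market returns (NOT real market data).
    market = [rng.gauss(0.0003, 0.01) for _ in range(test_days + paper_days)]
    results = []
    for _ in range(n_tries):
        # No skill at all: a coin flip decides long (+1) or short (-1) each day.
        positions = [rng.choice((-1, 1)) for _ in market]
        pnl = [p * r for p, r in zip(positions, market)]
        test_return = sum(pnl[:test_days])
        paper_return = sum(pnl[test_days:])
        results.append((test_return, paper_return))
    # "Trial and error": keep whichever try looked best on the test set.
    best_test, paper_of_best = max(results, key=lambda pair: pair[0])
    return results, best_test, paper_of_best


results, best_test, paper_of_best = simulate_best_of_n()
average_test = sum(t for t, _ in results) / len(results)
print(f"average test return over all tries: {average_test:+.3f}")
print(f"best test return (the one you'd show): {best_test:+.3f}")
print(f"same strategy on the paper period:     {paper_of_best:+.3f}")
```

The selected strategy is guaranteed to look better than the average try on the test period, simply because we chose it for that reason. Its paper-period result, on the other hand, is just another coin flip.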

The first question that comes to mind is: does this problem plague other fields?

Absolutely yes.

Well then, what is the solution?

Well, the obvious one is just to add another test period. We’ll call it a “paper trading period”. In this period, we examine the model’s performance on live data without actually using money until we get conclusive results on data that no one has ever seen. It is highly recommended, but it’s not enough.

The main problem remains. With enough so-called “trial and error”, you will get some good results. We didn’t solve the YOU problem. To really get over that, and to be sure that you actually designed something useful, you need to statistically analyze the pass/fail rate: how many of the models that look “good” on the test set also look “good” in the paper trading period.

If, for example, you have trained 100 models that do well on the rolling backtest scheme, but only five do well in the paper trading period, choosing these specific five for trading will get you back to square one. It is essentially the same as manual trial and error on the test set. If this is the case, you should question your initial design and consider starting from scratch.

But if you have 100 models that do well on the rolling backtest scheme, and a substantial number of them also do well in paper trading, then you probably have something worthwhile.

It’s a challenging and time-consuming way of working, but it is necessary, and it will probably save you more time than constantly paper trading random models that happen to perform well on the test set. You need to see a statistically significant number of models that performed well in the test period and also perform well in the paper trading period, counted against all of your past failures. Filtering out bad models based on the paper trading phase is tempting, but it will create more overfitting.
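One simple way to put a number on “a substantial number of them” is a binomial tail probability. The sketch below asks: if every model that passed the backtest had no real skill and passed the paper trading period purely by luck with probability `p0`, how likely is it to see at least `k` passes out of `n`? The value of `p0` is an assumption you have to estimate yourself (0.05 below is only a placeholder), and the calculation treats the models as independent, which closely related models are not.

```python
from math import comb


def binomial_tail(n, k, p0):
    """P(X >= k) for X ~ Binomial(n, p0): the chance of at least k
    lucky passes among n skill-less models."""
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 must be a probability")
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


n_backtest_passes = 100  # models that looked good on the rolling backtest
chance_pass_rate = 0.05  # ASSUMED luck-only pass rate in paper trading

# Five of 100 pass paper trading: about what luck alone would produce.
p_five = binomial_tail(n_backtest_passes, 5, chance_pass_rate)
# Forty of 100 pass paper trading: very hard to explain by luck alone.
p_forty = binomial_tail(n_backtest_passes, 40, chance_pass_rate)
print(f"P(at least 5 passes by luck)  = {p_five:.3f}")
print(f"P(at least 40 passes by luck) = {p_forty:.2e}")
```

Under these assumptions, five passes out of 100 tells you nothing, while forty out of 100 suggests the design itself has merit. The count `n` must include every model that passed the backtest, not only the ones you still like.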

It’s also important to note that in the case of algo-trading, adding another phase of testing the model with real money will be very beneficial. It will help you to properly assess deployment costs, slippage costs, etc.
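A live phase matters because a backtest usually ignores the cost of changing positions. As a rough illustration only, the sketch below subtracts a flat cost per unit of turnover from each day’s return. Real slippage depends on liquidity, order size, and execution, so the flat rate `cost_per_unit_turnover`, the positions, and the returns here are all made-up assumptions.

```python
def net_returns(positions, market_returns, cost_per_unit_turnover=0.0005):
    """Daily strategy returns after a flat cost on every position change.

    positions: target position per day (+1 long, -1 short, 0 flat).
    market_returns: the market's return on the same days.
    """
    if len(positions) != len(market_returns):
        raise ValueError("positions and market_returns must have equal length")
    net = []
    previous = 0.0  # we start with no position
    for position, market_return in zip(positions, market_returns):
        turnover = abs(position - previous)  # flipping long->short costs 2 units
        net.append(position * market_return - turnover * cost_per_unit_turnover)
        previous = position
    return net


# Made-up four-day example.
positions = [1, 1, -1, 0]
market_returns = [0.01, -0.005, 0.002, 0.01]

gross_total = sum(p * r for p, r in zip(positions, market_returns))
net_total = sum(net_returns(positions, market_returns))
print(f"gross: {gross_total:.4f}, net of costs: {net_total:.4f}")
```

Even in this tiny example, costs eat about two thirds of the gross return. A strategy that trades daily can look very different once such costs are measured with real money.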

To summarize:

A backtest of a given model’s performance doesn’t mean anything without a rigorous methodology behind it. As a researcher or a potential client, you need to examine the backtest methodology for any possible holes, so that you can reliably distinguish between great business opportunities and scams.

The proper way to do so is to use a train/validation/test split in a nested cross-validation scheme with the addition of a paper trading period and reliable statistics of the percentage of models that perform consistently. Do each of these steps, and throw in a live trading test. Got all of those? Excellent. Follow the next article for more pitfalls. If you don’t have all of those, feel free to contact us.

This article is part of a series of articles on various subjects and fields by the FinityX team.

We believe that sharing knowledge and helping others is a part of our essence and our ability to thrive in a world full of investment (but not only) opportunities.

If you like this, please feel free to look for more content from our team on our LinkedIn page:

And please feel free to ask me any question and discuss those matters with me: