A machine learning framework for algorithmic trading

AlphaGrow
16 min read · Jun 12, 2021


Photo by M. B. M. on Unsplash

In the world of algorithmic trading, many practitioners are skeptical about the existence of successful machine learning (ML)-based strategies. They have no shortage of arguments: “because of the low signal-to-noise ratio inherent to financial data, ML models can’t learn much from them and will easily overfit”, “financial data are unlikely to be IID (Independent and Identically Distributed), and most supervised ML models were built with that assumption in mind”, etc.

Are these reasons enough to make you abandon the idea of building your own profitable ML-based trading algorithm? Well, it depends on the amount of effort you are willing to put in, and it most likely won’t be easy. However, it is entirely possible.

First, the naysayers are usually referring to the task of predicting the price of an asset in the near future from market data alone. This is quite restrictive. A good dose of domain knowledge, a bit of out-of-the-box thinking and a tremendous amount of data collection and wrangling should lead to the formulation of other ML problems whose solutions could be part of a trading strategy. For instance, one could monitor and measure social media activity around a particular asset and try to predict its future movements. Also, the target variable does not have to be the price: it can be anything from which you can decide whether to buy or sell a particular asset. While simple data wrangling and model fitting might be sufficient in other domains, ML for algorithmic trading requires more than that.

Also, one can look around and notice that some big actors in the field are finding success with ML. For instance, Renaissance Technologies LLC, one of the most successful hedge funds ever, is famous for hiring mathematicians and physicists with little financial background and relies mostly on ML. One could argue that these are top-notch experts with access to exceptional resources in terms of data, infrastructure and computational power, and that because of them there is probably not much alpha left. Well, it depends on which markets and assets we are talking about.

Because it is still a relatively young market, and because of its highly speculative nature, it is hard to believe that the cryptocurrency market respects the Efficient Market Hypothesis (EMH). Therefore, as of today, there is probably room to generate alpha while trading cryptocurrencies even for the small fishes.

At AlphaGrow, we have developed a few profitable strategies to trade cryptocurrencies based on mathematics, statistics and machine learning. You can follow our trades by visiting our website: https://alphagrow.io

In this post, we will present an example of a methodology for developing a successful ML algorithm that could be part of a trading strategy for cryptocurrencies. We will first select a dataset and then build a relevant target variable. We will disclose a way to optimise your parametric features. Then we will train and evaluate multiple XGBoost models in a walk-forward validation fashion, before presenting and discussing the results.

Here is the summary:

The data

The target variable

Features optimisation

XGBoost

Training and evaluation

Results

Conclusion

The data

In this work, we are using 1 year of market data (with a sampling period of 1 min) from June 1st, 2019 to May 31st, 2020 for the BTCUSDT pair. This is an arbitrary choice, and you could follow the approach below with a different time range, sampling period or financial instrument.

Note that if you are looking for cryptocurrency market data, you can find a regularly updated dataset comprised of 250+ X-USDT and X-BTC pairs since 2017 here: https://alphagrow.io/data_sharing.html

The market data is composed of the usual information:

  • price (close, open, high, low)
  • volume
Market data sample from the AlphaGrow data lake
Data types

The target variable

Before we can proceed further, we need to define what we want to predict. The model should be able to predict anything that allows us to decide if we should buy the asset, sell it or not do anything.

This is a critical step because the target variable will influence many choices down the line.

We first define the take profit T/P and the stop loss S/L. At time t, we want to predict if T/P will be hit before S/L in the future. If that is the case, then we buy with the intention to sell when T/P is hit. If that’s not the case, we don’t do anything.

In other words, at time t, we look at all times t’ > t and compute the return between t and t’. By comparing that value with T/P and S/L, we can compute the label as shown below.

Generation of positive labels: “buy at time t”
Generation of negative labels: “do not do anything at time t”

Basically, we end up with a binary target variable that will tell us if we should buy the asset with the hope that we will sell it via T/P or not do anything because there is a high chance we would hit S/L first.

For now, we won’t decide on any T/P or S/L values. This will be done later when we are fine-tuning the parameters of our features in order to maximise their predictive power towards the target variable.

Here it is important to note that for each time point t we use information from multiple data points occurring after t in order to compute the corresponding label. We decide to store in a variable the number of iterations necessary to compute the label of each time point t. We call this number “patience” because it quantifies how long we had to wait before the label could be assessed (either because the take profit was hit first, the stop loss was hit first or the end of the time series was reached). Keep this in mind because it will play an important role later.
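A minimal sketch of this labelling scheme (function and variable names are ours, and the nested loop is kept naive for clarity rather than speed):

```python
import numpy as np

def make_labels(close, take_profit, stop_loss):
    """For each time t, label 1 if the return hits take_profit before
    falling below -stop_loss, else 0, and record how many steps
    ("patience") were needed before the label could be assessed."""
    n = len(close)
    labels = np.zeros(n, dtype=int)
    patience = np.zeros(n, dtype=int)
    for t in range(n):
        label, steps = 0, n - 1 - t  # default: unresolved until the series ends
        for t2 in range(t + 1, n):
            ret = close[t2] / close[t] - 1.0
            if ret >= take_profit:        # take profit hit first -> buy signal
                label, steps = 1, t2 - t
                break
            if ret <= -stop_loss:         # stop loss hit first -> do nothing
                label, steps = 0, t2 - t
                break
        labels[t], patience[t] = label, steps
    return labels, patience
```

The `patience` array is exactly the per-point waiting time mentioned above; we will need it later for purging.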

We also need some form of KPI to know if the model is satisfactory or not. For instance, we can aim at having a positive expected return.
Let us translate that in terms of the model’s minimum acceptable performance.

In this context of binary classification, we note TP the number of true positives (predictions for which the classifier correctly estimates that we should buy) and FP the number of false positives (predictions for which the classifier tells us to buy while we shouldn’t). We can thus define the precision metric:
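With TP and FP as defined above:

```latex
\text{precision} = p = \frac{TP}{TP + FP}
```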

A high precision means that when the model predicts that we should buy, it is likely correct.

If we note f, the transaction fees, we can approximate the expected return:
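Reconstructed from the surrounding definitions, the approximation reads:

```latex
\mathbb{E}[R] \approx p\,(T/P - 2f) - (1 - p)\,(S/L + 2f)
```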

Indeed, the precision p can be approximated as the percentage of buy orders that led to a profitable trade, with a profit equal to T/P. We have to factor in the fact that we paid fees for the buy order and for the sell order.
Then, 1-p is an approximation of the percentage of buy orders that were incorrectly suggested by the model and lead to an exit via S/L. Again, we have to take into account the fees.

We can infer an estimation of the minimum precision that would guarantee a positive expected return:
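Setting the expected return above zero and solving for p (consistent with the threshold used later in the article):

```latex
\mathbb{E}[R] > 0 \iff p > \frac{S/L + 2f}{T/P + S/L}
```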

Features optimisation

Now that we have defined the target variable to predict, we need features that will be used as inputs to our model.

Because we are protective of our IP, we won’t disclose the full details regarding how our features are crafted.

What we can say is that our features are derived from rolling-based operations on price and volume. Multiple parameters have to be selected, including the ones corresponding to the sizes of the different rolling windows.

These parameters, as well as the take profit and the stop loss are fine-tuned so that the predictive power of each feature towards the target variable is maximised.

For this part, you can engineer your own features. Even if they are not parametric, this section remains relevant because it allows you to find optimal values for S/L and T/P that are used to define the target variable.

PPS metric to measure predictive power of indicators

Many practitioners use correlation (Pearson, Kendall or Spearman) to assess the predictive power of one variable towards another. However, these statistics have limitations: Pearson correlation only captures linear relationships, Kendall and Spearman only monotonic ones, and all three are symmetric, so they cannot express asymmetric relationships. Besides, they are only meant for relationships between two numerical variables, whereas here our target variable is binary, and even some of the features could be binary or categorical.

To avoid these shortcomings, we decide to rely on another metric, the PPS (Predictive Power Score).

Let’s consider two variables var1 and var2. PPS(var1, var2) will take values between 0 (no relationship between both variables) and 1 (there is a direct relationship from var1 to var2).
If var2 is a binary variable, the F1 score of a Decision Tree Classifier (Model1) using var1 as sole feature and var2 as target variable is compared to the F1 score of a dummy model (Model2) that predicts the most common class of var2 for all inputs. The greater the F1 score of Model1 compared to Model2, the higher the PPS score between var1 and var2. In other words, a high PPS score between var1 and var2 indicates that knowing the value of var1 gives us significant information regarding the value of var2.
If var2 is a numerical variable, Model1 becomes a Decision Tree Regressor, the metric of interest becomes MAE (Mean Absolute Error), and the dummy model returns the median value of var2 for all inputs.
A well-documented Python module, ppscore, is available and easy to use.
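To make the mechanism concrete, here is a simplified re-implementation of the binary-target case with scikit-learn. This is our own illustrative sketch, not the official ppscore code, and it glosses over details of the real library (naive-baseline choices, preprocessing):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

def pps_binary(var1, var2, cv=4):
    """Crude PPS for a binary target: compare a single-feature decision
    tree against a most-frequent-class dummy model on weighted F1."""
    X = np.asarray(var1).reshape(-1, 1)
    y = np.asarray(var2)
    f1_tree = cross_val_score(DecisionTreeClassifier(max_depth=4), X, y,
                              cv=cv, scoring="f1_weighted").mean()
    f1_naive = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y,
                               cv=cv, scoring="f1_weighted").mean()
    if f1_naive >= 1.0:  # degenerate case: the dummy model is already perfect
        return 0.0
    # 0 = no better than the dummy model, 1 = perfect classifier
    return max(0.0, (f1_tree - f1_naive) / (1.0 - f1_naive))
```

A feature that perfectly determines the label scores close to 1, while a feature unrelated to the label scores close to 0.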

Multi-objective optimisation to optimise the parameters of the features

The next step is to optimise the parameters of our features, as well as S/L and T/P, so that the PPS value between each feature and the target variable is maximised.

You should recognise the definition of a multi-objective optimisation problem usually framed as follows:
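In standard minimisation form:

```latex
\begin{aligned}
\min_{x}\quad & f_m(x), & m &= 1,\dots,M\\
\text{s.t.}\quad & g_j(x) \le 0, & j &= 1,\dots,J\\
& h_k(x) = 0, & k &= 1,\dots,K\\
& x_i^{L} \le x_i \le x_i^{U}, & i &= 1,\dots,N
\end{aligned}
```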

Obviously maximising a certain function f is equivalent to minimising -f, therefore the formulation above is general enough. Here we want to optimise M (equal to the number of features) PPS functions that depend on N variables composed of the parameters of the features, T/P and S/L. We could also include J inequality constraints and K equality constraints.

We decide to solve this with pymoo, a Python library that offers a framework for single and multi-objective optimisation.

The first step is to create an object belonging to (or extending) the class “Problem” whose purpose is to define the M functions to optimise, the boundaries of the N variables as well as the constraints if you decide to add some.

The second step is to choose a relevant optimisation algorithm. This choice depends on the type of problem (e.g. single-objective or multi-objective). For multi-objective optimisation, the library offers several genetic algorithms, including NSGA-II and R-NSGA-II. Genetic algorithms are meta-heuristics inspired by the processes of natural selection; more specifically, they mimic the mechanisms of mutation, crossover and selection. Mutation slightly alters, and thus diversifies, the parameters from one generation (iteration i) to the next (iteration i+1), with the hope that some of the alterations yield better optima. Crossover combines the parameters of parents from the same generation (iteration i) to produce new offspring (iteration i+1). With selection, the weakest members (the ones that yield the worst optima) of a generation (iteration i) are discarded, so that only the “fittest” ones are allowed to breed and produce the next generation (iteration i+1). We arbitrarily decide to proceed with NSGA-II.

A termination criterion also needs to be selected. There are multiple options, including the number of generations (iterations) of the algorithm, its execution time, the number of generations after which the N variables have stopped evolving significantly (according to a tolerance x_tol) or the number of generations after which the M objective functions have stopped improving significantly (according to a tolerance f_tol).
Here, we will stop the algorithm if none of the M PPS scores has improved by at least 0.05 over 3 consecutive generations.

As far as the implementation is concerned, we will let the reader familiarise themselves with the library, which has very clear and easy-to-follow documentation.

After several iterations, we are left with multiple sets of parameters that lie on the Pareto front. This means that no set is superior to another: moving from one set to another improves the PPS score of at least one feature while degrading at least one other. We select the set that maximises the harmonic mean of all the optimised PPS scores, and within that set, we remove all the features whose PPS score is considered too small (below 0.1).
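The selection step just described can be sketched as follows (array and function names are ours):

```python
import numpy as np

def select_parameter_set(pareto_pps, min_pps=0.1):
    """pareto_pps: shape (n_sets, n_features), the PPS score of each feature
    for each Pareto-optimal parameter set. Returns the index of the set with
    the best harmonic mean and a mask of the features kept within it."""
    scores = np.asarray(pareto_pps, dtype=float)
    # Harmonic mean n / sum(1/x); guard against division by zero scores
    hmean = scores.shape[1] / np.sum(1.0 / np.maximum(scores, 1e-12), axis=1)
    best = int(np.argmax(hmean))
    keep = scores[best] >= min_pps   # drop features with too little signal
    return best, keep
```

The harmonic mean heavily penalises any near-zero PPS score, so a set with one useless feature loses to a set with uniformly moderate scores.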

It gives us the following values of take profit and stop loss, which will be used to generate the target variable:

  • T/P = 0.0362
  • S/L = 0.0103

And thus, we have a set of features with their corresponding labels.

Our anonymised 16 features along with the target variable

XGBoost

Ensemble methods are techniques that produce multiple simple models (“weak learners”: models that are slightly better than a random guess) and combine them in a way that creates a model that is better than any of the simple models.

Boosting is one type of ensemble technique where weak learners are being trained iteratively so that each weak learner compensates for the weaknesses of the previous weak learner.

Over the last few years, XGBoost has become one of the most popular algorithms because of its speed, its performance across a wide range of problems and its ease of use. It is essentially an open-source implementation of gradient boosting, which builds on the boosting framework.

That’s the model we will be using in this article, but it might be worth exploring other options.

Training and evaluation

Now that we have a matrix of features X, the corresponding labels y and a model, we can proceed with training and evaluation.

K-fold cross validation and its limitations

Usually, k-fold cross-validation is one of the go-to methods to evaluate a machine learning model. In its simplest form, it consists in creating k train-test splits of the data so that each data point appears in a test set exactly once. Thus, we end up with k models that are evaluated separately. The overall performance score is the result of aggregating the k scores (usually the mean, but it could be something else depending on what makes the most sense). The purpose of the technique is to get a better estimate of the model’s ability to generalise to unseen data.

K-fold cross-validation data split with k=4

Because of the nature of the problem we are trying to solve, using it would not be rigorous.
The first reason, which is the least problematic, is that we would end up with data points in the training sets that occur after some of the data points in the test set. In production, we will always use past data to predict future data.
The second reason is more concerning. Our target variable is built such that consecutive labels will be highly correlated. Therefore, by doing a vanilla k-fold cross validation we would end up with data points in the training sets that contain information from the future and thus we would have deceptively positive results.

Walk-forward cross validation with purging

We decide to use walk-forward cross validation with a sliding window instead. Basically we will end up with multiple train-test splits so that for each split:

  • the test set occurs after the training set
  • each training set is comprised of N1 data points, and each test set of N2 data points. We use N1 = 60*24*30*3, which equates to roughly 3 months of 1-minute data, and N2 = 60*24*30*1, which equates to 1 month
  • the starting point of the i-th train-test block is shifted by a step S = N2 data points compared to the starting point of the (i-1)-th train-test block
Walk-forward cross validation with sliding window

At this stage, there is still a limitation with the approach. Indeed, the labels of the data points in the training set that are temporally closest to the test set are very likely to have been computed with information from the first data points of the test set. This is a case of look-ahead bias that can lead to overfitting.
To fix this, we will do “purging”, which simply means removing from the training set the data points whose labels were computed with data points that are in the test set. Remember the object containing all the “patience” values that we created earlier while generating the target variable? Well, it was meant for this moment. We remove from each training set all the data points whose patience implies that information from the future (potentially from the test set) was used to compute their labels. In other words, we remove the data points occurring at a time t such that t + patience_t falls in the test set or beyond.

Walk-forward cross validation with purging
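Putting the sliding window and the purging together, a minimal sketch (assuming a `patience` sequence from the labelling step; names are ours):

```python
def walk_forward_splits(n, patience, train_size, test_size):
    """Yield (train_idx, test_idx) pairs for a sliding-window walk-forward
    scheme, purging training points whose labels use test-set information."""
    start = 0
    while start + train_size + test_size <= n:
        test_start = start + train_size
        # Purge: drop any t whose label needed data at or beyond test_start
        train_idx = [t for t in range(start, test_start)
                     if t + patience[t] < test_start]
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx
        start += test_size  # shift both windows by one test block
```

With N1 and N2 as chosen above, `train_size` would be 60*24*30*3 and `test_size` 60*24*30*1.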

Another key item to note is that, given the way our target variable was computed, consecutive labels are very likely to be redundant with each other because they are functions of the same returns. Our labels are thus non-IID, and training a model that assumes IID data on them effectively oversamples redundant observations.
In our complete in-house ML pipeline we have developed sampling techniques to make each training set closer to IID which tend to improve the generalisation error of the models. We won’t disclose them in this article because they are part of our IP, but we encourage the reader to think about this issue and explore potential solutions.

We are now ready to train and evaluate an XGBoost model for each train-test split that we have obtained. The main KPI defined earlier is a precision above (S/L + 2*f)/(T/P + S/L), which would guarantee a positive expected return. Therefore, we want each model to satisfy this criterion.

Results

At the end of the features and target variable optimisation step, we ended up with:

  • a take profit of 3.62%
  • a stop loss of 1.03%

Besides, we assume that we are trading on Binance and using BNB to pay for the fees, so that the fee rate is f = 0.075%.

According to the formula we developed earlier, we would need the precision p to be above 0.253 to guarantee that our expected return is positive.
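As a quick sanity check of that threshold (values taken from the results above):

```python
tp, sl, f = 0.0362, 0.0103, 0.00075  # optimised T/P and S/L, Binance fee with BNB discount
p_min = (sl + 2 * f) / (tp + sl)
print(round(p_min, 4))  # prints 0.2538, i.e. precision must exceed ~0.253
```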

Even though we are mostly focusing on precision, we decide to also measure the recall (where TP is the number of true positives and FN the number of false negatives), which indicates how many opportunities the model misses (the lower the recall, the more buying opportunities we miss):
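The corresponding formula:

```latex
\text{recall} = \frac{TP}{TP + FN}
```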

Since we are using 1 year of data, the parameters selected for the walk-forward cross-validation (train a model on the last 3 months, evaluate on the next month, retrain at the end of that month, etc.) yield 9 periods, and therefore 9 models.

Below are our results. The precision scores on the different test sets range from roughly 0.3 to 0.5, which meets our criterion. The recall scores indicate a significant number of false negatives, meaning that we miss a good number of buying opportunities. This is still better than taking too many incorrect bets that lose money.

Precision and recall scores on the test sets for each XGBoost model trained during the walk forward forecasting

Conclusion

Coming up with a successful machine learning trading strategy definitely requires some work.

In this post we have explained the development process of a trading signal for the BTCUSDT pair: an ML model that tells us when to buy, combined with an optimised take profit that tells us when to sell. The idea here was not to give a complete recipe that guarantees success, but to share some interesting ideas and concepts that we use internally, in combination with other advanced techniques, and that could inspire your own ML-based strategies.

The performance that we have managed to achieve should in theory guarantee us a positive expected return. But in reality, other steps are missing to better assess the profitability of the approach in a production environment.
For instance, this signal should be backtested on other periods, especially periods with specific market regimes (e.g. bull vs bear).
Furthermore, if we want to use it for other pairs, we should redo the entire procedure with a basket of all our pairs of interest, and make sure we find a configuration of parameters that meets our acceptance criteria.
More importantly, the criterion of success used here is deliberately simplistic, since our main goal was to focus on the ML part. One should rather use a metric like the Sharpe ratio, which takes into account not only the returns but also the risk.

A quick note about us

At AlphaGrow, we are dedicated to helping you grow your portfolio and boost your trading revenues thanks to an in-house, fully automated trading system hosted on a robust cloud infrastructure. Machine learning is only one of the tools at our disposal to achieve that mission; we also rely on other statistical methods, mathematics and computer science techniques.
Our team of passionate quantitative analysts is constantly working on new strategies. If you are interested in learning more about our strategies and want to exchange ideas, feel free to contact us (see below) 🙂 🚀

How to contact us: contact@alphagrow.io

Our website: https://alphagrow.io

Our cryptocurrency market data lake with 250+ pairs (BTC and USDT) over 4 years: https://alphagrow.io/data_sharing.html


AlphaGrow

AlphaGrow is a proprietary algorithmic trading firm specialized in cryptocurrencies. More information available on our website: https://alphagrow.io