DALLE-2 generated image.

Supervised learning is all you need for algorithmic trading

Leonardo Kanashiro Felizardo
Published in Geek Culture · 9 min read · Feb 26, 2023


Loss and profit across different methods for the asset: XMR/BTC

Recently, researchers have been applying complex decision-making techniques to a problem that, as we will argue, can be better solved with supervised learning. In the past decade, there has been a remarkable increase in methods that combine deep learning with Reinforcement Learning (RL). These advances, especially in games, have generated tremendous hype around deep reinforcement learning (DRL). One application that has received a lot of attention is the use of RL methods for financial asset trading.

The core strategy of algorithmic trading is to use past price-related variables to automatically decide whether to buy or sell an asset, i.e., whether to assume a long or short position. Algorithmic trading typically involves active management of trades, with the investor changing positions multiple times within a short period. The discrete decision variable is commonly described as:

$$X_t \in \{-1, +1\},$$

where $X_t$ is the decision variable, encoded as −1 (short) or +1 (long). Managing a single asset is relatively straightforward in terms of the required decision variable, whereas portfolio management can be more complex. Even then, if we actively manage the portfolio, we can break the problem down into multiple single-asset trades. Additionally, strategies such as pairs trading also rely on past time-series data for each decision.

To solve such decision-making processes, various mathematical models can be used to formulate the problem, which is then solved with methods such as RL. Stochastic control is another possible way to frame these problems, and its notation is adopted here. The goal is to maximize profit, i.e., the return accumulated by the agent over a given period after a sequence of decisions, which can be expressed (in one plausible stochastic-control form) as:
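$$\max_{\pi} \; \mathbb{E}\left[\sum_{t=0}^{T-1} C_t\big(S_t,\, X_t^{\pi}(S_t),\, W_{t+1}\big)\right],$$

where $C_t$ is the per-period contribution (the realized return net of costs), $S_t$ the state, $X_t^{\pi}$ the decision prescribed by policy $\pi$, and $W_{t+1}$ the exogenous information introduced below; the exact functional form shown here is an illustrative sketch in the adopted notation.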

The decisions are made by a policy that depends on the current state; the policy can be either tabular or approximated, and when a parametric approximation is employed, its parameters θ must be estimated.

In financial active asset trading, the state can be defined by multiple state variables. The most common state variables used are related to the past price time-series. This type of state is called an informative state because it cannot be affected by the agent’s decisions.

The usual assumption is that the investor cannot influence price returns with their position. As a result, the transition of states depends solely on exogenous information, represented by W. The exogenous information may depend on the current state, but it does not depend on the agent’s decision. The following relation expresses this concept (written here in a generic state-transition form):
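$$S_{t+1} = S^{M}\!\left(S_t,\, X_t,\, W_{t+1}\right), \qquad W_{t+1} \;\text{independent of}\; X_t,$$

where $S^{M}$ denotes the state-transition function: the next state is driven by the current state, the decision, and the exogenous information, while the exogenous information itself is unaffected by the decision. This is one plausible way to write the transition in the adopted notation.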

In single-asset trading, this assumption usually holds, and the exogenous information is further incorporated into the state variable.

This type of problem is called a contextual bandit problem, which is well explained in Sutton and Barto’s book; recent works also classify active trading as a contextual bandit problem [1], [2]. However, recent works using RL for trading [3], [4] usually employ state-of-the-art methods designed for the full RL problem, which were not developed to deal with contextual bandits. In this context, a technique such as bootstrapped Thompson sampling would be more appropriate.

Bootstrapped Thompson Sampling is a good heuristic for balancing exploration and exploitation in contextual bandit problems, mainly because it helps to minimize the profit lost due to a lack of knowledge about the distribution of the stochastic variables. However, this is not the primary issue in this context. The exploration-exploitation dilemma matters most when it is hard to determine the best decision to make in the present because changes in a physical state can affect future contributions. In trading, there is no physical state altered by the agent’s decisions, which makes it easier to find the best decision at a given moment.
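For concreteness, here is a minimal, illustrative Python sketch of bootstrapped Thompson sampling for this contextual bandit: an ensemble of classifiers is fitted on bootstrap resamples of past (state, best-action) pairs, and at each step one member is drawn at random to pick the action. The classifier choice, ensemble size, and update schedule are assumptions made for the sketch, not the exact implementation used in the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class BootstrappedThompsonSampling:
    """Ensemble of classifiers, each fitted on a bootstrap resample of past (state, action-label) pairs."""

    def __init__(self, n_models=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.models = [LogisticRegression(max_iter=1000) for _ in range(n_models)]
        self.fitted = False

    def fit(self, states, labels):
        """states: (n, window) array of past-return windows; labels: (n,) array in {-1, +1}."""
        n = len(states)
        for model in self.models:
            idx = self.rng.integers(0, n, size=n)  # bootstrap resample (assumed to contain both classes)
            model.fit(states[idx], labels[idx])
        self.fitted = True

    def act(self, state):
        """Thompson step: draw one ensemble member at random and follow its prediction."""
        if not self.fitted:
            return int(self.rng.choice([-1, 1]))  # pure exploration before any data is seen
        model = self.models[self.rng.integers(len(self.models))]
        return int(model.predict(state.reshape(1, -1))[0])
```

In a live loop, one would call `act` on the current window of returns, observe the realized return, append the new labeled example, and periodically refit the ensemble.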

When transaction costs are not considered, the best decision is simply the one that yields a positive return. Therefore, the policy can be found by training a simple supervised learning algorithm. With a discrete decision variable, we only need a good time-series classifier, for which we recommend the review by Hassan Ismail Fawaz et al. Even with transaction costs, a simple heuristic can overcome the problem, such as labeling the best action according to the future cumulative return over a window of size F.

One way to write this labeling heuristic mathematically is:
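$$y_t =
\begin{cases}
+1, & \text{if } \sum_{i=1}^{F} r_{t+i} > \delta,\\
-1, & \text{otherwise,}
\end{cases}$$

where $r_{t+i}$ are the future price returns, $F$ is the size of the look-ahead window, and $\delta$ is the transaction cost. The threshold form above is a plausible reconstruction of the heuristic described in the text.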

Let’s test our premise on the time-series of several cryptocurrency assets.

Experiment

We created a simulated financial-market environment that assumes the investor (agent) cannot affect future prices.

In this environment, we run some of the most common state-of-the-art RL methods to try to find a policy that generates profit. The RL methods are implemented with Stable Baselines.
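As an illustration of the overall setup (a minimal sketch only; the environment class, reward shaping, and hyperparameters below are assumptions, not the repository's actual code), a Gym-style environment whose observation is the window of the last 50 returns can be trained directly with Stable-Baselines3:

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

class SingleAssetTradingEnv(gym.Env):
    """Toy trading environment: observation = last `window` returns, action = short (0) or long (1)."""

    def __init__(self, returns, window=50, cost=0.002):
        super().__init__()
        self.returns = np.asarray(returns, dtype=np.float32)
        self.window, self.cost = window, cost
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(window,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)  # 0 -> short (-1), 1 -> long (+1)

    def _obs(self):
        return self.returns[self.t - self.window:self.t]

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t, self.position = self.window, 0.0
        return self._obs(), {}

    def step(self, action):
        position = -1.0 if action == 0 else 1.0
        # Contribution: position times the next return, minus a cost whenever the position changes.
        reward = float(position * self.returns[self.t] - self.cost * abs(position - self.position))
        self.position = position
        self.t += 1
        terminated = self.t >= len(self.returns)
        return self._obs(), reward, terminated, False, {}

rng = np.random.default_rng(0)
price_returns = rng.normal(0.0, 0.01, size=5_000)            # placeholder return series
model = PPO("MlpPolicy", SingleAssetTradingEnv(price_returns), verbose=0)
model.learn(total_timesteps=10_000)                          # A2C or DQN can be swapped in the same way
```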

We also include two other techniques for comparison:

  1. The bootstrapped Thompson sampling
  2. A ResNet-LSTM actor (RSLSTM-A), which is a time-series classifier (a rough sketch of the idea follows this list).
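The sketch below only illustrates the general idea behind such a ResNet-LSTM time-series classifier: 1-D convolutional residual blocks extract features from the window of past returns, a recurrent layer summarizes them, and a two-class head outputs the long/short decision. The layer sizes and exact wiring are assumptions, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """1-D convolutional residual block in the spirit of ResNet classifiers for time series."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=8, padding="same"), nn.BatchNorm1d(out_ch), nn.ReLU(),
            nn.Conv1d(out_ch, out_ch, kernel_size=5, padding="same"), nn.BatchNorm1d(out_ch), nn.ReLU(),
            nn.Conv1d(out_ch, out_ch, kernel_size=3, padding="same"), nn.BatchNorm1d(out_ch),
        )
        self.skip = nn.Conv1d(in_ch, out_ch, kernel_size=1)  # 1x1 conv so the shortcut matches channels
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

class RSLSTMActor(nn.Module):
    """Illustrative ResNet-LSTM actor: residual conv blocks -> LSTM -> long/short logits."""

    def __init__(self, hidden=64):
        super().__init__()
        self.blocks = nn.Sequential(ResidualBlock(1, 64), ResidualBlock(64, 128))
        self.lstm = nn.LSTM(input_size=128, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # two classes: short / long

    def forward(self, x):                          # x: (batch, window) of past returns
        z = self.blocks(x.unsqueeze(1))            # -> (batch, channels, window)
        _, (h, _) = self.lstm(z.transpose(1, 2))   # feed the time dimension to the LSTM
        return self.head(h[-1])                    # logits over {short, long}

logits = RSLSTMActor()(torch.randn(4, 50))         # e.g. a batch of 4 windows of 50 returns
```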

With all these techniques in place, our premise would be refuted only if RL performed better in most cases.

State variables and contribution

The state variable here uses only the past 50 close-price returns. This keeps the test fair and, if anything, favors RL, for the stability reasons discussed next.

RL suffers from stability issues when dealing with noisy data, and as the state variable grows in dimensionality, it becomes even harder for RL to find a good policy. Therefore, we use a very simple state variable for the comparison, namely the window of the last M past returns:
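$$S_t = \left(r_{t-M+1},\, r_{t-M+2},\, \ldots,\, r_t\right),$$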

where $r$ is the asset price return and $M$ is the lookback window size (here, $M = 50$).

We assume that the contribution at each step depends on the decision and on the state variable (our information variable); in one plausible form, penalizing position changes by the transaction cost:
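$$C_t = X_t \, r_{t+1} \;-\; \delta \, \left| X_t - X_{t-1} \right|,$$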

where δ = 0.002 is the transaction cost (a value commonly used in the literature).

Data

We chose cryptocurrencies for two main reasons:

  1. Data availability: there is a plentiful supply of free, high-frequency data. Additionally, the time-series is not affected by weekends or market-close events, as cryptocurrencies can be traded 24/7.
  2. Speculative nature and the role of human behavior: the price of a cryptocurrency is more susceptible to fluctuations driven by human behavior and market sentiment. This contrasts with assets such as bonds or commodities, which are influenced less by human behavior and more by economic factors or supply-and-demand dynamics. By focusing on a more speculative asset, the role of human behavior in price formation should be more pronounced, so past prices should carry more information about future states. This, in turn, makes it easier to surpass a simple buy-and-hold strategy, since decisions can be better informed.

The time-series data can be freely obtained from many sources, for example:
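One of many possibilities (the exchange, pair, and library below are only an illustration, not necessarily the sources used in the study) is to pull hourly candles with the ccxt library and turn them into the return series used as state variables:

```python
import ccxt
import numpy as np

exchange = ccxt.kraken()                                               # any ccxt-supported exchange works
candles = exchange.fetch_ohlcv("BTC/USD", timeframe="1h", limit=720)   # [timestamp, open, high, low, close, volume]

close = np.array([c[4] for c in candles], dtype=float)
returns = np.diff(close) / close[:-1]                                  # simple close-to-close price returns
```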

Results

The following table presents the results, comparing RL against the two proposed contenders:

Table with the annualized returns for different assets comparing different strategies.

The results in the table are expressed as the relative percentage increase (or decrease) with respect to the buy-and-hold strategy.

To compare the algorithms, we rank the models on each asset and average the ranks. We then compare the rankings with the Wilcoxon signed-rank test to verify the statistical significance of the differences. On the data employed, the test indicates that RSLSTM-A is significantly better than all the RL methods. We could not establish a statistically significant difference between RSLSTM-A and BTS; however, the average rank of RSLSTM-A is considerably lower (i.e., better).
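The test itself is readily available in SciPy; the sketch below shows how such a paired comparison can be run (the numbers are placeholders, not the study's actual per-asset results):

```python
from scipy.stats import wilcoxon

# Placeholder per-asset scores (e.g. annualized returns relative to buy and hold) for two methods.
rslstm_a = [0.12, 0.35, -0.02, 0.20, 0.08, 0.15]
dqn      = [0.05, 0.10, -0.10, 0.02, -0.03, 0.01]

# Paired, two-sided Wilcoxon signed-rank test on the per-asset differences.
stat, p_value = wilcoxon(rslstm_a, dqn)
print(f"W = {stat:.3f}, p = {p_value:.4f}")  # a small p-value rejects 'no difference between the methods'
```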

The following figures show a sample of the profit-and-loss evolution for the tested assets. The non-RL techniques predominate, consistently surpassing the buy-and-hold (B&H) strategy:

Loss and profit across different methods for the asset: DASH/USD
Loss and profit across different methods for the asset: BTC/USD

The presented results suggest that RL may not be necessary, or even an optimal choice, for trading algorithms. Two potential causes are at play:

  1. RL is prone to convergence difficulties when working with noisy data. A necessary step would therefore be to preprocess the data, which is challenging because it is hard to separate the relevant signal from the noise.
  2. It is challenging to accurately assess the generalization capacity, as the training data is limited in terms of the number of steps. Papers in this area often use a validation dataset to monitor the generalization capacity as the model trains, which aligns more with the supervised learning framework.

For a deeper analysis of the results and more details regarding the contextual bandit framework, check our paper.

Code

The code is available at:

To reproduce and extend this work, just run the training and test for each asset in single-run.py, change the asset as you desire, and have fun.

Reference

To reference our work, please use:

@article{FELIZARDO2022117259,
title = {Outperforming algorithmic trading reinforcement learning systems: A supervised approach to the cryptocurrency market},
journal = {Expert Systems with Applications},
pages = {117259},
year = {2022},
issn = {0957-4174},
doi = {https://doi.org/10.1016/j.eswa.2022.117259},
url = {https://www.sciencedirect.com/science/article/pii/S0957417422006339},
author = {Leonardo Kanashiro Felizardo and Francisco Caio {Lima Paiva} and Catharine {de Vita Graves} and Elia Yathie Matsumoto and Anna Helena Reali Costa and Emilio Del-Moral-Hernandez and Paolo Brandimarte},
keywords = {Deep neural network, Reinforcement learning, Stock trading, Time series classification, Criptocurrencies},
abstract = {The interdisciplinary relationship between machine learning and financial markets has long been a theme of great interest among both research communities. Recently, reinforcement learning and deep learning methods gained prominence in the active asset trading task, aiming to achieve outstanding performances compared with classical benchmarks, such as the Buy and Hold strategy. This paper explores both the supervised learning and reinforcement learning approaches applied to active asset trading, drawing attention to the benefits of both approaches. This work extends the comparison between the supervised approach and reinforcement learning by using state-of-the-art strategies with both techniques. We propose adopting the ResNet architecture, one of the best deep learning approaches for time series classification, into the ResNet-LSTM actor (RSLSTM-A). We compare RSLSTM-A against classical and recent reinforcement learning techniques, such as recurrent reinforcement learning, deep Q-network, and advantage actor-critic. We simulated a currency exchange market environment with the price time series of the Bitcoin, Litecoin, Ethereum, Monero, Nxt, and Dash cryptocurrencies to run our tests. We show that our approach achieves better overall performance, confirming that supervised learning can outperform reinforcement learning for trading. We also present a graphic representation of the features extracted from the ResNet neural network to identify which type of characteristics each residual block generates.}
}
