Yet another architecture of an ML crypto trading system.
Intro
The purpose of this article is to share an approach to developing a machine learning system for crypto trading. I hope it helps someone with the same task and interests. I would also welcome feedback from anyone who sees downsides to this approach and knows how they could be fixed.
So enjoy reading and have fun.
Task
The task, in essence, is to create an algorithm that
- makes a prediction of a trading action (e.g. buy or sell)
- for a trading pair (e.g. BTC-USDT) on an exchange (e.g. Binance)
- at some point in time (e.g. every 1 second or when some event occurs)
- based on data (e.g. prices, volumes, news, etc.) from previous points in time (e.g. in the last 1 minute),
- and places an order on an exchange based on the predicted action.
Of course, the algorithm can be extended to multiple pairs and/or on multiple exchanges, predicting an action and placing an order for each exchange-pair (e.g. for portfolio optimization and/or arbitrage), but here, for simplicity, one pair on one exchange is considered.
Let’s call the part of the algorithm that makes predictions a MODEL and the part that places orders an AGENT. Thus, the task can be decomposed into a pipeline, where the MODEL receives data and predicts the action, and the AGENT receives the action and places an order on the exchange.
The decomposition allows us to divide the task into separate subtasks with different abstractions, goals and implementation tools (i.e. the MODEL focuses on the quality of predictions using various mathematical libraries, whereas the AGENT focuses on the reliability and speed of order placement using mainly engineering libraries).
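The MODEL/AGENT split can be expressed as two small interfaces that meet only at the action. The sketch below is a minimal illustration, assuming a dict-shaped data snapshot and string actions; the names `Model`, `Agent`, `ThresholdModel`, `RecordingAgent` and `pipeline_step` are hypothetical and not taken from the article's code.

```python
from typing import List, Mapping, Protocol


class Model(Protocol):
    def predict(self, data: Mapping[str, float]) -> str: ...


class Agent(Protocol):
    def act(self, action: str) -> None: ...


class ThresholdModel:
    """Toy MODEL: a manual rule comparing a forecast with the last price."""

    def __init__(self, threshold: float = 0.0) -> None:
        self.threshold = threshold

    def predict(self, data: Mapping[str, float]) -> str:
        delta = data["forecast"] - data["last_price"]
        if delta > self.threshold:
            return "buy"
        if delta < -self.threshold:
            return "sell"
        return "hold"


class RecordingAgent:
    """Toy AGENT: records actions instead of placing real orders."""

    def __init__(self) -> None:
        self.orders: List[str] = []

    def act(self, action: str) -> None:
        if action != "hold":
            self.orders.append(action)


def pipeline_step(model: Model, agent: Agent, data: Mapping[str, float]) -> str:
    """MODEL predicts from data, AGENT turns the action into an order."""
    action = model.predict(data)
    agent.act(action)
    return action
```

Because the AGENT only sees actions, the MODEL can be swapped (manual rules, a TS-RL model, or a human) without touching the order-placement code.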
A MODEL, in general, can be created in many ways, from manual rules and technical indicators to complex statistical and ML models.
An AGENT can be not only an algorithm that uses the API to place orders, but also a human using the UI of the exchange, or both in parallel.
MODEL
This article focuses on a hybrid time series (TS) and reinforcement learning (RL) algorithm (TS-RL), but any other algorithm can be used instead.
The algorithm consists of two parts:
- forecasting of price(s) by the TS model;
- prediction of the action by the RL model using the forecast from TS.
As with the task decomposition, this split allows us to use the appropriate tools for price forecasting (TS) and decision-making (RL), because, at least for now, they differ in training process and libraries. It also gives us the opportunity to train a large TS model on a large amount of data and use it to train an RL model, which is known to be sample-inefficient and can take a long time to train on lots of data.
If you are now thinking of an analogy with the ChatGPT training process (LLM + RL), so am I.
The RL model could also be trained with an offline buffer collected from AGENT actions on an exchange.
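The hand-off between the two parts can be pictured as follows: the TS forecast is combined with recent actual prices into an RL observation, and a policy maps that observation to a discrete action index. This is a sketch with a stand-in forecaster and policy; in the article these roles are played by TFT and PPO, whose APIs are not reproduced here, and the relative-change encoding of the observation is an assumption.

```python
from typing import Callable, List, Sequence

ACTIONS = ("hold", "buy", "sell")


def build_observation(prices: Sequence[float], forecast: float) -> List[float]:
    """Express price history and forecast as relative changes vs. the last known price."""
    last = prices[-1]
    history = [p / last - 1.0 for p in prices]
    return history + [forecast / last - 1.0]


def ts_rl_step(prices: Sequence[float],
               forecaster: Callable[[Sequence[float]], float],
               policy: Callable[[List[float]], int]) -> str:
    forecast = forecaster(prices)               # TS part: price forecast
    obs = build_observation(prices, forecast)   # RL observation
    return ACTIONS[policy(obs)]                 # RL part: discrete action index


def naive_forecaster(prices: Sequence[float]) -> float:
    # Stand-in for the TS model: extrapolate the last price change.
    return prices[-1] + (prices[-1] - prices[-2])


def greedy_policy(obs: List[float]) -> int:
    # Stand-in for the RL policy: follow the sign of the expected change.
    expected = obs[-1]
    if expected > 0:
        return 1
    if expected < 0:
        return 2
    return 0
```

Keeping the observation relative to the last price is one common way to make a policy less sensitive to the absolute BTC price level; whether it helps here is untested.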
AGENT
The agent is an adapter between predicted actions and exchange orders.
It also dynamically manages orders and their states, i.e. it checks whether orders are open, closed, filled or cancelled, and closes them appropriately if the agent is stopped or killed for some reason.
It can also implement different reliability logic when:
- the prediction is not available or outdated,
- the exchange is not responding,
- the network is too slow,
- the order rate limit is exceeded,
- etc.
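One way to structure this reliability logic is a guard evaluated on every tick, returning what the AGENT should do. The sketch below is an assumption of how such checks could look; apart from the 30-minute prediction age taken from the experiment settings, the thresholds (`max_exchange_silence_s`, `max_latency_s`, `max_orders_per_window`) are hypothetical values, not the article's.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    TRADE = "trade"   # all checks passed
    WAIT = "wait"     # skip this tick and try again later
    STOP = "stop"     # stop trading and clean up open orders


@dataclass
class AgentState:
    now: float                        # current time, seconds
    prediction_time: Optional[float]  # time of the latest prediction, None if absent
    last_exchange_msg: float          # time of the last message from the exchange
    latency_s: float                  # measured round-trip latency
    orders_in_window: int             # orders sent in the current rate-limit window


@dataclass
class Limits:
    max_prediction_age_s: float = 30 * 60   # 30 minutes, as in the experiment settings
    max_exchange_silence_s: float = 10.0    # hypothetical
    max_latency_s: float = 1.0              # hypothetical
    max_orders_per_window: int = 50         # hypothetical; use the exchange's real limit


def check(state: AgentState, limits: Limits) -> Decision:
    """Decide whether the AGENT may place an order on this tick."""
    if state.now - state.last_exchange_msg > limits.max_exchange_silence_s:
        return Decision.STOP   # exchange is not responding
    if state.prediction_time is None:
        return Decision.WAIT   # prediction is not available yet
    if state.now - state.prediction_time > limits.max_prediction_age_s:
        return Decision.WAIT   # prediction is outdated
    if state.latency_s > limits.max_latency_s:
        return Decision.WAIT   # network is too slow
    if state.orders_in_window >= limits.max_orders_per_window:
        return Decision.WAIT   # order rate limit would be exceeded
    return Decision.TRADE
```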
MODEL in detail
In this article:
- The Temporal Fusion Transformer (TFT) implemented in PyTorch Forecasting is chosen as the TS model;
- Proximal Policy Optimization (PPO) implemented in Stable Baselines 3 is chosen as the RL model.
The input data consists of 3 parts:
- Limit Order Book (LOB) with ask/bid prices and quantities by depth;
- TRADES with price, quantity and direction of the trade;
- NEWS with title, summary, authors and tags.
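Two LOB features used later in the experiment, mid-price and VWAP, can be computed directly from a snapshot. A minimal sketch, assuming a snapshot stored as lists of (price, quantity) levels and assuming VWAP is taken over the bid and ask levels up to the given depth; the article does not specify its exact VWAP formula, and the `LOBSnapshot` type is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Level = Tuple[float, float]  # (price, quantity)


@dataclass
class LOBSnapshot:
    bids: List[Level]  # sorted best (highest price) first
    asks: List[Level]  # sorted best (lowest price) first


def mid_price(lob: LOBSnapshot) -> float:
    """Average of the best bid and best ask prices."""
    return (lob.bids[0][0] + lob.asks[0][0]) / 2.0


def vwap(lob: LOBSnapshot, depth: int = 20) -> float:
    """Quantity-weighted average price over both sides up to `depth` levels."""
    levels = lob.bids[:depth] + lob.asks[:depth]
    total_qty = sum(q for _, q in levels)
    return sum(p * q for p, q in levels) / total_qty
```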
The FORECAST, in general, is a 3D TxHxQ array, where:
- T is the number of targets,
- H is the number of horizons,
- Q is the number of quantiles.
For example, with one mid-price target, four horizons of 15, 30, 45 and 60 seconds, and five quantiles of 0.1, 0.25, 0.5, 0.75 and 0.9, the array shape is 1x4x5, where item [0, 1, 2] is the prediction of the 0.5 quantile (median) of the mid-price at the 30-second horizon and item [0, 3, 4] is the 0.9 quantile of the mid-price at the 60-second horizon.
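The indexing in this example can be checked with a small sketch. It builds the 1x4x5 forecast as nested lists (standard library only) filled with dummy values that encode their own position, and looks items up by horizon and quantile value rather than by raw index; the helper `forecast_at` is hypothetical.

```python
from typing import List, Tuple

HORIZONS_S = [15, 30, 45, 60]
QUANTILES = [0.1, 0.25, 0.5, 0.75, 0.9]

Forecast = List[List[List[float]]]  # [target][horizon][quantile]


def shape(f: Forecast) -> Tuple[int, int, int]:
    return (len(f), len(f[0]), len(f[0][0]))


def forecast_at(f: Forecast, target: int, horizon_s: int, quantile: float) -> float:
    """Look up a forecast value by horizon (seconds) and quantile level."""
    h = HORIZONS_S.index(horizon_s)
    q = QUANTILES.index(quantile)
    return f[target][h][q]


# Dummy values encoding their own indices: 10 * horizon_index + quantile_index.
forecast: Forecast = [[[10.0 * h + q for q in range(len(QUANTILES))]
                       for h in range(len(HORIZONS_S))]]
```

Looking items up by horizon and quantile value keeps call sites readable and avoids hard-coded positional indices.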
The ACTION is a string selected from a set of actions (e.g. buy, sell and hold).
There are also new steps in the pipeline related to model training, data processing and metrics collection.
The pipeline consists of 6 main stages:
- Blue (Mining) stage, where RAW DATA is mined from LOB, TRADES and NEWS using the feedparser and websocket-client libraries;
- Green (Data) stage, where FEATURES are extracted from RAW DATA and a DATASET is created from FEATURES and/or PREDICTIONS;
- Yellow (Model) stage, where MODELS (from PyTorch Forecasting and Stable Baselines 3) are trained/tuned using DATASETS and then stored in the model registry (MLflow with S3 and PostgreSQL);
- Orange (Prediction) stage, where MODELS loaded from the model registry make predictions using DATASETS;
- White (Metrics) stage, where METRICS are collected from the pipeline;
- Gray (Trading) stage, where the AGENT trades on an exchange using the websocket-client library and collects trading METRICS.
Stages communicate using Kafka topics.
All data from Kafka topics is automatically stored in a DATABASE (InfluxDB) by the Telegraf agent running in a background process.
Configuration files are written in YAML format and parsed by the Hydra library.
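The topic wiring between stages can be described declaratively, which also makes it easy to check that every consumed topic has a producer. A minimal sketch in plain Python (the real system keeps its configuration in YAML parsed by Hydra, whose schema the article does not show); all topic names are hypothetical, and the Yellow (Model) stage is left out because its output goes to the model registry rather than to a topic.

```python
from typing import Dict, List, Set

StageMap = Dict[str, Dict[str, List[str]]]

# stage -> topics it consumes and produces (topic names are hypothetical)
STAGES: StageMap = {
    "mining":     {"consumes": [], "produces": ["raw_lob", "raw_trades", "raw_news"]},
    "data":       {"consumes": ["raw_lob", "raw_trades", "raw_news", "predictions"],
                   "produces": ["features"]},
    "prediction": {"consumes": ["features"], "produces": ["predictions"]},
    "metrics":    {"consumes": ["features", "predictions"], "produces": ["metrics"]},
    "trading":    {"consumes": ["predictions"], "produces": ["metrics"]},
}


def produced(stages: StageMap) -> Set[str]:
    # These are also the topics the background collector persists to the DB.
    return {t for s in stages.values() for t in s["produces"]}


def dangling(stages: StageMap) -> Set[str]:
    """Topics some stage reads but no stage writes: a wiring error."""
    consumed = {t for s in stages.values() for t in s["consumes"]}
    return consumed - produced(stages)
```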
TS-RL model logic is implemented as follows:
1. RAW DATA is mined from LOB, TRADES and NEWS;
2. FEATURES are extracted from RAW DATA;
3. TS-DATASET is created from FEATURES;
4. The TS model (PyTorch Forecasting) is trained/tuned using TS-DATASET;
5. The TS model makes FORECASTS using TS-DATASET;
6. RL-DATASET is created from FORECASTS and/or FEATURES;
7. The RL model (Stable Baselines 3) is trained/tuned using RL-DATASET;
8. The RL model predicts ACTIONS using RL-DATASET.
The process consists of two loops: online and offline.
The offline loop is used for backtesting and/or training/tuning and collects data from the DB.
The online loop is used for trading and collects data from Kafka topics.
1. In the Blue (Mining) stage, RAW DATA is mined from data sources (LOB, TRADES and NEWS) and sent to the appropriate Kafka topic. RAW DATA is also automatically stored in the DB from the Kafka topics in the background.
2. In the Green (Data) stage:
   - FEATURES are extracted:
      - in the online loop, from the RAW DATA topics and sent to the FEATURES topic; FEATURES are also automatically stored in the DB from the FEATURES topic;
      - in the offline loop, from the DB and stored in the DB;
   - a DATASET is created:
      - in the online loop, from the FEATURES and/or PREDICTIONS topics;
      - in the offline loop, from the DB.
3. In the Yellow (Model) stage, MODELS are trained/tuned using DATASETS and stored in the model registry (MLFLOW);
4. In the Orange (Prediction) stage, PREDICTIONS are made by MODELS from MLFLOW using a DATASET. Predictions are stored:
   - in the online loop, in the PREDICTIONS topic, and then automatically saved to the DB;
   - in the offline loop, in the DB.
5. In the White (Metrics) stage, METRICS are collected:
   - in the online loop, from the FEATURES and PREDICTIONS topics and sent to the METRICS topic; METRICS are also automatically stored in the DB from the METRICS topic;
   - in the offline loop, from the DB and stored in the DB.
6. In the Gray (Trading) stage, the AGENT trades using data from Kafka topics (especially the FORECASTS topic) and/or from the DB and collects metrics into the METRICS topic.
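Across these stages, the online and offline variants differ only in where records come from and where they go. One way to keep the two loops consistent is to write each stage as a function over an iterable of records, so the same code runs on a DB replay (offline) or on a live consumer (online). A sketch under that assumption: `replay_from_db` is a hypothetical stand-in for a DB query, an online source would be a Kafka consumer wrapped as an iterator, and the spread feature is only an example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, float]


def spread_features(records: Iterable[Record]) -> Iterator[Record]:
    """Example feature stage: bid-ask spread per record, identical in both loops."""
    for r in records:
        yield {"ts": r["ts"], "spread": r["ask"] - r["bid"]}


def run_stage(source: Iterable[Record],
              stage: Callable[[Iterable[Record]], Iterator[Record]],
              sink: Callable[[Record], None]) -> int:
    """Drive a stage from any source into any sink; returns records written."""
    count = 0
    for out in stage(source):
        sink(out)   # online: produce to a topic; offline: write to the DB
        count += 1
    return count


def replay_from_db(rows: List[Record]) -> Iterator[Record]:
    """Hypothetical offline source: DB rows replayed in timestamp order."""
    return iter(sorted(rows, key=lambda r: r["ts"]))
```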
All parts of the pipeline are wrapped into Docker containers and can be deployed with an orchestrator or manually on one or more hosts.
All containers can run on a single host with a 4-core CPU, 8GB RAM, 300GB HDD, and optionally 4GB GPU for training models.
It is convenient to train models on a separate host.
AGENT in detail
Agent logic strongly depends on the exchange and the communication protocol.
Here are some considerations for implementing the AGENT on the Binance exchange with the websocket protocol:
- Use a mixed websocket stream with market and user data in one (see the question on StackOverflow);
- Dynamically check that you are not exceeding API limits, and wait or throw an exception if you are close to the limits;
- Dynamically check that MODEL predictions are not out of date; otherwise, wait or throw an exception;
- Implement exception handlers;
- Set a default websocket timeout and implement a WebSocketTimeoutException handler to prevent the agent from hanging;
- Implement SIGINT, SIGTERM and KeyboardInterrupt handlers;
- Forcibly close open positions in the handlers (e.g. with a market order).
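The shutdown-related points can be sketched with the standard library alone: SIGINT/SIGTERM set a stop flag, the loop exits, and a `finally` block flattens the position whatever happened, including unhandled exceptions. This is a minimal sketch; `step` and `close_position` are hypothetical callbacks standing in for the real websocket and order code. With websocket-client, the timeout point would presumably be covered by calling `websocket.setdefaulttimeout(...)` and catching `websocket.WebSocketTimeoutException` inside `step`.

```python
import signal
import threading
from typing import Callable


class GracefulShutdown:
    """Turns SIGINT/SIGTERM into a flag that the trading loop polls."""

    def __init__(self) -> None:
        self.stop_event = threading.Event()

    def install(self) -> None:
        # Must be called from the main thread (a Python restriction on signal.signal).
        signal.signal(signal.SIGINT, self._handle)
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum: int, frame: object) -> None:
        self.stop_event.set()


def run_agent(step: Callable[[], None],
              close_position: Callable[[], None],
              shutdown: GracefulShutdown) -> int:
    """Run trading steps until a stop is requested; always flatten the position."""
    steps = 0
    try:
        while not shutdown.stop_event.is_set():
            step()              # one tick: read stream, check guards, place/cancel orders
            steps += 1
    except KeyboardInterrupt:
        pass                    # only reached if install() was not called
    finally:
        close_position()        # e.g. cancel open orders, then send a market order
    return steps
```

In a real process one would create `GracefulShutdown()`, call `install()` in the main thread, and then pass it to `run_agent`; any other exception still propagates, but only after `close_position` has run.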
MODEL training
There are many ways to organize model training.
First, one can separate retraining from tuning: tuning means continuing to train the model from the previous iteration, while retraining means training a new model from scratch at each iteration.
Let’s describe the training process in time steps:
- TRAIN-1 ends and produces a new MODEL-1. MODEL-1 begins working.
- TRAIN-2 begins after DELAY-2 on DATA-2, determined by the WINDOW-2 size, either tuning MODEL-1 or starting from scratch (retraining). MODEL-1 continues working.
- TRAIN-2 ends after TRAINING-2 time and produces a new MODEL-2. MODEL-1 stops working and MODEL-2 begins working.
- TRAIN-3 begins after DELAY-3 on DATA-3, determined by the WINDOW-3 size, either tuning MODEL-2 or starting from scratch. MODEL-2 continues working. Note that there is a gap between DATA-2 and DATA-3 caused by a short WINDOW-3 size or a long TRAINING-2 time.
- TRAIN-3 ends after TRAINING-3 time and produces a new MODEL-3. MODEL-2 stops working and MODEL-3 begins working.
- TRAIN-4 begins after DELAY-4 on DATA-4, determined by the WINDOW-4 size, either tuning MODEL-3 or starting from scratch. MODEL-3 continues working.
- TRAIN-4 ends after TRAINING-4 time and produces a new MODEL-4. MODEL-3 stops working.
Delays before training can be caused by a human decision, waiting for some event to occur, task-specific reasons, data drift, performance drops, etc.
Training data may or may not overlap depending on window size, training time and delay.
If delays are zero, the training pipeline is simplified: each new model starts training immediately after the previous one has finished.
One could also tune from some base model instead of the last one. This is also called fine-tuning, especially if the training data for the base model is significantly larger than the data for the new model.
One could also tune from the latest retrained model instead of the latest tuned one. This might help to avoid overfitting of the tuned model, especially for a policy-based RL model.
There are an infinite number of ways to organize the training process, which can be based on human decisions, algorithms (including ML), or a combination of both.
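The timeline above can be simulated to see when data windows overlap or leave gaps. A sketch assuming each training run starts DELAY after the previous model became ready, trains on the WINDOW of data just before its start, and takes TRAINING time; MODEL-1 is assumed ready at `first_ready`, so `runs[0]` corresponds to TRAIN-2. All names are hypothetical and times are in arbitrary units.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    delay: float     # DELAY-i: wait after the previous model became ready
    window: float    # WINDOW-i: length of the training data window
    training: float  # TRAINING-i: duration of training


@dataclass
class Planned:
    data_start: float  # DATA-i begins here...
    data_end: float    # ...and ends when TRAIN-i starts
    ready: float       # MODEL-i starts working (and the previous model stops)


def schedule(runs: List[Run], first_ready: float = 0.0) -> List[Planned]:
    """Lay out consecutive training runs, each waiting for the previous model."""
    plan = []
    ready = first_ready
    for r in runs:
        start = ready + r.delay            # TRAIN-i begins after DELAY-i
        plan.append(Planned(start - r.window, start, start + r.training))
        ready = start + r.training         # next run waits for this model
    return plan


def data_gaps(plan: List[Planned]) -> List[float]:
    """> 0: unused data between consecutive windows; < 0: overlapping windows."""
    return [b.data_start - a.data_end for a, b in zip(plan, plan[1:])]
```

Under these assumptions the gap between consecutive windows equals TRAINING of one run plus DELAY of the next minus WINDOW of the next, which is exactly why a short window or a long training time leaves unused data. Whether a run tunes the previous model or retrains from scratch changes only its initial weights, not this schedule.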
Experiment setting
The experiment was conducted on the Binance exchange on the BTC-TUSD pair with zero maker/taker fees. The initial asset size was 240 TUSD, divided equally between BTC and TUSD.
MODEL settings:
- TS was trained once on data from July 2022 to October 2022, with Volume Weighted Average Price (VWAP) and mid-price as features and the median mid-price at a 1-minute horizon as the target;
- RL was continually trained during May 2023 in two ways:
  - the first was retrained every ~15 minutes on a 30-minute historical window, with actual and predicted prices as features;
  - the second was tuned every ~4 hours on a 1-week historical window, with actual and predicted prices as features.
AGENT settings:
- The AGENT places limit good-til-cancelled (GTC) orders into the order book at a depth of 1–2 times the maximum spread over the last 30 minutes (to reduce the impact of market volatility);
- If the AGENT is not in a position and the order is not filled within 3 seconds, the order is cancelled;
- If the AGENT is in a position and the order is not filled within 30 seconds, the order price is updated;
- Order size is 400 µBTC (~10–12 TUSD depending on the actual BTC price);
- Prices are updated once per second;
- Model predictions are considered outdated if they are more than 30 minutes old.
Depending on the time, 8–10 AGENTS were running with different RL models and different order book price depths.
LOB data is collected once per second.
The MODEL makes a prediction once every 3 seconds.
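The cancel/update rules from the AGENT settings can be captured in a small function called on every price update. A sketch, assuming the order's placement time is tracked locally and assuming "depth" means an offset of k maximum spreads away from the best price on the order's own side (the article does not spell out the exact formula); `OrderAction`, `manage_order` and `limit_price` are hypothetical names, while the 3 s and 30 s timeouts are the values listed above.

```python
from enum import Enum


class OrderAction(Enum):
    KEEP = "keep"        # nothing to do this tick
    CANCEL = "cancel"    # cancel the unfilled order
    REPRICE = "reprice"  # move the unfilled order to a new price


CANCEL_AFTER_S = 3.0    # not in position: cancel an unfilled order after 3 s
REPRICE_AFTER_S = 30.0  # in position: update the price of an unfilled order after 30 s


def manage_order(in_position: bool, filled: bool, age_s: float) -> OrderAction:
    """Decide what to do with the current order, given how long it has been open."""
    if filled:
        return OrderAction.KEEP
    if not in_position and age_s >= CANCEL_AFTER_S:
        return OrderAction.CANCEL
    if in_position and age_s >= REPRICE_AFTER_S:
        return OrderAction.REPRICE
    return OrderAction.KEEP


def limit_price(side: str, best_bid: float, best_ask: float,
                max_spread_30m: float, k: float) -> float:
    """Assumed interpretation: place the order k max-spreads behind the best price."""
    if side == "buy":
        return best_bid - k * max_spread_30m
    if side == "sell":
        return best_ask + k * max_spread_30m
    raise ValueError(f"unknown side: {side}")
```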
Experiment results
Here are some metrics collected over a month of trading by various AGENTS.
Main metrics:
- PNL ~ 3 TUSD;
- Initial asset size ~ 240 TUSD;
- Final asset size ~ 243 TUSD;
- Rate of return (ROR) ~ 3/240*100% = 1.25% per month;
- Average return per round trade (buy-sell or sell-buy) ~ (27415.62 - 27415.42) * 0.0004 ~ 80 µTUSD ~ 0.00008 / 11 * 100% ~ 0.0007% (relative to a ~11 TUSD order);
- Total number of trades ~ 16.98314 / 0.0004 + 16.98210 / 0.0004 ~ 42458 + 42455 = 84913 per month.
So the metrics, although positive, are not very good. In particular:
- You cannot trade with a fee higher than 0.0007% / 2 = 0.00035% per order (a round trade consists of two orders), which is significantly less than the default 0.1% fee on most crypto exchanges;
- There were also huge drawdowns from May 7 to May 10 that reduced returns to zero. There is no guarantee that these drawdowns will not occur in the future;
- A ROR of 1.25% per month (0.0007% per round trade) requires a large initial asset (trading volume) to generate significant returns in absolute terms, even when trading on margin.
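The figures above can be re-derived in a few lines. This check uses only the numbers quoted in the lists, with the ~11 TUSD order notional taken as 0.0004 BTC at roughly 27,400 TUSD; it also makes explicit that a round trade consists of two fills, so the break-even fee per fill is half the per-round-trade return, and that the trade count and per-trade return are consistent with the ~3 TUSD PNL.

```python
ORDER_SIZE_BTC = 0.0004     # 400 µBTC per order
INITIAL_TUSD = 240.0
PNL_TUSD = 3.0
ORDER_NOTIONAL_TUSD = 11.0  # ~0.0004 BTC at ~27,400 TUSD

ror_pct = PNL_TUSD / INITIAL_TUSD * 100                     # monthly rate of return, %
round_trade_pnl = (27415.62 - 27415.42) * ORDER_SIZE_BTC    # TUSD per round trade
round_trade_pct = round_trade_pnl / ORDER_NOTIONAL_TUSD * 100
breakeven_fee_pct = round_trade_pct / 2   # a round trade pays the fee twice
fee_gap = 0.1 / breakeven_fee_pct         # how many times below a 0.1% fee

fills = round(16.98314 / ORDER_SIZE_BTC) + round(16.98210 / ORDER_SIZE_BTC)
implied_pnl = fills / 2 * round_trade_pnl  # round trades x return per round trade

print(f"ROR {ror_pct:.2f}%/month, {round_trade_pct:.4f}% per round trade")
print(f"break-even fee {breakeven_fee_pct:.5f}% per fill, {fee_gap:.0f}x below 0.1%")
print(f"{fills} fills, implied PNL ~{implied_pnl:.1f} TUSD")
```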
Open questions
1. The TS model was trained only on limit order book (LOB) data with depth 20, with Volume Weighted Average Price (VWAP) and mid-price as features and the median mid-price at the 1-minute horizon as the target.
   - Could other LOB-based features improve prediction quality (e.g. volumes, ratios, filters, technical indicators)?
   - Could features based on TRADES and/or NEWS improve prediction quality?
   - Could tuning the hyperparameters of the TS model improve prediction quality?
   - Could a different prediction horizon improve quality?
2. The RL models were trained using the TS model forecast (price at a 1-minute horizon) and the last known price with a 1-minute historical window.
   - Could a different historical window size improve trading quality?
   - Could the history of actions (i.e. buy or sell) and the current position (i.e. long or short) improve trading quality?
   - Could tuning the hyperparameters of the RL model improve trading quality?
3. The experiments were conducted in May 2023 with a static TS model trained on historical data from July 2022 to October 2022 and with continual RL models of two kinds: the first retrained every ~15 minutes with a 30-minute historical window, and the second tuned every ~4 hours with a 1-week historical window.
   - Could more historical data for training the TS model improve prediction quality?
   - What if continual TS models were used, like the RL ones?
   - Which is better: retraining or tuning models?
   - What historical window size should be chosen?
   - Which early stopping conditions should be chosen (e.g. time limit, sample limit, train metrics, test/validation metrics)?
4. There are many AGENT parameters that could improve the quality of trading:
   - size of orders,
   - order price (i.e. how deep to place the order in the order book),
   - time to cancel/update an order if it is not filled yet,
   - stop-loss and take-profit ratios (or not using them at all).
5. The experiment lasted only 1 month. Will the metrics be stable over a longer period?
6. How should trained/tuned/retrained models be compared (e.g. distance in the space of predictions on sample data, model weights, training metrics, etc.)?
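As one illustration of the first option in question 6, two models can be compared on the same sample of inputs by how often their actions agree and how far apart their forecasts are. A minimal sketch; the choice of these two metrics is an assumption, not something evaluated in the article.

```python
from typing import Sequence


def action_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of sample points where two policies choose the same action."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty action sequences of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_abs_diff(f: Sequence[float], g: Sequence[float]) -> float:
    """Mean absolute distance between two forecast series on the same inputs."""
    if len(f) != len(g) or not f:
        raise ValueError("need two non-empty forecast sequences of equal length")
    return sum(abs(x - y) for x, y in zip(f, g)) / len(f)
```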
Outro
Thank you for reading the article. I hope this material was useful and worth the time spent.
The code is available on GitHub under the MIT license.
More visualizations and results are available on Google Slides.