Yet another architecture of ML crypto trading system.

Roman Pashkovsky



The purpose of this article is to share an approach to developing a machine-learning system for crypto trading. I hope it helps someone with the same task and interest. I would also welcome feedback from anyone who sees downsides to this approach and knows how they can be fixed.

So enjoy reading and have fun.


The task, in essence, is to create an algorithm that

  • makes a prediction of a trading action (e.g. buy or sell)
  • for a trading pair (e.g. BTC-USDT) on an exchange (e.g. Binance)
  • at some point in time (e.g. every 1 second or when some event occurs)
  • based on data (e.g. prices, volumes, news, etc.) from previous points in time (e.g. in the last 1 minute)
  • and places an order on an exchange based on the predicted action.

Of course, the algorithm can be extended to multiple pairs and/or on multiple exchanges, predicting an action and placing an order for each exchange-pair (e.g. for portfolio optimization and/or arbitrage), but here, for simplicity, one pair on one exchange is considered.

Let’s call the part of the algorithm that makes predictions a MODEL and the part that places orders an AGENT. The task can then be decomposed into a pipeline in which the MODEL receives data and predicts the action, and the AGENT receives the action and places an order on the exchange.
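The MODEL/AGENT split can be sketched in a few lines. This is only an illustration: `momentum_model`, `PaperAgent` and `run_step` are hypothetical names, the rule inside the model is a toy stand-in for any predictor, and the agent records orders instead of calling an exchange API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

ACTIONS = ("buy", "sell", "hold")


@dataclass
class Order:
    pair: str
    side: str  # "buy" or "sell"
    quantity: float


def momentum_model(prices: Sequence[float]) -> str:
    """Toy MODEL: a hand-written rule standing in for any predictor."""
    if len(prices) < 2 or prices[-1] == prices[0]:
        return "hold"
    return "buy" if prices[-1] > prices[0] else "sell"


class PaperAgent:
    """Toy AGENT: records orders instead of sending them to an exchange."""

    def __init__(self, pair: str, quantity: float) -> None:
        self.pair = pair
        self.quantity = quantity
        self.orders: List[Order] = []

    def act(self, action: str) -> None:
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        if action != "hold":
            self.orders.append(Order(self.pair, action, self.quantity))


def run_step(
    model: Callable[[Sequence[float]], str],
    agent: PaperAgent,
    prices: Sequence[float],
) -> str:
    """One pipeline step: data -> MODEL -> action -> AGENT -> order."""
    action = model(prices)
    agent.act(action)
    return action
```

Because the agent only sees an action string, the model can be swapped (rules, indicators, ML) without touching the order-placing code.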

Task decomposition

The decomposition allows us to divide the task into separate subtasks with different abstractions, goals and tools for their implementation (i.e. the MODEL focuses on the quality of predictions using various mathematical libraries, whereas the AGENT focuses on the reliability and speed of order placement using mainly engineering libraries).

A MODEL, in general, can be created in many ways, from manual rules and technical indicators to complex statistical and ML models.

An AGENT can be not only an algorithm that uses the API to place orders, but also a human using the UI of the exchange or both in parallel.


This article focuses on a hybrid time series (TS) and reinforcement learning (RL) algorithm (TS-RL), but any other algorithm can be used instead.

The algorithm consists of two parts:

  1. Forecasting of price(s) by TS model,
  2. Prediction of action by RL model using forecast from TS.
Model decomposition

As with the task decomposition, this split lets us use the appropriate tools for price forecasting (TS) and decision-making (RL), which differ in their training processes and libraries, at least for now. It also makes it possible to train a large TS model on a large amount of data and then use it when training the RL model, since RL is known to be sample-inefficient and can take a long time to train on large datasets.

If this reminds you of the ChatGPT training process (LLM + RL), I see the same analogy.

The RL model could also be trained on an offline buffer collected from the AGENT's actions on an exchange.


The agent is an adapter between predicted action and exchange orders.

Agent decomposition

It also dynamically manages orders and their states, i.e. it checks whether orders are open, closed, filled or cancelled, and closes them appropriately if the agent is stopped or killed for some reason.

It can also implement different reliability logic when:

  • the prediction is not available or outdated,
  • the exchange is not responding,
  • the network is too slow,
  • order rate exceeded,
  • etc.
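Two of these checks, stale predictions and order rate limits, can be sketched as follows. This is a minimal sketch under assumptions: `is_fresh`, `RateLimiter` and `may_place_order` are hypothetical names, the limiter is a generic sliding window, and the actual limits depend on the exchange.

```python
from collections import deque
from typing import Deque, Optional


def is_fresh(prediction_ts: Optional[float], now: float, max_age_s: float) -> bool:
    """True if a prediction exists and is not older than max_age_s seconds."""
    return prediction_ts is not None and (now - prediction_ts) <= max_age_s


class RateLimiter:
    """Sliding-window limiter: at most max_orders within window_s seconds."""

    def __init__(self, max_orders: int, window_s: float) -> None:
        self.max_orders = max_orders
        self.window_s = window_s
        self._sent: Deque[float] = deque()

    def try_acquire(self, now: float) -> bool:
        # Drop timestamps that have left the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) >= self.max_orders:
            return False
        self._sent.append(now)
        return True


def may_place_order(
    prediction_ts: Optional[float],
    limiter: RateLimiter,
    now: float,
    max_age_s: float,
) -> bool:
    # Freshness is checked first so that a stale prediction
    # does not consume any of the rate-limit budget.
    return is_fresh(prediction_ts, now, max_age_s) and limiter.try_acquire(now)
```

Passing `now` explicitly keeps the logic deterministic and easy to test; in a live agent it would come from `time.time()` or the exchange server time.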

MODEL in detail

In this article:

  • The Temporal Fusion Transformer (TFT) implemented in PyTorch Forecasting is chosen as the TS model;
  • The Proximal Policy Optimization (PPO) implemented in Stable Baselines 3 is chosen as the RL model.
Input data and targets

The input data consists of 3 parts:

  1. Limit Order Book (LOB) with ask/bid prices and quantities by depth,
  2. TRADES with price, quantity and direction of the trade,
  3. NEWS with title, summary, authors and tags.

The FORECAST, in general, is a 3D TxHxQ array where:

  • T — number of targets,
  • H — number of horizons,
  • Q — number of quantiles.

For example, with one mid-price target, four horizons of 15, 30, 45 and 60 seconds, and five quantiles of 0.1, 0.25, 0.5, 0.75 and 0.9, the array shape is 1x4x5. With zero-based indexing, item [0, 1, 2] is the prediction of the 0.5 quantile (median) of the mid-price at the 30-second horizon, and item [0, 3, 4] is the 0.9 quantile of the mid-price at the 60-second horizon.
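To make the indexing concrete, here is a minimal sketch that stores the forecast as nested Python lists and looks up items by name rather than by position. The names `TARGETS`, `HORIZONS_S`, `QUANTILES`, `make_forecast`, `forecast_index` and `lookup` are illustrative and not taken from the project code, which presumably works with tensors or arrays.

```python
from typing import List, Tuple

TARGETS = ["mid_price"]
HORIZONS_S = [15, 30, 45, 60]
QUANTILES = [0.1, 0.25, 0.5, 0.75, 0.9]

Forecast = List[List[List[float]]]  # indexed as [target][horizon][quantile]


def make_forecast(fill: float = 0.0) -> Forecast:
    """Create a T x H x Q forecast filled with a constant."""
    return [[[fill for _ in QUANTILES] for _ in HORIZONS_S] for _ in TARGETS]


def forecast_index(target: str, horizon_s: int, quantile: float) -> Tuple[int, int, int]:
    """Translate (target, horizon, quantile) into zero-based array indices."""
    return (
        TARGETS.index(target),
        HORIZONS_S.index(horizon_s),
        QUANTILES.index(quantile),
    )


def lookup(forecast: Forecast, target: str, horizon_s: int, quantile: float) -> float:
    t, h, q = forecast_index(target, horizon_s, quantile)
    return forecast[t][h][q]
```

With these definitions, the median mid-price at the 30-second horizon sits at indices (0, 1, 2), and the 0.9 quantile at the 60-second horizon at (0, 3, 4).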

The ACTION is a string selected from a set of actions (e.g. buy, sell and hold).

There are also new steps in the pipeline related to the training of models, data processing and metrics collecting.

Logic and implementation

The pipeline consists of 6 main stages:

  1. Blue (Mining) stage, where RAW DATA is mined from LOB, TRADE and NEWS using feedparser and websocket-client libraries;
  2. Green (Data) stage, where FEATURES are extracted from RAW DATA and DATASET is created from FEATURES and/or PREDICTIONS;
  3. Yellow (Model) stage, where MODELS (from PyTorch Forecasting and Stable Baselines 3) are trained/tuned using DATASETS and then stored in the model registry (MLflow with S3 and PostgreSQL);
  4. Orange (Prediction) stage, where MODELS loaded from the model registry make predictions using DATASETS;
  5. White (Metrics) stage, where METRICS are collected from the pipeline;
  6. Gray (Trading) stage, where AGENT trades on an exchange using websocket-client library and collects trading METRICS.

Stages communicate using Kafka topics.

All data from Kafka topics is automatically stored in a DATABASE (InfluxDB) by the Telegraf application running as a background process.

Configuration files are written in YAML format and parsed by the Hydra library.

TS-RL model logic is implemented as follows:

Logic and implementation for TS-RL
  1. RAW DATA is mined from LOB, TRADE and NEWS;
  2. FEATURES are extracted from RAW DATA;
  3. TS-DATASET is created from FEATURES;
  4. TS model (PyTorch Forecasting) trained/tuned using TS-DATASET;
  5. TS model makes FORECASTS using TS-DATASET;
  6. RL-DATASET is created from FORECASTS and/or FEATURES;
  7. RL model (Stable Baselines 3) trained/tuned using RL-DATASET;
  8. RL model makes prediction of ACTIONS using RL-DATASET.
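Step 6 can be illustrated with a small sketch. The article does not give the exact layout of the RL-DATASET, so the following is only one possible construction: each observation is a sliding window of (last price, forecast) pairs, flattened into a single row. The function name and the layout are assumptions.

```python
from typing import List, Sequence


def build_rl_observations(
    prices: Sequence[float],
    forecasts: Sequence[float],
    window: int,
) -> List[List[float]]:
    """Build one observation per time step from the last `window` samples.

    prices[i] is the last known price at step i and forecasts[i] is the
    TS forecast made at step i (hypothetical layout, for illustration).
    """
    if len(prices) != len(forecasts):
        raise ValueError("prices and forecasts must have the same length")
    if window < 1:
        raise ValueError("window must be at least 1")
    observations: List[List[float]] = []
    for end in range(window, len(prices) + 1):
        row: List[float] = []
        for price, forecast in zip(prices[end - window:end], forecasts[end - window:end]):
            row.extend([price, forecast])
        observations.append(row)
    return observations
```

For example, prices `[1, 2, 3]`, forecasts `[10, 20, 30]` and a window of 2 give two observations: `[1, 10, 2, 20]` and `[2, 20, 3, 30]`.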

The process consists of two loops: online and offline.

Process view

The offline loop is used for backtesting and/or training/tuning and collects data from DB.

The online loop is used for trading and collects data from Kafka topics.

  1. In the Blue (Mining) stage, RAW DATA is mined from data sources (LOB, TRADE, and NEWS) and sent to the appropriate Kafka topic. RAW DATA is also automatically stored in the DB from Kafka topics in the background.
  2. In the Green (Data) stage:

FEATURES are extracted:

  • In the online loop, from RAW DATA topics and sent to the FEATURES topic. FEATURES are also automatically stored in the DB from the FEATURES topic;
  • In the offline loop, from DB and stored in the DB;

DATASET is created:

  • In the online loop, from FEATURES and/or PREDICTIONS topic,
  • In the offline loop, from DB.

3. In the Yellow (Model) stage, MODELS are trained/tuned using DATASETS and stored in the model registry (MLFLOW);

4. In the Orange (Prediction) stage, PREDICTIONS are made by MODELS from MLFLOW using DATASET. Predictions are stored:

  • In the online loop, in a PREDICTIONS topic, and then automatically saved to the DB,
  • In the offline loop, in the DB.

5. In the White (Metrics) stage, METRICS are collected:

  • In the online loop, from FEATURES and PREDICTIONS topics and sent to the METRICS topic. METRICS are also automatically stored in the DB from the METRICS topic;
  • In the offline loop, from DB and stored in the DB.

6. In the Gray (Trading) stage, the AGENT trades using data from Kafka topics (especially FORECASTS topic) and/or from the DB and collects metrics into the METRICS topic.

All parts of the pipeline are wrapped into Docker containers and can be deployed with an orchestrator or manually on one or more hosts.

Deployment view

All containers can run on a single host with a 4-core CPU, 8GB RAM, 300GB HDD, and optionally 4GB GPU for training models.

It is convenient to train models on a separate host.

AGENT in detail

Agent logic strongly depends on an exchange and communication protocol.

Here are some considerations for implementing AGENT on the Binance exchange with the websocket protocol:

  1. Use mixed websocket with market and user data in one (see the question at StackOverflow);
  2. Dynamically check if you are not exceeding API limits and wait or throw an exception if you are close to the limits;
  3. Dynamically check if MODEL predictions are not out of date, else wait or throw an exception;
  4. Implement exception handlers;
  5. Set websocket default timeout and implement WebSocketTimeoutException handler to prevent it from hanging;
  6. Implement SIGINT, SIGTERM and KeyboardInterrupt handlers;
  7. Forcibly close open position in handlers (e.g. with market order).
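Items 6 and 7 can be sketched with the standard `signal` module. This is a minimal sketch, not the project's implementation: `GracefulShutdown` and `run_agent` are hypothetical names, and `close_position` stands for whatever routine cancels open orders and closes the position (e.g. with a market order).

```python
import signal
from typing import Callable


class GracefulShutdown:
    """Turns SIGINT/SIGTERM into a flag that the trading loop can poll."""

    def __init__(self) -> None:
        self.stop_requested = False

    def install(self) -> None:
        # signal.signal must be called from the main thread.
        # After this, Ctrl+C sets the flag instead of raising KeyboardInterrupt.
        signal.signal(signal.SIGINT, self._handle)
        signal.signal(signal.SIGTERM, self._handle)

    def request_stop(self) -> None:
        self.stop_requested = True

    def _handle(self, signum, frame) -> None:
        self.request_stop()


def run_agent(
    step: Callable[[], None],
    close_position: Callable[[], None],
    shutdown: GracefulShutdown,
) -> None:
    """Run trading steps until a stop is requested, then always clean up."""
    try:
        while not shutdown.stop_requested:
            step()
    finally:
        # Runs on normal stop and on any exception, including
        # KeyboardInterrupt when the handlers were not installed.
        close_position()
```

Note that SIGKILL cannot be caught, so a position left open by a hard kill still has to be detected and closed on the next start.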

MODEL training

There are many ways to organize model training.

First, one can separate retraining from tuning: tuning means training the model from the previous iteration, retraining means training the new model at each iteration.

Retraining vs tuning

Let’s describe the training process in time steps:

  1. TRAIN-1 ended and produced a new MODEL-1. MODEL-1 begins working.
  2. TRAIN-2 begins with DELAY-2 on DATA-2 determined by WINDOW-2 size using MODEL-1 for tuning or starting from scratch for retraining. MODEL-1 continues working.
  3. TRAIN-2 ended after TRAINING-2 time and produced a new MODEL-2. MODEL-1 ends working. MODEL-2 begins working.
  4. TRAIN-3 begins with DELAY-3 on DATA-3 determined by WINDOW-3 size using MODEL-2 for tuning or starting from scratch for retraining. MODEL-2 continues working. Note that there is a gap between DATA-2 and DATA-3 caused by short WINDOW-3 size or long TRAINING-2 time.
  5. TRAIN-3 ended after TRAINING-3 time and produced a new MODEL-3. MODEL-2 ends working. MODEL-3 begins working.
  6. TRAIN-4 begins with DELAY-4 on DATA-4 determined by WINDOW-4 size using MODEL-3 for tuning or starting from scratch for retraining. MODEL-3 continues working.
  7. TRAIN-4 ended after TRAINING-4 time and produced a new MODEL-4. MODEL-3 ends working.

Delays before training can be caused by: a human decision, waiting for some event to occur, specific task reasons, data drift, performance drops, etc.

Training data may or may not overlap depending on window size, training time and delay.
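The relation between window size, training time and delay can be worked out explicitly. The sketch below assumes that each data window ends at the moment its training run starts, and that each run starts DELAY after the previous run ends; under these assumptions the distance between consecutive data windows is TRAINING(i-1) + DELAY(i) - WINDOW(i). The class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Iteration:
    delay: float     # wait after the previous training run ends
    window: float    # length of the historical data window
    training: float  # duration of the training run


@dataclass
class Schedule:
    data_start: float
    data_end: float   # also the moment training starts (assumption)
    train_end: float  # the moment the new model starts working


def build_schedule(iterations: List[Iteration], t0: float = 0.0) -> List[Schedule]:
    schedule: List[Schedule] = []
    previous_end = t0
    for it in iterations:
        train_start = previous_end + it.delay
        train_end = train_start + it.training
        schedule.append(Schedule(train_start - it.window, train_start, train_end))
        previous_end = train_end
    return schedule


def gaps(schedule: List[Schedule]) -> List[float]:
    """Positive: gap between consecutive data windows. Negative: overlap."""
    return [cur.data_start - prev.data_end for prev, cur in zip(schedule, schedule[1:])]
```

For instance, with no delay, a 30-minute window and 15 minutes of training, consecutive windows overlap by 15 minutes; shrinking the window to 10 minutes and adding a 5-minute delay leaves a 10-minute gap.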

If delays are zero, the training pipeline is simplified: training of each new model starts immediately after training of the previous one ends.

Continual (without delay) retraining vs tuning

One could tune from some base model instead of the last one. This is also called fine-tuning, especially if the training data for the base model is significantly larger than the data for the new model.

Retraining vs fine-tuning

One could tune from the retrained model instead of the tuned one. This might eliminate the overfitting of the tuned model, especially for policy-based RL models.

Retraining vs tuning from retraining

There are an infinite number of ways to organize the training process, which can be based on human decisions, algorithms (including ML), or a combination of both.

Experiment setting

The experiment was conducted on the Binance exchange with the BTC-TUSD pair, which had zero maker/taker fees. The initial asset size was 240 TUSD, divided equally between BTC and TUSD.

MODEL settings:

  1. TS trained once on data from July 2022 to October 2022 with Volume Weighted Average Price (VWAP) and mid-price features and a 1-minute horizon median mid-price as a target.
  2. RL continually trained during May 2023 in two ways:
  • The first was retrained every ~15 minutes on a 30-minute historical window, with actual and predicted prices as features;
  • The second was tuned every ~4 hours on a 1-week historical window, with actual and predicted prices as features.

AGENT settings:

  • AGENT places limit good-til-cancelled (GTC) orders into the order book at a depth of 1-2 times the maximum spread observed in the last 30 minutes (to reduce the impact of market volatility);
  • If AGENT is not in position and the order is not filled in 3 seconds, then cancel the order;
  • If AGENT is in position and the order is not filled in 30 seconds then update the price of the order;
  • Order size is 400 µBTC (~10–12 TUSD depending on actual BTC price);
  • Prices are updated once per 1 second;
  • Model predictions become outdated if they are more than 30 minutes late.
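The two timeout rules above can be written as a small decision function. The thresholds come from the settings listed here; the function name, the returned labels and the use of `>=` at the boundary are assumptions made for illustration.

```python
CANCEL_AFTER_S = 3.0    # unfilled order while not in position
REPRICE_AFTER_S = 30.0  # unfilled order while in position


def order_decision(in_position: bool, filled: bool, age_s: float) -> str:
    """Decide what to do with one open limit order.

    Returns "done", "cancel", "reprice" or "wait".
    """
    if filled:
        return "done"
    if not in_position and age_s >= CANCEL_AFTER_S:
        return "cancel"
    if in_position and age_s >= REPRICE_AFTER_S:
        return "reprice"
    return "wait"
```

The asymmetry is deliberate in this reading of the settings: an entry order that does not fill quickly is simply dropped, while an exit order is kept alive and repriced so that the position eventually closes.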

Depending on the time, 8–10 AGENTS were running, each with a different RL model and a different order price depth in the order book.

LOB data is collected once per second.

MODEL predictions are made once every 3 seconds.

Experiment results

Here are some metrics collected from a month of trading by various AGENTS.

Return in TUSD by various models (last number is a depth of order price in the orderbook in units of spreads)
Price (upper), total return in TUSD (second), 1-week RL return (third), 30-minute RL return (fourth)
PNL from Binance UI

Main metrics:

  • PNL ~ 3 TUSD;
  • Initial asset size ~ 240 TUSD;
  • Final asset size ~ 243 TUSD;
  • Rate of return (ROR) ~ 3/240*100% = 1.25% per month;
  • Average return per round trade (buy-sell or sell-buy) ~ (27415.62–27415.42)*0.0004 ~ 80 µTUSD ~ 0.00008/~11*100% ~ 0.0007%;
  • Total number of trades ~ 16.98314 / 0.0004 + 16.98210 / 0.0004 ~ 42458+ 42455 = 84913 per month.

So, the metrics, although positive, are not so good, in particular:

  • You cannot trade with a fee higher than 0.0007%/2 = 0.00035% per trade, which is far less than the default 0.1% fee on most crypto exchanges;
  • There were also huge drawdowns from May 7 to May 10 that reduced returns to zero. There is no guarantee that these drawdowns will not occur in the future;
  • A ROR of 1.25% per month (0.0007% per round trade) requires a large initial asset (trading volume) to generate significant returns in absolute terms, even when trading on margin.
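The arithmetic behind these metrics can be reproduced in a few lines. All inputs are the figures quoted above; the variable names are illustrative, the ~11 TUSD order value is the approximate figure used in the text, and reading the two volume figures as total BTC bought and sold is an assumption.

```python
initial_assets_tusd = 240.0
pnl_tusd = 3.0
order_size_btc = 0.0004
order_value_tusd = 11.0  # approximate value of one 400 µBTC order

# Rate of return per month, in percent (about 1.25).
ror_pct = pnl_tusd / initial_assets_tusd * 100

# Average return per round trade in TUSD (about 0.00008, i.e. 80 µTUSD).
round_trade_return_tusd = (27415.62 - 27415.42) * order_size_btc

# The same return relative to the order value, in percent (about 0.0007).
round_trade_return_pct = round_trade_return_tusd / order_value_tusd * 100

# A round trade consists of two trades, so the fee per trade must stay
# below half of the round-trade return (about 0.00036 percent).
break_even_fee_pct = round_trade_return_pct / 2

# Number of trades: traded BTC volume divided by the order size.
# Assumption: the two figures are total BTC bought and total BTC sold.
volume_a_btc = 16.98314
volume_b_btc = 16.98210
total_trades = round(volume_a_btc / order_size_btc) + round(volume_b_btc / order_size_btc)
```

The break-even fee per trade comes out roughly 300 times smaller than a 0.1% fee, which is why the strategy only works on a zero-fee pair.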

Open questions

  1. The TS model was trained only on limit order book (LOB) data with depth 20, with Volume Weighted Average Price (VWAP) and mid-price as features and the median mid-price at the 1-minute horizon as the target.
  • Could other LOB-based features improve prediction quality (e.g. volumes, ratios, filters, technical indicators)?
  • Could features based on TRADES and/or NEWS improve prediction quality?
  • Could hyperparameters of TS model improve prediction quality?
  • Could horizon of prediction improve quality?

2. RL models were trained using the TS model forecast (price at a 1-minute horizon) and the last known price with a 1-minute historical window.

  • Could the size of the historical window improve trading quality?
  • Could the history of actions (i.e. buy or sell) and current position (i.e. long or short) improve trading quality?
  • Could hyperparameters of RL model improve trading quality?

3. Experiments were conducted in May 2023 with a static TS model trained on historical data from July 2022 to October 2022 and continual RL models trained in two ways: the first retrained every ~15 minutes with a 30-minute historical window, and the second tuned every ~4 hours with a 1-week historical window.

  • Could the size of historical data for training TS model improve prediction quality?
  • What if the TS model were trained continually, like the RL models?
  • Which is better: retraining or tuning models?
  • What historical window size to choose?
  • What early stopping conditions to choose (e.g. time limit, sample limit, train metrics, test/validation metrics)?

4. There are many AGENT parameters that could improve the quality of trading:

  • size of orders,
  • order price (i.e. how deep to place the order in the order book),
  • time to cancel/update an order if it is not filled yet,
  • stop-loss and take-profit ratios (or not use them at all).

5. The experiment lasted only 1 month. Will the metrics be stable over a longer period?

6. How to compare trained/tuned/retrained models (e.g. distance in space of predictions on sample data, model weights, training metrics, etc)?


Thank you for reading the article, I hope this material was useful and worth the time spent.

The code is available on GitHub under the MIT license.

More visualizations and results are available on Google Slides.