Deep Reinforcement Learning for Crypto Trading

Part 3: Training

Alex K
Published in Coinmonks · 13 min read · May 17, 2024

Photo by Alex Knight on Unsplash

Disclaimer: The information provided herein does not constitute financial advice. All content is presented solely for educational purposes.

Introduction

This is the third part of my blog post series on reinforcement learning for crypto trading.

This article explains the reinforcement learning algorithm, the neural network architecture, and the training process.

As I mentioned in Part 0: Introduction, my main goal is to connect with potential employers or investors and ultimately become a professional quant.

Resources

Reinforcement learning library

The essential question is which library to use for reinforcement learning (RL). I decided to go with RLlib from Ray. RLlib is an open-source library that supports production-level, highly distributed RL workloads while maintaining unified and simple APIs. In my opinion, it has wider functionality, more implemented RL algorithms, and better support than other RL libraries. RLlib supports both of the most popular deep-learning frameworks: PyTorch and TensorFlow. RLlib algorithms (such as “PPO” or “IMPALA”) let you set the num_workers config parameter so that your workloads can run on hundreds of CPUs, parallelizing and speeding up learning. RLlib also auto-vectorizes gymnasium.Envs via the num_envs_per_worker config parameter, so environment workers can batch observations and significantly speed up the action-computing forward pass. RLlib’s API stack, built on top of Ray, offers off-the-shelf, highly distributed algorithms, policies, loss functions, and default models (including the option to auto-wrap a neural network with an LSTM or attention net).
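
As a minimal sketch (using the same .rollouts() settings that appear in the full config later in this post), the two parallelism knobs simply multiply:

# Sketch only: how RLlib's rollout parallelism multiplies (Ray 2.x API).
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .rollouts(
        num_rollout_workers=15,       # 15 CPU worker processes collect experience
        num_envs_per_worker=8,        # each worker steps 8 vectorized env copies
        rollout_fragment_length=168,  # steps collected per env before sending a batch
    )
)
# 15 workers * 8 envs * 168 steps = 20,160 environment steps per training batch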

RL algorithm

Due to the ability of RL agents to outperform their human competitors in many challenging tasks, the development of new RL algorithms has gained substantial attention in recent years. Many RL algorithms are now available, to name a few: DDQN, VPG, A3C, SAC, TRPO, DDPG, TD3, ACKTR, MPO, PPO. A full list of RL algorithms supported by RLlib can be found here. No universal RL algorithm outperforms all other algorithms in every possible RL environment. Instead, some algorithms are designed for fast learning in fully observable environments with dense rewards, while others focus on slower but consistent learning in partially observable environments with sparse rewards, which is our case.

Algorithms classification

After deciding which library to use, the next question was what type of algorithm best suited our needs. RL algorithms can be divided into model-free vs. model-based and on-policy vs. off-policy categories. Crypto markets are not like board games such as chess, where all the rules and their outcomes can be hard-coded: we don’t have a model for them. This lets us filter out all model-based algorithms, such as Dyna-Q, Monte Carlo Tree Search (MCTS), Prioritized Sweeping, Adaptive Dynamic Programming (ADP), and Model Predictive Control (MPC).

On-policy and off-policy refer to two broad categories of reinforcement learning algorithms. The key difference between them lies in how they generate and learn from the data:

  1. On-Policy Reinforcement Learning Algorithms: On-policy algorithms learn from data generated by the current policy. In other words, they learn the value function under the current policy and improve the policy based on the learned value function (keeping only a relatively small buffer of recently collected experience). For instance, SARSA (State-Action-Reward-State-Action) and PPO (Proximal Policy Optimization) are on-policy algorithms. They use the current policy to decide their actions and then update it based on the learned values.
  2. Off-Policy Reinforcement Learning Algorithms: Off-policy algorithms, on the other hand, can learn from a policy different from the one they are currently executing. This means they can learn from experiences collected under past policies (a relatively larger replay buffer), which allows the agent to learn from a wide range of experiences, not just the most recent ones. An example of an off-policy algorithm is Deep Q-Learning, where the learning process looks at the maximum reward achievable in the next state, irrespective of the action taken under the current policy.

In summary, on-policy algorithms learn the value function from actions taken under the policy currently in use, whereas off-policy algorithms can learn from actions taken outside of the current policy.
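
To make the distinction concrete, here is a toy tabular sketch (illustrative only, not part of the trading code): SARSA bootstraps from the action actually chosen by the current policy, while Q-learning bootstraps from the greedy action regardless of what the behavior policy did.

import numpy as np

# Toy tabular value updates, illustrative only.
n_states, n_actions, alpha, gamma = 5, 4, 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: uses the action a_next actually taken by the current policy.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(s, a, r, s_next):
    # Off-policy: uses the greedy action, regardless of the behavior policy.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])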

However, off-policy algorithms might be less effective in non-stationary environments (i.e., environments whose underlying dynamics change over time), like the cryptocurrency market. Off-policy algorithms are a good choice if you have a lot of historical data and the market dynamics don’t change too rapidly. On the other hand, if the market is highly volatile and changes rapidly, on-policy learning could be more effective due to its ability to adapt continuously.

Examples of on-policy model-free algorithms:

  1. A3C (Asynchronous Advantage Actor-Critic) uses two neural networks: an actor that decides which action to take and a critic that determines how good the action was. The actor updates the policy based on the advantage (the difference between the actual return and the estimated return), and the critic estimates the return.
  2. TRPO (Trust Region Policy Optimization) constrains each policy update to a trust region, ensuring the update is never too large and preventing instability.
  3. PPO (Proximal Policy Optimization) combines the best of both worlds from A3C and TRPO. It limits the policy change at each update to avoid harmful large updates.

Proximal Policy Optimization (PPO)

We will be using PPO throughout this blog series. Its design ensures a stable and consistent improvement of the policy and value networks over the course of the entire training procedure. Given sufficient training time, PPO often converges to a better final policy than many other RL algorithms. PPO also supports a discrete action space, which aligns with our trading strategy.

In the Proximal Policy Optimization algorithm, two neural networks are typically involved:

  1. Policy network (Actor): the Policy network is responsible for generating the agent’s actions. Given the state of the environment, it outputs a probability distribution over the available actions, and the agent selects an action based on this distribution. Training adjusts the Policy network parameters to produce actions that lead to higher rewards. The Policy network solves what is essentially a classification problem: it receives the observation state (a flattened 1D feature vector) as input and predicts one of four possible classes (actions).
  2. Value network (Critic): the Value network estimates the expected return (future rewards) of being in a given state or taking an action in a given state. These estimates are used to compute the Advantage function, which measures how much better an action is compared to the average action in that state. The Advantage function is then used to update the Policy network in a direction that improves the expected return. The Value network solves a regression problem: it receives a feature vector and outputs a single number (the estimated state value, from which the Advantage is computed).

The Policy network and Value network work together to allow the agent to learn an optimal policy that maximizes the expected return. The Policy network proposes actions, and the Value network critiques them, providing a signal used to improve the policy. Only the Policy network (Actor) is needed at inference time.
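
For intuition, here is a small sketch of how advantages are derived from the Critic’s value estimates. It uses standard Generalized Advantage Estimation (the config below enables it via use_gae=True); this is not the author’s exact code, gamma matches the config’s 0.995, and the lambda value is illustrative.

import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.995, lam=0.95):
    """Generalized Advantage Estimation from a rollout of rewards and value estimates."""
    values = np.append(values, last_value)                # V(s_0), ..., V(s_T)
    deltas = rewards + gamma * values[1:] - values[:-1]   # one-step TD errors
    advantages = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running       # discounted sum of TD errors
        advantages[t] = running
    return advantages

# Example: a 3-step rollout with a single reward at the end
adv = gae_advantages(rewards=np.array([0.0, 0.0, 1.0]),
                     values=np.array([0.1, 0.2, 0.5]),
                     last_value=0.0)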

In some implementations of PPO, a single network with two sets of output layers (one for the policy and one for the value function) is used instead of two separate networks. This can be adjusted using the parameter training.model.vf_share_layers in config.py.

The PPO algorithm uses two main mechanisms to keep the updated policy close to the current policy and avoid instability (both are sketched in code right after this list):

  1. Clipped Probability Ratio: PPO proposes an objective function that limits the new policy from deviating too much from the old policy. It does this by clipping the policy ratio (the ratio between new and old policy probabilities). The clipping ensures that the ratio stays close to 1, and the new policy remains ‘proximal’ to the old policy. This reduces the risk of harmful updates.
  2. Adaptive KL Penalty: PPO also uses a mechanism called Adaptive KL Penalty, which constrains the step size of the policy update. It checks the Kullback-Leibler (KL) divergence between the new and old policy. If the KL divergence exceeds a certain threshold, the new policy is deviating too far from the old one, and the update is penalized.
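
The clipped objective from the PPO paper can be written in a few lines. This is a didactic numpy sketch, not RLlib’s actual loss implementation; the clip_param and kl_coeff defaults simply mirror the values in my config below.

import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_param=0.3):
    """Clipped surrogate objective from the PPO paper (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)                  # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1.0 - clip_param, 1.0 + clip_param)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

def kl_penalty(kl_divergence, kl_coeff=0.05):
    """Adaptive KL term subtracted from the objective; kl_coeff is adapted toward kl_target."""
    return kl_coeff * kl_divergence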

PPO config

My PPO config is defined in the config.py file.

This set of parameters is a good starting point for verifying that training proceeds as expected:

ppo_config = (
    PPOConfig()
    # .rl_module(_enable_rl_module_api=False)
    .framework('tf')
    .environment(
        env='CryptoEnv',
        env_config={
            "dataset_name": "dataset",  # .npy files should be in ./data/dataset/
            "leverage": 2,  # leverage for perpetual futures
            "episode_max_len": 168 * 2,  # train episode length, 2 weeks
            "lookback_window_len": 168,  # 1 week
            "train_start": [2000, 7000, 12000, 17000, 22000],
            "train_end": [6000, 11000, 16000, 21000, 26000],
            "test_start": [6000, 11000, 16000, 21000, 26000],
            "test_end": [7000, 12000, 17000, 22000, 29377-1],
            "order_size": 50,  # dollars
            "initial_capital": 1000,  # dollars
            "open_fee": 0.12e-2,  # taker_fee
            "close_fee": 0.12e-2,  # taker_fee
            "maintenance_margin_percentage": 0.012,  # 1.2 percent
            "initial_random_allocated": 0,  # opened initial random long/short position up to initial_random_allocated $
            "regime": "training",
            "record_stats": False,  # True for backtesting
        },
        observation_space=gym.spaces.Box(
            low=-np.inf,
            high=np.inf,
            shape=(183 * 168,),
            dtype=np.float32
        ),
        action_space=gym.spaces.Discrete(4),
    )
    .training(
        lr=5e-5,
        gamma=0.995,  # 1.
        grad_clip=30.,
        entropy_coeff=0.03,
        kl_coeff=0.05,
        kl_target=0.01,  # not used if kl_coeff == 0.
        num_sgd_iter=10,
        use_gae=True,
        clip_param=0.3,  # larger values for more policy change
        vf_clip_param=10,
        train_batch_size=15 * 8 * 168,  # num_rollout_workers * num_envs_per_worker * rollout_fragment_length
        shuffle_sequences=True,
        model={
            "vf_share_layers": False,
            "custom_model": "TransformerModelAdapter",
            "custom_model_config": {
                "d_history_flat": 168 * 183,
                "num_obs_in_history": 168,
                "d_obs": 183,
                "d_time": 2,  # hour, day
                "d_account": 2,  # unrealized_pnl, available_balance
                "d_candlesticks_btc": 34,  # TA indicators
                "d_candlesticks_ftm": 34,  # TA indicators
                "d_santiment_btc_1h": 30,
                "d_santiment_btc_1d": 26,
                "d_santiment_ftm_1h": 27,
                "d_santiment_ftm_1d": 28,
                "d_obs_enc": 256,
                "num_attn_blocks": 3,
                "num_heads": 4,
                "dropout_rate": 0.1
            }
        }
    )
    .evaluation(
        evaluation_interval=1,
        evaluation_duration=8,
        evaluation_duration_unit='episodes',
        evaluation_parallel_to_training=False,
        evaluation_config={
            "explore": False,
            "env_config": {
                "regime": "evaluation",
                "record_stats": False,  # True for backtesting
                "episode_max_len": 168 * 2,  # validation episode length
                "lookback_window_len": 168,
            }
        },
        evaluation_num_workers=4
    )
    .rollouts(
        num_rollout_workers=15,
        num_envs_per_worker=8,
        rollout_fragment_length=168,
        batch_mode='complete_episodes',
        preprocessor_pref=None
    )
    .resources(
        num_gpus=1
    )
    .debugging(
        log_level='WARN'
    )
)

There are several hyperparameters in the configuration that are particularly important:

  • lr (learning rate): the step size at each iteration while moving toward a minimum of a loss function.
  • grad_clip: clips the gradients by global norm to prevent exploding gradients and overly drastic updates (the epsilon from the PPO paper, which clips the policy ratio itself, corresponds to clip_param). If gradient clipping is too loose, updates may be large enough to destabilize learning; if it is too tight, the policy may not learn effectively.
  • num_sgd_iter: equivalent to the number of epochs, defining the number of times to iterate through collected batches of data during the SGD (Stochastic Gradient Descent) process. If this value is too low, it may lead to underfitting; if it’s too high, it may lead to overfitting.
  • train_batch_size: batch size for all learner workers. A smaller batch size may lead to a noisier gradient estimation, while a larger one may provide a more accurate estimation but at the cost of computational resources.
  • gamma (discount factor): determines the importance of future rewards. A high gamma makes the agent consider future rewards more heavily, promoting long-term strategies, while a low gamma makes the agent prioritize immediate rewards.
  • entropy_coeff (entropy coefficient): This provides a bonus reward for more random actions, encouraging exploration. If you’re finding that your agent isn’t exploring the environment enough, you might need to increase this coefficient.

These hyperparameters can significantly affect the performance and stability of the PPO algorithm. You can learn more about PPO-specific config parameters here. The common config for all Ray algorithms is here.
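
As a quick illustration of the gamma bullet above, the discounted return weights a reward received k steps in the future by gamma^k, so a higher gamma keeps distant rewards relevant (the numbers here are illustrative, not from my experiments):

def discounted_return(rewards, gamma):
    # G = sum_k gamma^k * r_k
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [0.0] * 167 + [1.0]              # a single reward at the end of a 168-hour week
print(discounted_return(rewards, 0.995))   # ~0.43: the distant reward still influences learning
print(discounted_return(rewards, 0.90))    # ~2e-8: the distant reward is effectively ignored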

Neural Network Architecture

RLlib supports the use of recurrent/attention models for all its policy-gradient algorithms (A3C, PPO, PG, IMPALA), and the necessary sequence processing support is built into its policy evaluation utilities.

You can start using RLlib models out of the box by setting the use_lstm or use_attention parameter to True in your model config. The observation state will then be processed by an LSTM layer or an attention (GTrXL) network, respectively.
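
For example, something along these lines in the model config (the key names follow RLlib’s documented model defaults; the values are illustrative, not tuned):

# Built-in sequence models, no custom code required (values illustrative).
lstm_model_config = {
    "use_lstm": True,
    "max_seq_len": 168,       # how many timesteps the RNN sees per sequence
    "lstm_cell_size": 256,
}

attention_model_config = {
    "use_attention": True,
    "attention_num_transformer_units": 2,
    "attention_dim": 256,
    "attention_num_heads": 4,
}

# Either dict can be passed as model=... in .training(), in place of the
# custom_model settings shown in the config above.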

The appeal of an LSTM is obvious, since we work with time-series data. Transformer architectures are very promising for our use case, but training a transformer takes much more time and computational resources. An Nvidia RTX 2080 Ti with 11 GB of memory is enough for training with the published config. The NeuralForecast library offers a comprehensive list of architectures you can experiment with.

If you want to create custom policy and value models, follow this guide.

Custom policy and value models

I’ve implemented more advanced models with superior performance, one of which is open-sourced in the transformer.py file. If you are interested in my research, feel free to contact me.

My neural network architecture is designed for processing various types of time-series data. It comprises multiple custom layers and follows a Transformer-based pattern to encode and process sequential data efficiently, covering pre-processing, encoding, and the production of actionable outputs.

Firstly, let’s outline the significant components and their role in the architecture:

  1. InputSplit layer takes a flat tensor of observation history and splits it into logically separate segments such as time, account information, TA indicators for BTC and FTM, and Santiment data for BTC and FTM over 1-hour and 1-day intervals.
  2. TimeEncoding and AccountEncoding layers encode time and account information into denser representations.
  3. CandlesticksEncoding and SantimentEncoding handle the encoding of TA data and Santiment data for BTC and FTM.
  4. ObsEncodingInternal and ObsEncodingExternal layers combine the internal (time and account) and external (TA and Santiment data) encodings to produce representations that account for internal state and market conditions.
  5. Stem and StemOutputConcatenation serve as the entry point for feeding the encoded data into the transformer model. They process and combine the internal and external encodings before passing them down the pipeline.
  6. AttentionBlock and TransformerOutputTimePooling form the core of the transformer architecture, applying self-attention mechanisms to capture dependencies across different points in time and different features within the data. The TransformerOutputTimePooling layer aggregates the transformer’s encoded information to focus on the most recent state for decision-making.
  7. ActionBranch and ValueBranch layers represent the heads of the network responsible for producing actionable outputs. The ActionBranch predicts the next action based on the encoded input sequence, while the ValueBranch estimates the expected return (value) from taking such actions.

TransformerModel is a top-level layer that combines all the components mentioned above into a coherent model architecture, feeding forward through processing units, attention blocks, and finally to output layers for actions and value estimation.

This architecture is designed with flexibility in mind, allowing for adjustment of dimensions, encoding depths, and the number of attention blocks to adapt to different datasets and problem settings. It is particularly suited for RL environments where the model needs to make sequential decisions based on historical data and the most recent internal and external states. The use of attention mechanisms ensures that the network can focus on the most relevant information over the input sequence, improving its capacity to predict complex time-series patterns effectively.
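
For readers who want to wire up their own model, the skeleton below shows how a custom model such as TransformerModelAdapter is typically registered and exposed through RLlib’s ModelV2 API. This is a minimal sketch: the internal layers are placeholders, not the transformer architecture described above, which lives in transformer.py.

import tensorflow as tf
from ray.rllib.models import ModelCatalog
from ray.rllib.models.tf.tf_modelv2 import TFModelV2


class TransformerModelAdapter(TFModelV2):
    """Minimal skeleton: RLlib calls forward() for action logits and value_function() for the critic."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        cfg = model_config["custom_model_config"]
        # Placeholder encoder standing in for the transformer described above.
        self.encoder = tf.keras.layers.Dense(cfg["d_obs_enc"], activation="relu")
        self.action_branch = tf.keras.layers.Dense(num_outputs)  # logits over the 4 actions
        self.value_branch = tf.keras.layers.Dense(1)             # scalar state-value estimate
        self._value_out = None

    def forward(self, input_dict, state, seq_lens):
        encoded = self.encoder(input_dict["obs_flat"])
        self._value_out = self.value_branch(encoded)
        return self.action_branch(encoded), state

    def value_function(self):
        return tf.reshape(self._value_out, [-1])


# Makes "custom_model": "TransformerModelAdapter" in the PPO config resolve to this class.
ModelCatalog.register_custom_model("TransformerModelAdapter", TransformerModelAdapter)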

Training

In RLlib, there are two ways to interact with the algorithm: you can use the basic Python API, or you can use Ray Tune to tune the hyperparameters of your reinforcement learning algorithm. I prefer the latter.

Command to start training:

python train.py

Ray Tune API example:

tune.run(
    "PPO",
    stop={"timesteps_total": int(1e10)},
    config=ppo_config,
    local_dir="./results",  # default folder "~/ray_results"
    checkpoint_freq=12,
    checkpoint_at_end=False,
    keep_checkpoints_num=None,
    verbose=2,
    reuse_actors=False,
    # resume=True,
    # restore="./results/PPO/PPO_CryptoEnv_1a171_00000_0_2024-05-02_11-51-01/checkpoint_000012"
)
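
One convenience of going through Tune is that any config value can be replaced by a search space. For example, a learning-rate sweep could look like this (illustrative only, not part of the published train.py):

from ray import tune

# Illustrative sweep: Tune launches one PPO trial per learning-rate candidate.
ppo_config.training(lr=tune.grid_search([1e-5, 5e-5, 1e-4]))

tune.run(
    "PPO",
    stop={"timesteps_total": int(1e8)},
    config=ppo_config,
    local_dir="./results",
)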

The RL feedback loop repeatedly collects data for agent policies, trains the policies on these collected data, and makes sure the policies’ weights are kept in sync. The environment data contains observations, taken actions, received rewards and so-called done flags, indicating the boundaries of different episodes the agents play through. The training iterates over a loop of action -> reward -> next state -> train -> repeat, until the stop condition is hit.
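For completeness, the basic Python API mentioned above drives the same loop manually. This is a minimal sketch; the published train.py uses Tune instead.

# Minimal sketch of the basic Python API (the alternative to Tune).
algo = ppo_config.build()

for i in range(1000):
    result = algo.train()  # one iteration: collect rollouts, then run the SGD updates
    print(i, result["episode_reward_mean"])
    if (i + 1) % 12 == 0:  # mirrors checkpoint_freq=12 from the Tune example above
        checkpoint = algo.save()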

TensorBoard logs

I usually monitor TensorBoard metrics until they start to saturate, and then I stop the training. I do not apply early stopping because the monitored metrics are noisier than in a supervised learning setting, and they can start improving again after a considerable amount of training time. Trained models and TensorBoard logs are saved under local_dir every checkpoint_freq epochs. A TensorBoard file from one of my experiments is included.

Notes on the most important TensorBoard metrics:

average rewards per training epoch; image by author

The average positive reward during training indicates that the agent is profiting from the training dataset since the agent reward is proportional to realized_pnl when closing a position or when the episode end condition is met.

average rewards per validation epoch; image by author

The average validation rewards are not as smooth as training rewards, but you can notice that most of the episode rewards are above zero. This means the agent makes a profit on the validation dataset despite high fees (twice the actual taker fees). Again, the rewards are proportional to the agent’s realized profit.

entropy per training epoch; image by author

The decline in entropy indicates that the agent becomes more confident in taking specific actions.

policy loss per training epoch; image by author
value function loss per training epoch; image by author

Policy loss and Value function loss are decreasing as expected.

value function explained variance per training epoch; image by author

Value function explained variance increases gradually and then saturates near 0.99 as expected.

average episode length per training epoch; image by author

Sometimes, the length of a training episode is shorter than episode_max_len, which means liquidation occurred. The agent encounters liquidations during training and learns that they are costly.

In train.py, you can either resume training from the last RLlib run or restore a specific checkpoint path from which to continue training.

Conclusion

This article provided a comprehensive overview of the training process involved in deep reinforcement learning for crypto trading. We explored the reasoning behind choosing RLlib from Ray as the reinforcement learning library, along with the selection of the Proximal Policy Optimization (PPO) algorithm. The PPO configuration parameters were explained in detail, highlighting their impact on the algorithm’s performance.

We discussed the implementation of a custom neural network architecture designed for processing time-series data effectively. This architecture utilizes a transformer-based approach to capture dependencies within the data and make informed decisions.

Finally, the training process itself was explored, including the feedback loop and key metrics monitored through TensorBoard logs. Understanding these metrics is crucial for evaluating the agent’s performance and determining the optimal stopping point for training.

This concludes Part 3: Training. See you in Part 4: Backtesting.

If you are interested in cooperation, feel free to contact me.

Contacts

My email: alex.kaplenko@sane-ai.dev

My LinkedIn: https://www.linkedin.com/in/alex-sane-ai/

GitHub: https://github.com/xkaple00/deep-reinforcement-learning-for-crypto-trading

Link to support Ukraine: https://war.ukraine.ua/donate/
