Deep Reinforcement Learning for Crypto Trading

Part 4: Backtesting

Alex K · Published in Coinmonks · May 17, 2024


Disclaimer: The information provided herein does not constitute financial advice. All content is presented solely for educational purposes.

Introduction

This is the fourth part of my blog post series on reinforcement learning for crypto trading.

This article explains the backtesting framework I created.

The results presented in this blog post were obtained using different PPO hyperparameters, a different policy network, and a different dataset than those described in the previous blog posts. The dataset has a 5-minute resolution instead of 1 hour. The reward function is a weighted sum of realized_pnl and unrealized_pnl.

As I mentioned in Part 0: Introduction, my main goal is to connect with potential employers or investors and ultimately become a professional quant.


Multi-model analysis

After training is completed, we must evaluate our algorithm on historical data (backtesting).

During training, checkpoints are saved every checkpoint_freq training iterations to the ./results/PPO folder:

tune.run(
    "PPO",
    stop={"timesteps_total": int(1e10)},
    config=ppo_config,
    local_dir="./results",  # default folder is "~/ray_results"
    checkpoint_freq=12,  # save a checkpoint every 12 training iterations
    checkpoint_at_end=False,
    keep_checkpoints_num=None,  # keep all checkpoints
    verbose=2,
    reuse_actors=False,
    # resume=True,
    # restore="./results/PPO/PPO_CryptoEnv_1a171_00000_0_2024-05-02_11-51-01/checkpoint_000012",
)
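Later, the backtest notebook iterates over a list of checkpoint paths. Here is a minimal sketch of how such a list can be built, assuming the default Tune directory layout shown above (the trial folder name, with its hash and timestamp, differs per run, and the repo may already provide its own helper):

import glob
import os

# Collect all checkpoint directories produced by tune.run and sort them by
# checkpoint number.
trial_dir = glob.glob("./results/PPO/PPO_CryptoEnv_*")[0]
checkpoints_paths = sorted(
    glob.glob(os.path.join(trial_dir, "checkpoint_*")),
    key=lambda path: int(path.rsplit("_", 1)[-1]),
)
print("found", len(checkpoints_paths), "checkpoints")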

You can end up with a couple hundred saved checkpoints after a few days of training. The question is which checkpoint to choose for real trading on the crypto exchange. The last saved checkpoint does not necessarily have the best risk/reward ratio on the validation dataset, since the policy network tries different strategies to maximize cumulative reward on the training dataset. I developed logic under the MULTI MODEL TEST section of backtest.ipynb to calculate a set of metrics for each checkpoint on the validation dataset using the StatisticsRecorder class:

  • annual_volatility
  • sharpe_ratio
  • calmar_ratio
  • stability
  • max_drawdown
  • omega_ratio
  • sortino_ratio
  • tail_ratio
  • daily_value_at_risk
  • final_return (total)
  • final_return_long
  • final_return_short

The Pyfolio library is used to compute these metrics. A JSON file with all calculated metrics is saved to disk for further analysis.
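As an illustration, here is a minimal sketch of how such statistics can be produced with Pyfolio and dumped to JSON. The function name, the placeholder start date, and the equity_per_checkpoint dictionary are assumptions for the example; the repo's StatisticsRecorder class may organize this differently. Pyfolio annualizes assuming daily returns, so the per-timestep equity curve is resampled to daily first.

import json

import pandas as pd
from pyfolio import timeseries

def equity_to_stats(account_equity, freq="5min"):
    # Build a date-indexed equity series (placeholder start date), convert it to
    # daily returns, and let pyfolio compute annual_volatility, sharpe_ratio,
    # calmar_ratio, stability, max_drawdown, omega_ratio, sortino_ratio,
    # tail_ratio and daily_value_at_risk.
    equity = pd.Series(
        account_equity,
        index=pd.date_range("2024-01-01", periods=len(account_equity), freq=freq),
    )
    daily_returns = equity.resample("1D").last().pct_change().dropna()
    stats = timeseries.perf_stats(daily_returns)
    return {key.lower().replace(" ", "_"): float(value) for key, value in stats.items()}

# equity_per_checkpoint is a hypothetical dict: checkpoint path -> recorded equity curve
all_stats = {path: equity_to_stats(curve) for path, curve in equity_per_checkpoint.items()}

with open("multi_model_stats.json", "w") as f:
    json.dump(all_stats, f, indent=2)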

To activate statistics recording, set the record_stats flag to True in config.py:

env_config={
    "dataset_name": "dataset",  # .npy files should be in ./data/dataset/
    "leverage": 2,  # leverage for perpetual futures
    "episode_max_len": 168 * 2,  # train episode length, 2 weeks
    "lookback_window_len": 168,
    "train_start": [2000, 7000, 12000, 17000, 22000],
    "train_end": [6000, 11000, 16000, 21000, 26000],
    "test_start": [6000, 11000, 16000, 21000, 26000],
    "test_end": [7000, 12000, 17000, 22000, 29377-1],
    "order_size": 50,  # dollars
    "initial_capital": 1000,  # dollars
    "open_fee": 0.12e-2,  # taker fee
    "close_fee": 0.12e-2,  # taker fee
    "maintenance_margin_percentage": 0.012,  # 1.2 percent
    "initial_random_allocated": 0,  # open an initial random long/short position of up to initial_random_allocated dollars
    "regime": "training",
    "record_stats": True,  # True for backtesting
}
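For intuition, here is a minimal sketch of what a StatisticsRecorder-style helper could look like inside the environment when record_stats is enabled. The field names mirror the info keys consumed later in the backtest loop, but the repo's actual class may differ.

class StatisticsRecorder:
    """Minimal sketch of a per-episode statistics recorder (assumed structure)."""

    def __init__(self, record_stats: bool = True):
        self.record_stats = record_stats
        self.equity = []
        self.realized_pnl_long = []
        self.realized_pnl_short = []

    def update(self, equity, realized_pnl_long=0.0, realized_pnl_short=0.0):
        # called by the environment once per step
        if not self.record_stats:
            return
        self.equity.append(equity)
        self.realized_pnl_long.append(realized_pnl_long)
        self.realized_pnl_short.append(realized_pnl_short)

    def to_info(self):
        # exposed through the env's info dict, as used in the backtest loop
        return {
            "equity": self.equity,
            "reward_realized_pnl_long": self.realized_pnl_long,
            "reward_realized_pnl_short": self.realized_pnl_short,
        }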

The metrics I pay close attention to are:

  • final_return (should be as high as possible)
  • max_drawdown (should be as low as possible)
  • ratio between final_return_long and final_return_short

Ideally, I want the bot to be profitable on shorts and not lose money on longs during a downtrend, and vice versa during an uptrend.
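As a quick illustration (the file name, metric keys, and scoring below are just an example), the saved JSON can be used to rank checkpoints by a simple risk/reward score:

import json

with open("multi_model_stats.json") as f:
    all_stats = json.load(f)

def risk_reward(stats):
    # higher final return and shallower drawdown -> higher score;
    # the exact weighting is a subjective choice
    return stats["final_return"] / max(abs(stats["max_drawdown"]), 1e-6)

best = sorted(all_stats.items(), key=lambda item: risk_reward(item[1]), reverse=True)[:5]
for path, stats in best:
    print(path, round(risk_reward(stats), 2))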

The for loop below evaluates each saved checkpoint:

# lists accumulated across checkpoints (one entry per checkpoint)
final_return, final_return_long, final_return_short = [], [], []

for i, checkpoint_path in enumerate(checkpoints_paths):
    print("index:", i, "/", len(checkpoints_paths))
    print("checkpoint_path:", checkpoint_path)

    agent.load_checkpoint(checkpoint_path)

    done = False
    account_equity = []
    account_balance = ppo_config["env_config"]["initial_capital"]
    total_reward = 0
    obs = env.reset()[0]

    # run one full episode on the validation dataset with this checkpoint
    while not done:
        action = agent.compute_single_action(observation=obs, prev_action=None, prev_reward=None)

        obs, reward, done, _, info = env.step(action)
        account_balance = info["equity"][-1]
        account_equity.append(account_balance)

        total_reward += reward

    # equity at the end of the episode and PnL split by position side
    final_return.append(account_equity[-1])
    final_return_long.append(np.sum(info["reward_realized_pnl_long"]) + info["unrealized_pnl_long"][-1])
    final_return_short.append(np.sum(info["reward_realized_pnl_short"]) + info["unrealized_pnl_short"][-1])

    # pyfolio-based performance statistics for this checkpoint
    perf_stats_all = backtest_stats(account_equity)

Results

Here are the results of one of my experiments (different PPO hyperparameters, policy network, and dataset than described in the previous blog posts). The dataset resolution is 5 minutes, initial_balance is $5000, and order_size is $50, so the bot can invest 1% of its initial balance per timestep.

evaluation dataset FTM price (strong downtrend for 55 days); image by author

There are 188 checkpoints saved during training.

equity at the end of the episode for each of the 188 saved checkpoints:

final equity per checkpoint; image by author

max_drawdown for each of the 188 saved checkpoints:

max_drawdown per checkpoint; image by author

It is clear that as training progresses, the policy network explores different trading strategies. The strategies learned by the first 50 checkpoints were almost liquidated (90% drawdown) in such a strong downtrend, and their final equity was about $2000 (a loss of $3000). The policies for checkpoints in the range [115:160] look much better: max_drawdown is only around 20% and the equity at the end of the episode is about $12000 (a $7000 profit).

Single-model analysis

After choosing a few checkpoints with the best risk/reward ratio and metrics such as the Sharpe and Sortino ratios on the validation dataset, we can perform a more in-depth analysis under the SINGLE MODEL BACKTEST section of backtest.ipynb. Let's take checkpoint number 158. From the calculated statistics, its max_drawdown is -0.182 (18.2%), the profit on long positions is $1213.91, and the profit on short positions is $6459.70. Despite the strong downtrend, the agent appears profitable, opening long positions at local bottoms.

We can visualize the equity on the validation dataset. Equity takes both realized and unrealized profits into account.
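A minimal matplotlib sketch of this plot, using the account_equity list recorded in the evaluation loop above (the notebook's charts are more elaborate):

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))
plt.plot(account_equity, label="account equity")
plt.axhline(ppo_config["env_config"]["initial_capital"], color="gray",
            linestyle="--", label="initial capital")
plt.xlabel("timestep")
plt.ylabel("equity, $")
plt.legend()
plt.show()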

account equity; image by author

The wallet balance takes only realized profits into account.

wallet_balance; image by author

We can also visualize profits realized for long and short positions separately.

realised_pnl_long; image by author
realised_pnl_short; image by author

It’s also possible to visualize the cumulative rewards the agent receives. The reward function is a combination of realized_pnl and unrealized_pnl. The highlighted areas correspond to periods of time when the agent holds a position at a loss and receives small negative rewards each time step that accumulate over time.

agent cumulative reward; image by author

In the notebook, a few more metrics can be visualized: reward_realised_pnl_short and reward_realised_pnl_long (per timestep, not cumulative), and unrealised_pnl_short and unrealised_pnl_long to see periods of drawdown. I also calculate how many times the agent took each action: during a downtrend it should open short positions more frequently than long positions, and vice versa.
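A small sketch of the action-frequency count, assuming the chosen actions were also appended to an actions_taken list inside the evaluation loop:

import numpy as np

actions = np.asarray(actions_taken)
values, counts = np.unique(actions, return_counts=True)
for action, count in zip(values, counts):
    print(f"action {action}: {count} times ({100 * count / len(actions):.1f}%)")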

Cross-validation

The SINGLE MODEL SLIDING WINDOW BACKTEST section offers an option to do a backtest using cross-validation. That is, instead of evaluating over the whole validation dataset at once, a sliding window of length episode_length is shifted by episode_start_shift timesteps in each iteration of the for loop.

The ppo_config["env_config"] values are overridden for each sliding-window interval:

episode_start_shift = 168
episode_length = 168 * 2
start_index = 26000
end_index = 29377

account_equity = []

for test_start in range(start_index, end_index - 1, episode_start_shift):
    ppo_config["env_config"]["test_start"] = [test_start]
    ppo_config["env_config"]["test_end"] = [min(test_start + episode_length, end_index) - 1]
    ppo_config["env_config"]["episode_max_len"] = (
        ppo_config["env_config"]["test_end"][0] - ppo_config["env_config"]["test_start"][0]
    )

This approach is more informative because we can test how well the agent performs under different starting conditions.
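For completeness, here is a hedged sketch of how the rest of the loop body can look: re-create the environment with the overridden window, run one episode with the chosen checkpoint, and record the final equity. The CryptoEnv constructor call is an assumption based on the registered environment name; the repo may build the environment differently.

    # inside the sliding-window for loop, after overriding env_config:
    env = CryptoEnv(ppo_config["env_config"])  # assumed constructor
    obs = env.reset()[0]
    done = False
    window_equity = ppo_config["env_config"]["initial_capital"]

    while not done:
        action = agent.compute_single_action(observation=obs, prev_action=None, prev_reward=None)
        obs, reward, done, _, info = env.step(action)
        window_equity = info["equity"][-1]

    # equity at the end of this cross-validation interval
    account_equity.append(window_equity)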

Below is a result of cross-validation with the following parameters: episode_start_shift is 512, episode_length is 4096 (approximately 2 weeks at 5-minute timesteps), and initial_balance is $5000. The chart shows the equity at the end of each cross-validation interval.

equity per sliding window interval; image by author

As we can see, only 1 of the 32 episodes ended with significant losses (episode number 15). An equity of $6500 corresponds to a profit of $1500, i.e. 1500/5000 * 100 = 30% return in two weeks.

Plot trades

I've also created a framework to visualize individual trades during backtesting, which helps me better understand the learned policy from a human perspective and build trust in the AI's decisions.

The orange line is the relative FTM price, the blue line is the cumulative realized_pnl, and the position's average_price is shown in purple. Vertical red/green lines correspond to closing the entire position (action number 4), regardless of its size. Green/red markers correspond to adding to or reducing a position. The charts are best viewed zoomed in.
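A rough matplotlib sketch of this kind of chart; the prices, realized_pnl, average_price, and actions_taken series are assumed to have been recorded per timestep during the backtest:

import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(14, 5))
plt.plot(np.asarray(prices) / prices[0], color="orange", label="relative price")
plt.plot(np.cumsum(realized_pnl), color="blue", label="cumulative realized_pnl")
plt.plot(np.asarray(average_price) / prices[0], color="purple", label="relative average_price")
for t in np.where(np.asarray(actions_taken) == 4)[0]:
    plt.axvline(t, color="red", alpha=0.3)  # full position close (action 4)
plt.legend()
plt.show()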

long trades; image by author
short trades; image by author

Since different learned strategies can be better suited to different market conditions and there is no single strategy to rule them all, I usually choose a few different checkpoints to deploy and diversify risk.

Conclusion

This article presented a framework for backtesting a deep reinforcement learning agent trained for crypto trading. The backtesting process allows us to evaluate the performance of the trained agent on historical data and select the best performing models for potential live trading.

We explored different metrics to assess the agent’s performance, including final return and maximum drawdown. We also visualized the agent’s equity, wallet balance, and realized/unrealized profits to gain a deeper understanding of its trading behavior.

The backtesting framework also incorporates cross-validation techniques to assess the agent’s performance under various market conditions. Finally, we discussed the importance of visualizing individual trades to gain insights into the learned policy and build trust in the agent’s decision-making capabilities.

By employing this backtesting framework, we can effectively identify promising trading strategies and refine the agent’s behavior before deploying it in a live trading environment.

This ends Part 4: Backtesting. See you in the next part, Part 5: Live trading.

If you are interested in cooperation, feel free to contact me.

Contacts

My email: alex.kaplenko@sane-ai.dev

My LinkedIn: https://www.linkedin.com/in/alex-sane-ai/

GitHub: https://github.com/xkaple00/deep-reinforcement-learning-for-crypto-trading

Link to support Ukraine: https://war.ukraine.ua/donate/
