Deep Reinforcement Learning for Crypto Trading

Part 2: Trading Strategy

Alex K · Published in Coinmonks · May 17, 2024


Disclaimer: The information provided herein does not constitute financial advice. All content is presented solely for educational purposes.

Introduction

This is the second part of my blog post series on reinforcement learning for crypto trading.

This article explains the trading strategy and the reinforcement learning environment.

As I mentioned in Part 0: Introduction, my main goal is to connect with potential employers or investors and ultimately become a professional quant.


Trading Strategy

Bots trade perpetual futures contracts with leverage 2. I experimented with different leverage values, but we will stick to leverage 2 throughout the entire blog series; it provides a meaningful risk-reward ratio for trading volatile altcoins.

The published strategy operates in One-Way mode: the bot can hold either a long or a short position at any given time, but not both. I have also created bots that trade only longs, bots that trade only shorts, and bots operating in Hedge mode; these are not included in the published version.

According to my experiments, “long” bots are very profitable in a bull market (even a poorly trained agent can be profitable if it is only allowed to open long positions during an uptrend), while well-trained agents rarely open long positions during a bear market. Bots operating in Hedge mode often learn a hedging strategy and simultaneously hold long and short positions of equal size.

Dollar-cost averaging (DCA) is another concept I use. The idea is not to let bots invest all their available capital at once, but rather in small portions per timestep. Bots operate on a 1-hour timeframe, so every hour they can open or close long/short positions, or choose to do nothing.

I use the concept of a position’s average_price. The average price is recalculated when we add to the position (a sell action for a short position or a buy action for a long position). It is not recalculated when the position size is reduced (a buy action for shorts or a sell action for longs).
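To make the bookkeeping concrete, here is a minimal sketch of the average-price update when adding to a long position (the function and variable names are mine for illustration, not the environment code; reducing the position simply leaves the average price unchanged):

def update_average_price_long(avg_price: float, coins: float, add_value: float, price_ask: float) -> tuple[float, float]:
    """Recalculate the average entry price when adding to a long position.

    avg_price: current average entry price (0 if no position yet)
    coins: coins currently held long
    add_value: USDT value of the new buy order
    price_ask: current ask price used for the buy
    """
    new_coins = add_value / price_ask
    position_value = coins * avg_price  # value of the existing position at entry prices
    new_avg_price = (position_value + new_coins * price_ask) / (coins + new_coins)
    return new_avg_price, coins + new_coins

# Example: hold 10 coins bought at $2.0, then add $50 worth at an ask price of $2.5
avg, coins = update_average_price_long(avg_price=2.0, coins=10.0, add_value=50.0, price_ask=2.5)
# avg == 2.333..., coins == 30.0; reducing the position later would leave avg unchanged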

You can freely play with different available_balance, order_size and leverage ratios. Just be careful: cryptocurrency exchanges enforce a minimum position size in coins or the USDT equivalent. For example, on Bybit you cannot open a position smaller than 1 FTM. At the time of writing, FTM’s all-time high is around $3.5, and I believe it will rise further after the Bitcoin halving in 2024, so adjust your order size to be equal to or greater than the FTM price.

There are two options for how to close a position:

  1. Reduce the position by order_size, i.e., reduce position_value by $100 (with order_size of $50 and leverage 2). The reduction can take place over multiple timesteps.
  2. Close the entire position at once, no matter how large it is. This option is useful in case of black-swan events or rapid market swings, similar to a “Panic Sell” button.

I’ve developed more advanced strategies, but the one explained in this guide is a good starting point. Market orders are used to open and close positions to keep things simple for the blog; limit orders have lower fees, however, and more complex order types enable even more advanced strategies.

Hedge mode on Bybit: long and short positions are opened simultaneously; image by author

In Hedge mode, both long and short positions are open. If the agent “predicts” that the price will rise, position_value_long is greater than position_value_short, and vice versa. If the price starts rising, the agent closes the short position at a small loss and, after a further price increase, closes the long position with a larger profit. The long/short position_value ratio itself can serve as an indicator of the agent’s “forecast” of the future market trend. In a neutral market, the values of both positions are often equal (pure hedging). This strategy works better in trending markets.

Gym environment

A reinforcement learning agent needs an environment — a set of rules it can follow. Using the Gymnasium library, I coded an appropriate environment for my strategy.

Config for the environment:

env_config = {
    "dataset_name": "dataset",               # .npy files should be in ./data/dataset/
    "leverage": 2,                           # leverage for perpetual futures
    "episode_max_len": 168 * 2,              # train episode length, 2 weeks
    "lookback_window_len": 168,              # 1 week
    "train_start": [2000, 7000, 12000, 17000, 22000],
    "train_end": [6000, 11000, 16000, 21000, 26000],
    "test_start": [6000, 11000, 16000, 21000, 26000],
    "test_end": [7000, 12000, 17000, 22000, 29377 - 1],
    "order_size": 50,                        # dollars
    "initial_capital": 1000,                 # dollars
    "open_fee": 0.12e-2,                     # taker fee
    "close_fee": 0.12e-2,                    # taker fee
    "maintenance_margin_percentage": 0.012,  # 1.2 percent
    "initial_random_allocated": 0,           # opened initial random long/short position up to initial_random_allocated $
    "regime": "training",
    "record_stats": False,                   # True for backtesting
}

The environment simulates interactions on the exchange. It has three main attributes: observation_space, action_space, and reward.

Action space

I already explained the action space in the previous section. Let’s now formalize it:

action_space = gymnasium.spaces.Discrete(4)

There are 4 discrete actions (matching the action indices in the step method shown later):

  • 0: do nothing
  • 1: “buy” (add order_size USDT to a long position, or reduce an existing short position by order_size)
  • 2: “sell” (reduce a long position by order_size, or add order_size USDT to a short position)
  • 3: close the entire position

In Hedge mode, the action space has 4 * 4 = 16 options, making exploration quadratically more complex.
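As an illustration only (the Hedge-mode bots are not part of the published code, so this encoding is an assumption), one way to represent such an action space is a flat Discrete(16) that combines a sub-action for the long leg with a sub-action for the short leg:

import gymnasium

# Hypothetical Hedge-mode encoding: 4 sub-actions for the long leg x 4 for the short leg
hedge_action_space = gymnasium.spaces.Discrete(16)

def decode_hedge_action(action: int) -> tuple[int, int]:
    """Split a flat action index into (long_action, short_action), each in {0, 1, 2, 3}."""
    return action // 4, action % 4

# Example: action 9 -> long leg "reduce" (2), short leg "add" (1)
long_action, short_action = decode_hedge_action(9)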

I’ve also experimented with continuous action spaces, but discrete action spaces have worked better so far. A continuous action space increases the complexity of the problem and lengthens the learning process: deciding how much of a particular coin to buy or sell at any given time involves infinitely many choices, making it harder for the agent to learn an optimal policy. Discrete action spaces generally lead to faster convergence during training.

Observation space

Throughout this blog series, we use a custom Transformer architecture, which is discussed in detail in Part 3: Training. The Transformer sees previous timesteps through a sliding-window approach. The neural network input is a 3D tensor of dimensions (batch_size, sequence_length, num_features).

sequence_length corresponds to the lookback_window_len parameter in config.py and is equal to 168, i.e., 168 hours or 1 week. At every timestep, the agent receives the last week of data as its observation state to make a decision.

num_features consists of two parts:

  1. The static (internal state) part is stored on disk in metrics_outfile.npy, as discussed in Part 1: Data preparation. There are 181 static features: day, hour, TA indicators, and on-chain and sentiment metrics for FTM and BTC.
  2. The dynamic (external state) part consists of 2 exchange account parameters: available_balance and unrealised_pnl. These features are dynamic because they change depending on the agent’s previous actions: available_balance tells the agent whether it can add to the existing position, and unrealised_pnl signals the reward the agent would receive if it closed the position at the current timestep.

The Scaler class scales and preprocesses the data into a valid observation format:

import numpy as np
from sklearn.preprocessing import RobustScaler


class Scaler:
    def __init__(self, min_quantile: int = 1, max_quantile: int = 99, scale_coef: float = 1e3) -> None:
        self.transformer = None  # sklearn.preprocessing.RobustScaler, fitted in reset()
        self.min_quantile = min_quantile
        self.max_quantile = max_quantile
        self.scale_coef = scale_coef

    def reset(self, state_array, reset_array):
        # don't apply the scaler to the day, hour, available_balance and unrealized_pnl columns
        self.transformer = RobustScaler(
            quantile_range=(self.min_quantile, self.max_quantile)
        ).fit(reset_array[:, 4:])
        scaled_np_array = self.step(state_array)

        return scaled_np_array

    def step(self, state_array):
        day_column = state_array[:, [0]]
        hour_column = state_array[:, [1]]
        available_balance = state_array[:, [2]] / self.scale_coef
        unrealized_pnl = state_array[:, [3]] / self.scale_coef
        transformed_indicators = np.clip(
            self.transformer.transform(state_array[:, 4:]), a_min=-10., a_max=10.
        )
        scaled_np_array = np.hstack(
            (day_column, hour_column, available_balance, unrealized_pnl, transformed_indicators)
        ).astype(np.float32)
        return scaled_np_array

The scaled_np_array is the 2D array with columns:

  • day_column
  • hour_column
  • available_balance
  • unrealized_pnl
  • transformed_indicators (includes TA indicators, on-chain and sentiment metrics)

observation_space is a flattened 1D vector with shape (183 * 168,), i.e., 183 features per timestep over a 168-hour lookback window:

observation_space = gymnasium.spaces.Box(
    low=-np.inf,
    high=np.inf,
    shape=(183 * 168,),
    dtype=np.float32,
)
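A minimal usage sketch tying the two together (the random arrays and shapes are placeholders derived from the config values, not real data): the Scaler output is flattened into the 1D observation, and the model can reshape it back into the (batch_size, sequence_length, num_features) layout the Transformer expects.

import numpy as np

# hypothetical shapes: 168 timesteps x (4 raw columns + 179 indicator columns) = 183 features
lookback_window_len, num_features = 168, 183

state_array = np.random.rand(lookback_window_len, num_features)  # last week of data
reset_array = np.random.rand(5000, num_features)                 # history used to fit the RobustScaler

scaler = Scaler()
obs_reset = scaler.reset(state_array, reset_array).flatten()     # shape (183 * 168,) = (30744,)
assert obs_reset.shape == (num_features * lookback_window_len,)

# inside the model, the flat vector is reshaped back for the Transformer:
# (batch_size, sequence_length, num_features)
batch = obs_reset.reshape(1, lookback_window_len, num_features)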

Reward function

The agent learns from the rewards it receives: positive rewards for “right” actions and negative rewards for “wrong” ones. The simplest possible reward function returns the normalized profit and loss when a position is closed; the reward is proportional to realized_pnl in this case. For example, if an agent closes a position at a $50 profit, we can divide it by the initial_balance of $1000, so the reward equals 0.05. This keeps rewards in a meaningful range of roughly [-1, 1].

Realized reward function:

reward_t = (realized_pnl_long_t + realized_pnl_short_t) / initial_balance

In code:

reward = (self.reward_realized_pnl_short + self.reward_realized_pnl_long) / self.initial_balance

The problem of sparse rewards can arise if the agent doesn’t close positions for a long period of time. It becomes more difficult for the agent to capture the correlation between actions in the past and recent rewards received for closing a position.

Another reward function gives a reward at each timestep depending on how the account equity has changed compared to the previous timestep.

Equity reward function:

reward_t = (equity_t - equity_(t-1)) / initial_balance

In this case, rewards can become very noisy and harder to learn from.

I’ve also tested a combined reward: the agent receives a reward for closing a position (realized_pnl) plus a small reward equal to a fraction of unrealized_pnl at each timestep.

reward_t = (realized_pnl_t + coef * unrealized_pnl_t) / initial_balance

In simpler terms, the coef is like a slider that adjusts how much weight we give to unrealized profits and losses compared to realized ones. For example, if the coef is set at 0.01, it means that holding a position at a loss for 100 time steps is as bad as just closing it out right away. In both cases, the cumulative negative rewards would be the same.
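A minimal sketch of this combined reward (variable names are illustrative, not the exact environment code):

def combined_reward(realized_pnl_long: float,
                    realized_pnl_short: float,
                    unrealized_pnl: float,
                    initial_balance: float,
                    coef: float = 0.01) -> float:
    """Reward for closing positions plus a small per-step reward for open (unrealized) PnL."""
    realized_term = (realized_pnl_long + realized_pnl_short) / initial_balance
    unrealized_term = coef * unrealized_pnl / initial_balance
    return realized_term + unrealized_term

# Example: no position closed this step, $50 of unrealized loss on a $1000 account
r = combined_reward(0.0, 0.0, unrealized_pnl=-50.0, initial_balance=1000.0)
# r == -0.0005; over 100 such steps the cumulative penalty equals closing the $50 loss immediately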

It’s also possible to punish the agent for holding a losing position while giving it no positive reward for holding a profitable one. Many papers use the Sharpe or Sortino ratio as a reward function; you can try that as well. The choice of reward function dramatically affects training dynamics.

When the episode ends after episode_max_len timesteps have been played, the agent closes all open positions and receives an appropriate reward.

Step & Reset methods in the training environment

I did my best to capture the intricacies of what happens under the hood on the exchange when positions are opened, closed, or liquidated. The training environment is not a one-to-one replica of a real exchange, but I believe it captures the main concepts, such as:

  • bid/ask spread
  • opening/closing fees
  • all calculations between exchange account parameters such as: equity, wallet_balance, available_balance, margin, position_value, unrealized_pnl, etc
  • liquidation process
  • logical constraints, such as: the agent cannot open a new position if available_balance is lower than order_size, etc

The environment doesn’t take funding fees or slippage into account. To compensate, the agent pays a fee twice as large as the real one for opening/closing positions. At the time of writing, the non-VIP taker fee on Bybit is 0.06%; I use 0.12% during training.

As the saying goes:

“The more you sweat in training, the less you bleed in combat.”

The bid/ask spread is constant rather than taken from the exchange’s real order book: in my environment, the bid and ask prices deviate from the mid price by taker_fee * 2.

During training, the algorithm randomly samples a starting timestep for the episode from the training dataset and then plays for episode_max_len timesteps, calling the step method at every timestep. The episode length (336 timesteps) corresponds to 2 weeks. After that, the reset method returns the agent to the initial state and the process repeats. If a liquidation happens in the middle of an episode, the agent receives a negative reward and the episode ends early (the episode is then shorter than 336 timesteps).
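A minimal sketch of how such episode sampling could work with the train_start/train_end intervals from the config (my own illustration of the idea, not necessarily the environment’s exact sampling code):

import numpy as np

def sample_episode_start(train_start: list[int],
                         train_end: list[int],
                         episode_max_len: int,
                         lookback_window_len: int,
                         rng: np.random.Generator) -> int:
    """Pick a random training interval, then a random start index that leaves room
    for the lookback window before it and a full episode after it."""
    i = rng.integers(len(train_start))
    low = train_start[i] + lookback_window_len
    high = train_end[i] - episode_max_len
    return int(rng.integers(low, high))

rng = np.random.default_rng(7)
start = sample_episode_start(
    train_start=[2000, 7000, 12000, 17000, 22000],
    train_end=[6000, 11000, 16000, 21000, 26000],
    episode_max_len=168 * 2,
    lookback_window_len=168,
    rng=rng,
)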

Code for the step method of LearningCryptoEnv class:

def step(self, action: int):
    assert action in [0, 1, 2, 3], action

    self.price_bid = self.price_array[self.time_absolute, 0] * (1 - self.open_fee)
    self.price_ask = self.price_array[self.time_absolute, 0] * (1 + self.open_fee)

    margin_short_start = self.margin_short
    margin_long_start = self.margin_long

    self.reward_realized_pnl_short = 0.
    self.reward_realized_pnl_long = 0.

    # One-Way mode actions
    if action == 0:  # do nothing
        self.reward_realized_pnl_long = 0.
        self.reward_realized_pnl_short = 0.

    # similar to the "BUY" button
    if action == 1:  # open/increase long position by self.order_size
        if self.coins_long >= 0:
            if self.available_balance > self.order_size:
                buy_num_coins = self.order_size / self.price_ask
                self.average_price_long = (self.position_value_long + buy_num_coins * self.price_ask) / (self.coins_long + buy_num_coins)
                self.initial_margin_long += buy_num_coins * self.price_ask / self.leverage
                self.coins_long += buy_num_coins

        if -self.coins_short > 0:  # close/decrease short position by self.order_size
            buy_num_coins = min(-self.coins_short, self.order_size / self.price_ask)
            # scale the margin by the fraction of short coins remaining (mirrors the long branch below)
            self.initial_margin_short *= min((self.coins_short + buy_num_coins), 0.) / self.coins_short
            self.coins_short = min(self.coins_short + buy_num_coins, 0)  # cannot be positive
            realized_pnl = buy_num_coins * (self.average_price_short - self.price_ask)  # buy_num_coins is positive
            self.wallet_balance += realized_pnl
            self.reward_realized_pnl_short = realized_pnl

    # similar to the "SELL" button
    if action == 2:  # close/reduce long position by self.order_size
        if self.coins_long > 0:
            sell_num_coins = min(self.coins_long, self.order_size / self.price_ask)
            self.initial_margin_long *= max((self.coins_long - sell_num_coins), 0.) / self.coins_long
            self.coins_long = max(self.coins_long - sell_num_coins, 0)  # cannot be negative
            realized_pnl = sell_num_coins * (self.price_bid - self.average_price_long)
            self.wallet_balance += realized_pnl
            self.reward_realized_pnl_long = realized_pnl

        if -self.coins_short >= 0:  # open/increase short position by self.order_size
            if self.available_balance > self.order_size:
                sell_num_coins = self.order_size / self.price_ask
                self.average_price_short = (self.position_value_short + sell_num_coins * self.price_bid) / (-self.coins_short + sell_num_coins)
                self.initial_margin_short += sell_num_coins * self.price_ask / self.leverage
                self.coins_short -= sell_num_coins

    self.liquidation = -self.unrealized_pnl_long - self.unrealized_pnl_short > self.margin_long + self.margin_short
    self.episode_maxstep_achieved = self.time_relative == self.max_step

    # CLOSE the entire position or LIQUIDATION
    if action == 3 or self.liquidation or self.episode_maxstep_achieved:
        # close LONG position
        if self.coins_long > 0:
            sell_num_coins = self.coins_long
            self.initial_margin_long *= max((self.coins_long - sell_num_coins), 0.) / self.coins_long  # becomes zero
            self.coins_long = max(self.coins_long - sell_num_coins, 0)  # becomes zero
            realized_pnl = sell_num_coins * (self.price_bid - self.average_price_long)
            self.wallet_balance += realized_pnl
            self.reward_realized_pnl_long = realized_pnl

        # close SHORT position
        if -self.coins_short > 0:
            buy_num_coins = -self.coins_short
            self.initial_margin_short *= min((self.coins_short + buy_num_coins), 0.) / self.coins_short  # becomes zero
            self.coins_short += buy_num_coins  # becomes zero
            realized_pnl = buy_num_coins * (self.average_price_short - self.price_ask)  # buy_num_coins is positive
            self.wallet_balance += realized_pnl
            self.reward_realized_pnl_short = realized_pnl

    self.margin_short, self.margin_long = self._calculate_margin_isolated()
    self.available_balance = max(self.wallet_balance - self.margin_short - self.margin_long, 0)
    self.unrealized_pnl_short = -self.coins_short * (self.average_price_short - self.price_ask)  # self.coins_short is negative
    self.unrealized_pnl_long = self.coins_long * (self.price_bid - self.average_price_long)  # - self.fee_to_close_long
    next_equity = self.wallet_balance + self.unrealized_pnl_short + self.unrealized_pnl_long

    done = self.episode_maxstep_achieved or self.liquidation  # end of episode or liquidation event

    # reward function
    # normalize rewards to fit the [-10:10] range
    reward = (self.reward_realized_pnl_short + self.reward_realized_pnl_long) / self.initial_balance
    # reward = (next_equity - self.equity) / self.initial_balance  # reward function for equity changes

    self.equity = next_equity

    margin_short_end = self.margin_short
    margin_long_end = self.margin_long

    obs_step = self._get_observation_step(self.time_absolute)
    obs = self.scaler.step(obs_step).flatten()

    self.statistics_recorder.update(
        action=action,
        reward=reward,
        reward_realized_pnl_short=self.reward_realized_pnl_short,
        reward_realized_pnl_long=self.reward_realized_pnl_long,
        unrealized_pnl_short=self.unrealized_pnl_short,
        unrealized_pnl_long=self.unrealized_pnl_long,
        margin_short_start=margin_short_start,
        margin_long_start=margin_long_start,
        margin_short_end=margin_short_end,
        margin_long_end=margin_long_end,
        num_steps=self.time_relative,
        coins_short=self.coins_short,
        coins_long=self.coins_long,
        equity=self.equity,
        wallet_balance=self.wallet_balance,
        average_price_short=self.average_price_short,
        average_price_long=self.average_price_long,
    )

    info = self.statistics_recorder.get()

    self.time_absolute += 1
    self.time_relative += 1

    return obs, reward, done, False, info

I usually invest $1000 per bot. With leverage 2, the maximum position_value is $2000. If order_size is $50 and leverage is 2, each order opens $100 of position_value (reduced by the open_fee) and lowers available_balance by $50. In other words, the strategy opens orders worth 5% of the initial_balance.

Each timestep, the bot can increase or decrease position_value by $100. Assuming a maximum position_value of $2000, the minimum time needed to allocate the whole available_balance is 2000 / 100 * 1 hour = 20 hours, and only if the bot adds to the position for 20 hours in a row. In other words, the agent must be very confident that the price will rise before it keeps adding to a long position hour after hour.
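The same arithmetic, spelled out with the config values (a toy calculation that ignores fees):

initial_capital = 1000   # dollars
leverage = 2
order_size = 50          # dollars of margin per order

max_position_value = initial_capital * leverage                           # $2000
position_value_per_step = order_size * leverage                           # $100 added or removed per hour
hours_to_full_allocation = max_position_value / position_value_per_step   # 20 hours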

reset method:

def reset(self, seed=7, options={}):
    self._reset_env_state()
    state_array, reset_array = self._get_observation_reset()
    scaled_obs_reset = self.scaler.reset(state_array, reset_array).flatten()

    # return scaled_obs_reset
    return scaled_obs_reset, {}

I’ve implemented a few more ideas:

  • More recent timesteps are considered more important and are sampled more frequently during training than older ones.
  • The initial state is not the same every time the reset method is called. With a certain probability, the agent starts an episode with long or short positions already open at particular price levels, so it has to learn a good policy even from unfavorable starting conditions. As training progresses (more training epochs pass), the starting conditions become more and more challenging, making it harder for the agent to collect positive rewards (see the sketch below).
  • Users can define several training and validation intervals inside the dataset at different timesteps, rather than validating only on the last few months of data; this cross-validation provides more meaningful results across different market regimes.
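A minimal sketch of the second idea, the randomized starting position (sample_initial_position and training_progress are hypothetical names; the config’s initial_random_allocated parameter caps the size of the pre-opened position):

import numpy as np

def sample_initial_position(initial_random_allocated: float,
                            training_progress: float,
                            rng: np.random.Generator) -> tuple[str, float]:
    """With some probability, start the episode with a pre-opened position.

    training_progress in [0, 1]: later in training, the pre-opened position
    can be larger, making the starting conditions harder.
    """
    if initial_random_allocated == 0 or rng.random() < 0.5:
        return "none", 0.0                                   # start flat
    side = "long" if rng.random() < 0.5 else "short"
    max_value = initial_random_allocated * training_progress
    return side, float(rng.uniform(0.0, max_value))          # dollars already allocated

rng = np.random.default_rng(0)
side, allocated = sample_initial_position(initial_random_allocated=500.0,
                                          training_progress=0.3,
                                          rng=rng)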

Conclusion

This article described the core components of a deep reinforcement learning environment for cryptocurrency trading, with a focus on perpetual futures contracts. I introduced a trading strategy that utilizes the concept of dollar-cost averaging. The environment uses the Gymnasium library and provides observations, actions, and rewards tailored to the cryptocurrency trading domain.

We explored different aspects including:

  • Trading strategy with long/short order types and position management.
  • Gymnasium environment with its key attributes: action space, observation space, and reward function.

This ends Part 2: Trading strategy. See you in the next part, Part 3: Training.

If you are interested in cooperation, feel free to contact me.

Contacts

My email: alex.kaplenko@sane-ai.dev

My LinkedIn: https://www.linkedin.com/in/alex-sane-ai/

GitHub: https://github.com/xkaple00/deep-reinforcement-learning-for-crypto-trading

Link to support Ukraine: https://war.ukraine.ua/donate/

