Building a trading bot with Deep Reinforcement Learning (DRL)

Zhi Li · Published in DataPebbles · 7 min read · Oct 24, 2023

Quantitative trading involves the use of computer algorithms and programs, based on simple or complex mathematical models, to identify and capitalize on available trading opportunities.

Nowadays, quantitative trading is gradually gaining favour as an emerging investment method. It does not rely on personal intuition or emotion to make investment decisions. Instead, it uses quantitative models built on sound investment principles and experience: it processes vast amounts of data, summarizes market dynamics, and establishes reusable, optimized investment strategies to guide decision-making.

With the development and widespread adoption of machine learning technology, it has also been extensively researched and applied in the field of quantitative trading.

In this article, I will explain the common ways machine learning is used in quantitative trading and walk through, in detail, the process of building a trading bot with Deep Reinforcement Learning.

Supervised Learning vs Reinforcement Learning

Broadly speaking, the application of machine learning in quantitative trading can be divided into two main approaches: supervised learning and reinforcement learning.

Supervised learning methods (e.g. logistic regression, random forests, or LSTMs) can predict future stock prices or whether a stock will rise or fall, based on various kinds of historical data.

Supervised Learning
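
To make this concrete, below is a minimal, purely illustrative sketch of a supervised baseline: predicting the next day's direction from lagged daily returns with logistic regression. It is not part of the trading bot built later in this article; the column name adj_close matches the data used below, while the helper name and the 80/20 chronological split are my own assumptions.

# Illustrative supervised baseline: predict next-day direction from lagged returns.
# Assumes `df` is a daily price DataFrame with an 'adj_close' column.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def direction_baseline(df: pd.DataFrame, n_lags: int = 5) -> float:
    returns = df['adj_close'].pct_change()
    features = pd.DataFrame({f'ret_lag_{i}': returns.shift(i) for i in range(1, n_lags + 1)})
    target = (returns.shift(-1) > 0).astype(int)                    # 1 if tomorrow's return is positive
    data = pd.concat([features, target.rename('up')], axis=1).dropna()

    split = int(len(data) * 0.8)                                    # simple chronological train/test split
    X_train, y_train = data.iloc[:split, :-1], data.iloc[:split, -1]
    X_test, y_test = data.iloc[split:, :-1], data.iloc[split:, -1]

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model.score(X_test, y_test)                              # out-of-sample directional accuracy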

Reinforcement learning is another branch of machine learning, in which an agent learns to interpret its environment and take appropriate actions to maximize the cumulative reward of its decisions. Unlike supervised learning, which predicts future numerical values, reinforcement learning takes input states (such as the opening and closing prices of a given day) and outputs a series of actions (e.g., buy, hold, sell) so as to maximize the final profit.

(Deep) Reinforcement Learning
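
Under the hood this is the standard agent-environment loop: observe a state, choose an action, receive a reward, repeat. The sketch below shows that loop using the Gymnasium API on a toy environment (CartPole), with random actions standing in for a trained policy; it is only meant to illustrate the interaction pattern, not trading.

# Generic agent-environment loop in the Gymnasium API (illustrative toy example).
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()                 # a trained policy would pick the action here
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:                        # episode ended, start a new one
        obs, info = env.reset()
print("Total reward collected:", total_reward)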

Build a Deep Reinforcement Learning bot

Step 1 — Create an OpenAI Gym environment for trading

Like any other deep reinforcement learning problem, creating a reliable environment is the precondition and the key step. Here we are going to use the best-known library for this, OpenAI Gym (now maintained as Gymnasium), to build our stock trading environment. It is widely used for developing and comparing reinforcement learning algorithms, providing a standard API for communication between learning algorithms and environments. Through years of development, Gym's API has become the de facto standard of the field.

The official Gym website provides tutorials for creating environments for various purposes. In addition, since we are going to use stable-baselines3 to implement the deep reinforcement learning algorithms, our environment should also follow the interface expected by stable-baselines3.

# TradingEnv.py
import numpy as np
import gymnasium as gym
from gymnasium import spaces
import pandas as pd
import matplotlib.pyplot as plt

class StockTradingEnv(gym.Env):
    def __init__(self, data, initial_balance=10000, commission_fee=0.01, slippage_cost=0.1):
        super(StockTradingEnv, self).__init__()
        self.data = data
        self.current_step = 0
        self.initial_balance = initial_balance
        self.balance = self.initial_balance
        self.stock_owned = 0
        self.date = data['date']
        self.stock_price_history = data['adj_close']
        self.commission_fee = commission_fee
        self.slippage_cost = slippage_cost

        # (Action, Amount): action[0] > 0 means buy, action[0] < 0 means sell, otherwise hold;
        # action[1] is the fraction of the initial balance to trade.
        self.action_space = spaces.Box(low=np.array([-1, 0]), high=np.array([1, 1]), shape=(2,))
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(1,))

        self.render_df = pd.DataFrame()
        self.done = False
        self.current_portfolio_value = initial_balance

    def reset(self, seed=None):
        self.current_step = 0
        self.balance = self.initial_balance
        self.stock_owned = 0
        self.done = False
        self.current_portfolio_value = self.initial_balance
        return self._get_observation(), {}

    def step(self, action):
        assert self.action_space.contains(action)
        prev_portfolio_value = self.balance if self.current_step == 0 else self.balance + self.stock_owned * self.stock_price_history[self.current_step - 1]
        current_price = self.stock_price_history[self.current_step]
        amount = int(self.initial_balance * action[1] / current_price)

        if action[0] > 0:  # Buy
            # Cap the order at what the current balance can afford, fees and slippage included
            amount = min(amount, int(self.balance / (current_price * (1 + self.commission_fee + self.slippage_cost))))
            if self.balance >= current_price * amount * (1 + self.commission_fee + self.slippage_cost):
                self.stock_owned += amount
                self.balance -= current_price * amount * (1 + self.commission_fee + self.slippage_cost)
        elif action[0] < 0:  # Sell
            amount = min(amount, self.stock_owned)
            if self.stock_owned > 0:
                self.stock_owned -= amount
                self.balance += current_price * amount * (1 - self.commission_fee - self.slippage_cost)

        # Reward: a simple Sharpe-ratio-style signal based on the one-step change in portfolio value
        current_portfolio_value = self.balance + self.stock_owned * current_price
        excess_return = current_portfolio_value - prev_portfolio_value
        risk_free_rate = 0.02  # Example risk-free rate
        std_deviation = np.std(self.stock_price_history[:self.current_step + 1])
        sharpe_ratio = (excess_return - risk_free_rate) / std_deviation if std_deviation != 0 else 0
        reward = sharpe_ratio

        self.render(action, amount, current_portfolio_value)
        obs = self._get_observation()

        self.current_step += 1
        done = self.current_step == len(self.data['adj_close'])
        self.done = done

        info = {}
        return obs, reward, done, False, info

    def _get_observation(self):
        # Cast to float32 so the observation matches the dtype of the observation space
        return np.array([self.stock_price_history[self.current_step]], dtype=np.float32)

    def render(self, action, amount, current_portfolio_value, mode=None):
        current_date = self.date[self.current_step]
        today_action = 'buy' if action[0] > 0 else 'sell'
        current_price = self.stock_price_history[self.current_step]

        if mode == 'human':
            print(f"Step:{self.current_step}, Date: {current_date}, Market Value: {current_portfolio_value:.2f}, Balance: {self.balance:.2f}, Stock Owned: {self.stock_owned}, Stock Price: {current_price:.2f}, Today Action: {today_action}:{amount}")

        # Record every step so render_all() can plot the whole run afterwards
        record = {
            'Date': [current_date], 'market_value': [current_portfolio_value], 'balance': [self.balance], 'stock_owned': [self.stock_owned], 'price': [current_price], 'action': [today_action], 'amount': [amount]
        }
        step_df = pd.DataFrame.from_dict(record)
        self.render_df = pd.concat([self.render_df, step_df], ignore_index=True)

    def render_all(self):
        df = self.render_df.set_index('Date')
        fig, ax = plt.subplots(figsize=(18, 6))
        df.plot(y="market_value", use_index=True, ax=ax, style='--', color='lightgrey')
        df.plot(y="price", use_index=True, ax=ax, secondary_y=True, color='black')

        # Mark executed buys (green triangles) and sells (red triangles) along the price curve
        for idx in df.index.tolist():
            if (df.loc[idx]['action'] == 'buy') & (df.loc[idx]['amount'] > 0):
                plt.plot(idx, df.loc[idx]["price"] - 1, 'g^')
                plt.text(idx, df.loc[idx]["price"] - 3, df.loc[idx]['amount'], c='green', fontsize=8, horizontalalignment='center', verticalalignment='center')
            elif (df.loc[idx]['action'] == 'sell') & (df.loc[idx]['amount'] > 0):
                plt.plot(idx, df.loc[idx]["price"] + 1, 'rv')
                plt.text(idx, df.loc[idx]["price"] + 3, df.loc[idx]['amount'], c='red', fontsize=8, horizontalalignment='center', verticalalignment='center')
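
Before plugging the environment into a learning algorithm, it is worth sanity-checking that it follows the expected interface. The snippet below is a small sketch, assuming a price DataFrame df like the one loaded in Step 2; it runs stable-baselines3's environment checker and a few random steps.

# Sanity-check sketch: verify the custom environment against the Gym/stable-baselines3 interface.
# Assumes `df` is a DataFrame with 'date' and 'adj_close' columns, e.g. loaded as in Step 2.
from stable_baselines3.common.env_checker import check_env
from TradingEnv import StockTradingEnv

env = StockTradingEnv(df, initial_balance=100000, commission_fee=0.0001, slippage_cost=0.005)
check_env(env, warn=True)                              # raises/warns if the API contract is broken

obs, info = env.reset()
for _ in range(5):
    action = env.action_space.sample()                 # random (action type, amount) pair
    obs, reward, done, truncated, info = env.step(action)
    print(obs, round(float(reward), 4), done)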

There are four important concepts to understand:

  • Observation. A strategy network observes various parameters of a stock, such as the closing price, trading volume, technical indicators, etc. For convenience, here we simply use the adjusted closing price as a one-dimensional observation space. (Some of these values can be quite large; the trading amount or volume, for instance, can run into the millions or beyond. For the network to converge during training, the observed state data should be normalized, e.g. to the [-1, 1] range, before being fed into the network; a small normalization sketch follows this list.)
  • Action. We do not yet consider short selling or trading on margin, so there are three types of actions: buy, sell, and hold. An action is defined as a 2-element array: action[0] represents the action type (positive for buy, negative for sell, zero for hold), and action[1] represents the position size as a fraction of the initial balance.
  • Reward. The reward is the result of taking an action, and the design of the reward function is crucial for reinforcement learning. Here we use a Sharpe-ratio-style reward, encouraging the agent to maximize returns while minimizing risk.
  • Render. It records every step and produces plots showing the progress of training and validation.
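
As noted in the Observation point, raw market values should be scaled before being fed into the network. The function below is one minimal way to do that, a rolling z-score clipped to [-1, 1]; the window length and the clipping choice are my own assumptions and are not wired into the environment above.

# Sketch of observation normalization: rolling z-score of the price, clipped to [-1, 1].
import numpy as np
import pandas as pd

def normalize_observation(prices: pd.Series, step: int, window: int = 30) -> np.ndarray:
    history = prices.iloc[max(0, step - window): step + 1]
    mean, std = history.mean(), history.std()
    z = 0.0 if (std == 0 or np.isnan(std)) else (prices.iloc[step] - mean) / std
    return np.clip(np.array([z], dtype=np.float32), -1.0, 1.0)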

Step 2 — Feed the environment with training data and build a DRL agent with stable-baselines3

Once the trading environment is properly defined, we can start feeding it with historical market data. Here the YFinance data source (via pybroker) is used to obtain end-of-day price data for Apple (AAPL) from 1 March 2021 to 1 March 2022 as training data.

The next step is to choose an algorithm. Because the action output is continuous, policy-gradient-based optimization algorithms are chosen. One well-known algorithm in this context is PPO (Proximal Policy Optimization), which OpenAI and many research papers have identified as a preferred algorithm in reinforcement learning. Stable-baselines3 provides a reliable implementation of PPO.

Other well-known DRL algorithms, such as A2C, DDPG, DQN, HER, SAC, and TD3, can be found on the stable-baselines3 website.

# train.py
from TradingEnv import StockTradingEnv
from pybroker import YFinance
import pybroker
pybroker.enable_data_source_cache('yfinance')
import pandas as pd
from stable_baselines3 import PPO

yfinance = YFinance()
df = yfinance.query(['AAPL'], start_date='3/1/2021', end_date='3/1/2022')
df['date'] = pd.to_datetime(df['date']).dt.date
env = StockTradingEnv(df, initial_balance=100000, commission_fee=0.0001, slippage_cost=0.005)

model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=100_000, progress_bar=True)
model.save("ppo_aapl")

The trained model is then saved as “ppo_aapl.zip”.

Step 3 — Validate the model and assess performance

Similar to the training environment, we can use the same approach to build a validation environment. Price data for Apple (AAPL) from 1 March 2022 to 1 March 2023 is used for model validation.

# validate.py
from TradingEnv import StockTradingEnv
from pybroker import YFinance
import pandas as pd
from stable_baselines3 import PPO

yfinance = YFinance()
df = yfinance.query(['AAPL'], start_date='3/1/2022', end_date='3/1/2023')
df['date'] = pd.to_datetime(df['date']).dt.date
env = StockTradingEnv(df, initial_balance=100000, commission_fee=0.0001, slippage_cost=0.005)

model = PPO.load("ppo_aapl", env=env)

vec_env = model.get_env()
obs = vec_env.reset()
for i in range(len(df['adj_close'])):
    action, _state = model.predict(obs)
    obs, reward, done, info = vec_env.step(action)

env.render_all()

Once validation is finished, we can use the render_all function to plot the market value curve and visualize the buy and sell operations alongside the price curve. Further analysis can then be conducted to check whether the model is overfitting or underfitting.

For a more professional analysis of the portfolio performance, you can check quantstats.
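
As a rough sketch of how that could look, the snippet below converts the market value curve recorded in render_df into daily returns and passes them to a few quantstats functions. It assumes the validation run above has completed; the output filename is arbitrary.

# Sketch: portfolio analytics with quantstats, based on the equity curve recorded during validation.
import pandas as pd
import quantstats as qs

equity = env.render_df.set_index('Date')['market_value']
equity.index = pd.to_datetime(equity.index)              # quantstats expects a DatetimeIndex
returns = equity.pct_change().dropna()

print("Sharpe ratio:", qs.stats.sharpe(returns))
print("Max drawdown:", qs.stats.max_drawdown(returns))
qs.reports.html(returns, output='ppo_aapl_report.html')  # full HTML tear sheet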

Conclusion

In this article, we have walked through the process of building a trading bot using a deep reinforcement learning (DRL) algorithm. This includes creating a stock trading environment, feeding it training and validation datasets, and selecting a DRL agent for training. It is worth noting that the process shown here has several limitations:

  • Feature selection. We only used a single feature, the price, as the observation space. In a real trading environment, hundreds or even thousands of features may be used. That is a big topic in its own right, so we do not cover it in depth in this article; a small sketch of a richer observation follows this list.
  • Single-stock agent. We only used data from one stock to train our trading agent. To improve the generalization of the model, multi-stock environments should be created; see FinRL for an example.
  • Simple reward function. We simply used a Sharpe-ratio-based reward, which may not suit all cases. It should be adjusted according to different investment preferences.
  • Lack of explainability and stability. The performance of the trained agent still lacks stability: buy/sell operations can look close to random when validation is run multiple times. Limited explainability is also a general drawback of deep reinforcement learning. For future optimization, several steps could be taken: (1) increase training time and use a GPU; (2) use more training data and features; (3) investigate hyperparameter tuning of the models.
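
As an illustration of the first point, the sketch below assembles a small multi-feature observation (daily return, distance from a 10-day moving average, and a volume z-score). The helper name and the chosen indicators are hypothetical, and it assumes the data frame also exposes a 'volume' column as the pybroker YFinance source does; the observation space shape would need to change accordingly.

# Hypothetical extension: a richer observation vector instead of the single price.
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    feats = pd.DataFrame(index=df.index)
    feats['return_1d'] = df['adj_close'].pct_change()
    feats['sma10_gap'] = df['adj_close'] / df['adj_close'].rolling(10).mean() - 1
    feats['volume_z'] = (df['volume'] - df['volume'].rolling(20).mean()) / df['volume'].rolling(20).std()
    return feats.fillna(0.0)

# Inside _get_observation one could then return
#     self.features.iloc[self.current_step].to_numpy(dtype="float32")
# with observation_space shape (3,) instead of (1,).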

If you have any questions or suggestions about this article, please contact me at zhi.li@datapebbles.com. Stay tuned for future updates.
