Applied RL: Custom Gym environment for multi-stock RL based Algo trading

Akhilesh Gogikar
Jun 6, 2022


OpenAI open-sourced the Gym library for environment development in Python. While it is mainly used for RL research, with many researchers designing better RL algorithms to improve performance on the Atari game environments implemented in it, we will use it to trade multiple stocks in an intraday setting. In the previous article, we discussed how to preprocess the data for our environment.

While there are some good libraries with pre-implemented Gym environments for trading, I could not find one for multi-stock trading that could be used for inference without major memory manipulation on the backend, which makes things unreliable. I thought it a good exercise to implement one for my specific use case.

OpenAI Gym provides a framework for designing new environments in which RL agents learn tasks such as playing games; we will use it to build our trading environment.

Each Gym environment must have the following methods implemented:

import gym
import numpy as np
from gym import spaces

class CustomGymEnv(gym.Env):
    def __init__(self, args):
        # Define all the data that will be stored in the CustomEnv
        super(CustomGymEnv, self).__init__()
        # Spaces (num_stocks, window_size and state_space would come from args):
        # the action space for our agent, which predicts a score in (-1, 1) for each stock as a recommendation
        self.action_space = spaces.Box(low=-1, high=1, shape=(num_stocks,), dtype=np.float32)
        # the observation space, which is (num_stocks, window_size, state_space)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                            shape=(num_stocks, window_size, state_space),
                                            dtype=np.float32)

    def reset(self):
        # Reset the environment variables for a new game cycle
        pass

    def step(self, actions):
        # Execute a single round of trading within the environment
        pass

    def render(self, mode='human'):
        # Render a visualization of the environment to the screen
        pass

For our trading use case, we have to make some changes so that we track the right metrics and process the raw data into a format we can use. We initialize the environment with the following information:

  1. dfs — a list of data frames containing the raw historical data, one for each asset
  2. price_df — a data frame containing the historical closing prices, with one column per stock
  3. initial_amount — the initial amount invested for training
  4. trade_cost — the cost of each trade, currently set to zero
  5. num_features — the number of features per time interval for each asset
  6. num_stocks — the number of assets to be traded
  7. window_size — the number of previous time intervals considered for the next action
  8. frame_bound — the range of indices (start_index, end_index) in the price_df index to be used for training/testing; start_index must be greater than window_size
  9. scalers — the list of scalers used to scale the data for each stock, defined internally if not provided
  10. tech_indicator_list — the list of technical indicators to be used for trading; num_features is updated accordingly if this is provided
  11. reward_scaling — a factor for scaling the rewards/profits from our trading
class MultiStockTradingEnv(gym.Env):
    metadata = {"render.modes": ["human"]}

    def __init__(self, dfs, price_df, initial_amount, trade_cost, num_features,
                 num_stocks, window_size, frame_bound, scalers=None,
                 tech_indicator_list=[], reward_scaling=1e-5):
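
The article shows only the constructor signature above; the sketch below is one plausible way to wire these arguments up and define the action and observation spaces. The attribute names mirror those used by reset, process_data, and step later in the article, but details such as setting _start_tick to window_size and the self.representative default are assumptions. The trading-state attributes (margin, portfolio, reserve) are initialized right after this, as shown below.

import gym
import numpy as np
from gym import spaces
from sklearn.preprocessing import StandardScaler

class MultiStockTradingEnv(gym.Env):
    metadata = {"render.modes": ["human"]}

    def __init__(self, dfs, price_df, initial_amount, trade_cost, num_features,
                 num_stocks, window_size, frame_bound, scalers=None,
                 tech_indicator_list=[], reward_scaling=1e-5):
        super().__init__()
        # Raw data and configuration
        self.dfs = dfs
        self.price_df = price_df
        self.initial_amount = initial_amount
        self.trade_cost = trade_cost
        self.assets = num_stocks
        self.window_size = window_size
        self.frame_bound = frame_bound
        self.tech_indicators = tech_indicator_list
        self.num_features = len(tech_indicator_list) if tech_indicator_list else num_features
        self.scalers = scalers if scalers is not None else [None] * num_stocks
        self.reward_scaling = reward_scaling
        self.representative = None  # optional market-index column; 'SENSEX' is the default in process_data

        # Spaces: one score per asset in (-1, 1); observations are windows of scaled features
        self.action_space = spaces.Box(low=-1, high=1, shape=(num_stocks,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                            shape=(num_stocks, window_size, self.num_features),
                                            dtype=np.float32)

        # The first actionable tick comes after one full window of history
        self._start_tick = window_size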

Since we are creating a trading environment, we also have to keep track of a few more quantities, such as the portfolio value, margin, and reserves:

self.margin = initial_amount
self.portfolio = [0] * num_stocks
self.PortfolioValue = 0
self.reserve = initial_amount

The reset method for our environment is rather simple: we reset all trackers, set the margin to the initial amount, and zero out the portfolio. The reset method must return the next observation, i.e. the first observation of the new training episode.

def reset(self):
    self._done = False
    self._current_tick = self._start_tick
    self._end_tick = len(self.prices) - 1
    self._last_trade_tick = self._current_tick - 1
    self._position = np.zeros(self.assets)
    self._position_history = (self.window_size * [None]) + [self._position]
    self.margin = self.initial_amount
    self.portfolio = [0] * self.assets
    self.PortfolioValue = 0
    self.reserve = self.initial_amount
    self._total_reward = 0.
    self._total_profit = 1.  # unit
    self._first_rendering = True
    self.history = {}
    self.rewards = []   # cumulative reward per step, used by step/render
    self.pvs = []       # portfolio value per step, used by step/render
    return self._get_observation()
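
The reset method returns self._get_observation(), a helper the article does not show. A minimal sketch, assuming self.signal_features has shape (num_stocks, T, num_features) as produced by process_data below and that _current_tick starts at window_size, might look like this:

def _get_observation(self):
    # Return the most recent `window_size` intervals of scaled features for every asset,
    # shaped (num_stocks, window_size, num_features) to match the observation space
    window = self.signal_features[:, (self._current_tick - self.window_size):self._current_tick, :]
    return window.astype(np.float32)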

Before we explain the step method, there is a prerequisite method we must call so that the data is prepared internally for training and inference.

The process_data method must be called on our environment: it slices the data to the window we need based on the frame_bound variable described earlier and scales it with the available scalers. It also defines the end_tick used in the condition that ends each training loop.

def process_data(self):
    signal_features = []
    for i in range(self.assets):
        df = self.dfs[i]
        start = self.frame_bound[0] - self.window_size
        end = self.frame_bound[1]
        if self.scalers[i]:
            # Use the externally provided scaler for this asset
            current_scaler = self.scalers[i]
            signal_features_i = current_scaler.transform(df.loc[:, self.tech_indicators])[start:end]
        else:
            # Fit a new scaler on this asset's features and keep it for inference
            current_scaler = StandardScaler()
            signal_features_i = current_scaler.fit_transform(df.loc[:, self.tech_indicators])[start:end]
            self.scalers[i] = current_scaler
        signal_features.append(signal_features_i)
    self.prices = self.price_df.loc[:, :].to_numpy()[start:end]
    # Keep a market-index series for comparison in the rendered plots
    if self.representative:
        self.representative = self.price_df.loc[:, self.representative].to_numpy()[start:end]
    else:
        self.representative = self.price_df.loc[:, 'SENSEX'].to_numpy()[start:end]
    self.signal_features = np.array(signal_features)
    self._end_tick = len(self.prices) - 1
    return self.prices, self.signal_features
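
With process_data in place, the environment can be instantiated and prepared. The constructor arguments below are hypothetical placeholders for whatever the preprocessing from the previous article produced; note how a test environment can reuse the scalers fitted by the training environment:

# dfs, price_df and tech_indicators come from the preprocessing step (placeholder names)
train_env = MultiStockTradingEnv(dfs, price_df, initial_amount=1_000_000, trade_cost=0,
                                 num_features=len(tech_indicators), num_stocks=len(dfs),
                                 window_size=30, frame_bound=(60, 4000),
                                 tech_indicator_list=tech_indicators)
train_env.process_data()   # fits one StandardScaler per asset

test_env = MultiStockTradingEnv(dfs, price_df, initial_amount=1_000_000, trade_cost=0,
                                num_features=len(tech_indicators), num_stocks=len(dfs),
                                window_size=30, frame_bound=(4000, len(price_df)),
                                scalers=train_env.scalers,   # reuse the fitted scalers
                                tech_indicator_list=tech_indicators)
test_env.process_data()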

Finally, we define the step method, the most important method in our environment. We set the done flag to False and, since we are now taking the action for the next time step, increment current_tick.

def step(self, actions):
    self._done = False
    self._current_tick += 1

The done flag is set if we are at the end of the cycle; this lets the agent end trading and the RL model make its updates.

if self._current_tick == self._end_tick:
    self._done = True

We get the current prices, and since we will divide the available amount by them, we also create an array called current_prices_for_division.

# Get the current prices
current_prices = self.prices[self._current_tick]
# Handle cases where the current price is NaN and avoid buying infinite zero-cost stocks
current_prices[np.isnan(current_prices)] = 0
current_prices_for_division = current_prices.copy()  # copy so the 1e9 placeholder does not alter current_prices
current_prices_for_division[current_prices_for_division == 0] = 1e9

Then we process the actions suggested by the agent. The agent provides one score per asset, and we trade only the 33% of assets whose scores have the largest magnitude.

# The absolute-value distribution of the next-step portfolio
abs_portfolio_dist = abs(actions)
# At any point in time we only trade the 33% of stocks the model is most confident about;
# the scores for the rest are suppressed
N = int(np.round(abs_portfolio_dist.size * 0.66))
abs_portfolio_dist[np.argpartition(abs_portfolio_dist, kth=N)[:N]] = 0
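
As a quick sanity check of the suppression logic, here is a small illustration with six hypothetical scores; N = round(6 * 0.66) = 4, so the four smallest-magnitude scores are zeroed and only the top two assets are traded:

import numpy as np

actions = np.array([0.9, -0.1, 0.05, -0.8, 0.3, -0.2])
abs_portfolio_dist = abs(actions)
N = int(np.round(abs_portfolio_dist.size * 0.66))
abs_portfolio_dist[np.argpartition(abs_portfolio_dist, kth=N)[:N]] = 0
print(abs_portfolio_dist)   # [0.9 0.  0.  0.8 0.  0. ]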

Next, we update the margin available for trading; this accounts for price changes that may have occurred since the last trading interval.

self.margin = self.reserve + sum(self.portfolio*current_prices)

With the updated margin, we then calculate the actions required to move to the new portfolio.

# Normalize the portfolio positions for the next step
norm_margin_pos = (abs_portfolio_dist / sum(abs_portfolio_dist)) * self.margin
# Calculate the money value of the next positions
next_positions = np.sign(actions) * norm_margin_pos
# Change in the money value of the positions
change_in_positions = next_positions - self._position
# Actions to take in the market (number of shares to buy/sell per asset)
actions_in_market = np.divide(change_in_positions, current_prices_for_division).astype(int)
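
To make the arithmetic concrete, here is a small hypothetical two-asset example with invented numbers:

import numpy as np

margin = 100_000.0
actions = np.array([0.8, -0.4])                            # scores after suppression
abs_dist = np.abs(actions)
norm_margin_pos = (abs_dist / abs_dist.sum()) * margin     # [66666.67, 33333.33]
next_positions = np.sign(actions) * norm_margin_pos        # [+66666.67, -33333.33]
position = np.array([20_000.0, 0.0])                       # money currently allocated
change_in_positions = next_positions - position            # [46666.67, -33333.33]
prices = np.array([500.0, 250.0])
shares_to_trade = np.divide(change_in_positions, prices).astype(int)
print(shares_to_trade)                                     # [  93 -133]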

Now we can update our internal representation of the portfolio; in a production-grade setup, an API call could be fired off at this stage to place the trades. We also update the PortfolioValue, margin, reserve, and the cost of trading here.

new_portfolio = actions_in_market + self.portfolio
new_pv = sum(new_portfolio * current_prices)
new_reserve = self.margin - new_pv
profit = (new_pv + new_reserve) - (self.PortfolioValue + self.reserve)
# Calculate the cost of the actions taken in the market
cost = self.trade_cost * sum(abs(np.sign(actions_in_market)))

Now we update the variables in the environment:

self._position = next_positions
self.portfolio = new_portfolio
self.PortfolioValue = new_pv
self.reserve = new_reserve - cost

Finally, we compute the reward received for the action just taken and, after updating some history trackers, return it.

# Calculate the total step reward: the profit made this step minus trading costs
step_reward = profit - cost
self._total_reward += self.reward_scaling * step_reward
self.rewards.append(self._total_reward)
self.pvs.append(new_pv)
self._update_profit()
self._position_history.append(self._position)

observation = self._get_observation()
info = dict(
    total_reward=self._total_reward,
    total_profit=self._total_profit,
)
self._update_history(info)

# End the episode if the margin is exhausted
if self.margin < 0:
    self._done = True

return observation, step_reward, self._done, info
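
The step method also calls two small helpers, _update_profit and _update_history, that the article does not show. Plausible sketches (the exact bookkeeping is an assumption) could be:

def _update_profit(self):
    # Track total profit as the current net worth relative to the initial amount
    self._total_profit = (self.PortfolioValue + self.reserve) / self.initial_amount

def _update_history(self, info):
    # Append every info entry to a running history dictionary
    if not self.history:
        self.history = {key: [] for key in info.keys()}
    for key, value in info.items():
        self.history[key].append(value)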

To wrap up, we also define the render method, which in our case plots the portfolio values in the environment with Matplotlib.

def render(self, mode='human'):
    if self._first_rendering:
        self._first_rendering = False
    plt.cla()
    plt.plot(self.pvs)
    plt.suptitle(
        "Total Reward: %.6f" % self._total_reward + ' ~ ' +
        "Total Profit: %.6f" % self._total_profit
    )
    plt.pause(0.01)
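
As a quick smoke test, we can step through one episode with random actions and render the resulting portfolio-value curve, using the hypothetical train_env from the earlier sketch:

import matplotlib.pyplot as plt

observation = train_env.reset()
done = False
while not done:
    action = train_env.action_space.sample()            # random scores in (-1, 1)
    observation, reward, done, info = train_env.step(action)

train_env.render()
plt.show()
print(info)   # {'total_reward': ..., 'total_profit': ...}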

And that’s it; we have finished defining a custom environment for our trading task. When applying RL to a problem, defining the environment is the most challenging task because of the many design decisions involved. The Gym library makes our life easier by bringing structure to this process, but it still requires domain expertise to understand which data points to track and how to configure the actions.

We can train and run inference on the environment with standard policies and RL algorithms from most RL libraries that support the Gym environment API. Here is the rendering result, with the market index value plotted alongside, using a standard MLP neural-network policy.
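
For example, with a Gym-compatible release of stable-baselines3 (a sketch, not the exact training setup behind the plot above), training a PPO agent with the default MLP policy could look like the following; SB3's MlpPolicy flattens the (num_stocks, window_size, num_features) observation internally:

from stable_baselines3 import PPO

model = PPO("MlpPolicy", train_env, verbose=1)
model.learn(total_timesteps=100_000)

# Evaluate on the held-out frame_bound range
obs = test_env.reset()
done = False
while not done:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = test_env.step(action)
test_env.render()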

Another important part of designing an environment is how the agent is rewarded; in our case the reward is simply the per-step profit, which is pretty neat. Reward design is a very important part of building a successful RL agent.

Thanks for reading! As a disclaimer, no part of this work should be construed as investment advice. Most algorithmic trading systems, even the best of them, lose money when deployed in the market.

The code for this tutorial can be found in my GitHub repo. If you like the work so far, please consider starring the repo (I will continue to develop it based on the response) and leave a comment below. Next, we can move on to implementing a customized policy for our new trading environment.
