Create a custom gym environment for trading — Bitcoin Binance trading

Mathieu Cesbron
5 min read · Dec 13, 2019


OpenAI’s gym is by far the best package for creating a custom reinforcement learning environment. It comes with some pre-built environments, but it also allows us to create complex custom ones. An environment contains all the functionality needed to run an agent and let it learn.

Our goal is to recreate Binance with the same fees and the same data, and let our agent learn from them. If you do not have a Binance account, you can create one by clicking here. A common mistake is to build environments that do not match reality and are too easy for our agents to learn; that is what we will try to avoid here.

It’s really hard to make money on the stock market or the cryptocurrency market. If our bot makes +10% per month, it’s probably not sustainable, or just pure luck. Being able to trade and keep our money while paying the fees is already really impressive, so let’s do our best!

All of the code for this article is available on my GitHub.

The custom environment

First, let’s import what we will need for our environment; we will explain each piece afterwards:

import matplotlib.pyplot as plt
import numpy as np
import gym
import random

from gym import spaces

import static

A custom environment is a class that inherits from gym.Env.
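Before diving into the real environment, here is a minimal, illustrative skeleton (not the final code) showing the methods gym expects us to provide: reset(), step() and render().

import gym
import numpy as np
from gym import spaces


class MinimalEnv(gym.Env):
    # A bare-bones custom environment: one dummy observation, one continuous action
    def __init__(self):
        self.action_space = spaces.Box(low=-1, high=1, shape=(1,), dtype=np.float16)
        self.observation_space = spaces.Box(low=0, high=1, shape=(1,), dtype=np.float16)

    def reset(self):
        # Return the first observation of a new episode
        return np.array([0.5], dtype=np.float16)

    def step(self, action):
        # Return (observation, reward, done, info)
        return np.array([0.5], dtype=np.float16), 0.0, False, {}

    def render(self, mode='human'):
        pass

Our CryptoEnv below fills in these methods one by one, starting with __init__: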

class CryptoEnv(gym.Env):
    def __init__(self, df, title=None):
        self.df = df
        self.reward_range = (-static.MAX_ACCOUNT_BALANCE,
                             static.MAX_ACCOUNT_BALANCE)
        self.total_fees = 0
        self.total_volume_traded = 0
        self.crypto_held = 0
        self.bnb_usdt_held = static.BNBUSDTHELD
        self.bnb_usdt_held_start = static.BNBUSDTHELD
        self.episode = 1

        # Graphs to render
        self.graph_reward = []
        self.graph_profit = []
        self.graph_benchmark = []

        # Action space from -1 to 1: -1 is sell everything, 1 is buy with everything
        self.action_space = spaces.Box(low=-1,
                                       high=1,
                                       shape=(1, ),
                                       dtype=np.float16)

        # Observation space: market data of the last 5 timesteps plus account information
        self.observation_space = spaces.Box(low=0,
                                            high=1,
                                            shape=(10, 5),
                                            dtype=np.float16)

df : The dataframe that we created in the previous article

reward_range : Not really useful, but required by gym. We make it span two huge numbers, -10 million and +10 million (static is the file where we store all of our constants; see the sketch after this list).

total_fees : Keep track of the total fees paid

total_volume_traded : Keep track of the total trading volume

crypto_held : Keep track of the crypto held (Bitcoin in our case)

bnb_usdt_held, bnb_usdt_held_start : Track our USDT (we can easily change the pair)

episode : The number of the current episode (starts at 1)

graph_reward, graph_profit, graph_benchmark are used to render the result.

action_space : We set our action space between -1 and 1. 1 means using 100% of our USDT to buy BTC, 0 means doing nothing, -1 means selling all of our BTC for USDT.

observation_space : A box of shape (10, 5): 9 rows of market data for the last 5 timesteps plus 1 row of account information (detailed in _next_observation() below).
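For reference, static.py could look something like this. The constant names come from the code above, but the values below are only illustrative guesses; tune them to your own account and data.

# static.py (illustrative values only)
MAX_ACCOUNT_BALANCE = 10000000  # used for reward_range and to normalise the balance
MAX_CRYPTO = 1000               # used to normalise the crypto held
INITIAL_ACCOUNT_BALANCE = 1000  # starting USDT balance
BNBUSDTHELD = 100               # USDT value of the BNB kept aside to pay the fees
MAKER_FEE = 0.00075             # Binance maker fee with the BNB discount (at the time of writing)
TAKER_FEE = 0.00075             # Binance taker fee with the BNB discount (at the time of writing)
MAX_STEPS = 200                 # maximum number of steps per episode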

The method reset()

def reset(self):
    self.balance = static.INITIAL_ACCOUNT_BALANCE
    self.net_worth = static.INITIAL_ACCOUNT_BALANCE + static.BNBUSDTHELD
    self.max_net_worth = static.INITIAL_ACCOUNT_BALANCE + static.BNBUSDTHELD
    self.total_fees = 0
    self.total_volume_traded = 0
    self.crypto_held = 0
    self.bnb_usdt_held = static.BNBUSDTHELD
    self.episode_reward = 0

    # Set the current step to a random point within the data frame.
    # The weights grow with the index, so recent data is chosen more often
    start = [i + self.df.index[0] for i in
             range(4, len(self.df.loc[:, 'Open'].values) - static.MAX_STEPS)]
    weights = [i for i in start]
    self.current_step = random.choices(start, weights)[0]
    self.start_step = self.current_step

    return self._next_observation()

We start current_step at a random point in our dataframe. But not totally at random: the older the data, the less likely it is to be chosen as the starting point. Makes sense, right? What happened yesterday has more impact than what happened two years ago.
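A quick standalone example (not part of the environment) shows the effect of giving random.choices linearly increasing weights: recent indices are drawn far more often than old ones.

import random
from collections import Counter

start = list(range(4, 1000))      # candidate starting steps, higher index = more recent
weights = [i for i in start]      # linear weights

draws = Counter(random.choices(start, weights, k=100000))
print(sum(draws[i] for i in range(4, 100)))      # oldest steps: rarely picked
print(sum(draws[i] for i in range(900, 1000)))   # most recent steps: picked far more often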

The method _next_observation()

def _next_observation(self):
    # Get the market data for the last 5 timesteps
    frame = np.array([
        self.df.loc[self.current_step - 4:self.current_step, 'Open'],
        self.df.loc[self.current_step - 4:self.current_step, 'High'],
        self.df.loc[self.current_step - 4:self.current_step, 'Low'],
        self.df.loc[self.current_step - 4:self.current_step, 'Close'],
        self.df.loc[self.current_step - 4:self.current_step, 'Volume'],
        self.df.loc[self.current_step - 4:self.current_step, 'Quote asset volume'],
        self.df.loc[self.current_step - 4:self.current_step, 'Number of trades'],
        self.df.loc[self.current_step - 4:self.current_step, 'Taker buy base asset volume'],
        self.df.loc[self.current_step - 4:self.current_step, 'Taker buy quote asset volume']
    ])

    # Append the account information as a last row
    obs = np.append(frame, [[
        self.balance / static.MAX_ACCOUNT_BALANCE,
        self.net_worth / self.max_net_worth,
        self.crypto_held / static.MAX_CRYPTO,
        self.bnb_usdt_held / self.bnb_usdt_held_start,
        0
    ]], axis=0)

    return obs

Here we gather the data we want our agent to know before it decides to buy or sell. We give it the last 5 timesteps, so the agent knows the Open of the last 5 timesteps (it also knows other things like the Close, the Volume, …). We also pass it our current balance, net worth and the amount of crypto held.
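If everything is wired correctly, the observation is a (10, 5) array: 9 rows of market data for the last 5 timesteps plus 1 row of account information. A quick sanity check could look like this (assuming df is the dataframe from the previous article):

env = CryptoEnv(df)
obs = env.reset()

print(obs.shape)  # (10, 5): 9 market-data rows + 1 account-information row
print(obs[-1])    # the normalised account row: balance, net worth, crypto held, USDT held, 0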

The method _take_action()

def _take_action(self, action):
    # Set the current price to a random price between open and close
    current_price = random.uniform(
        self.df.loc[self.current_step, 'Real open'],
        self.df.loc[self.current_step, 'Real close'])

    if action[0] > 0:
        # Buy: spend a fraction of the balance proportional to the action
        crypto_bought = self.balance * action[0] / current_price

        self.bnb_usdt_held -= crypto_bought * current_price * static.MAKER_FEE
        self.total_fees += crypto_bought * current_price * static.MAKER_FEE
        self.total_volume_traded += crypto_bought * current_price

        self.balance -= crypto_bought * current_price
        self.crypto_held += crypto_bought

    if action[0] < 0:
        # Sell: sell a fraction of the crypto held proportional to the action
        crypto_sold = -self.crypto_held * action[0]

        self.bnb_usdt_held -= crypto_sold * current_price * static.TAKER_FEE
        self.total_fees += crypto_sold * current_price * static.TAKER_FEE
        self.total_volume_traded += crypto_sold * current_price

        self.balance += crypto_sold * current_price
        self.crypto_held -= crypto_sold

    self.net_worth = self.balance + self.crypto_held * current_price + self.bnb_usdt_held

    if self.net_worth > self.max_net_worth:
        self.max_net_worth = self.net_worth

This method buys or sells depending on the action taken and calculates the new net_worth.
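As a rough trace of a buy (with made-up numbers and a hypothetical 0.075% maker fee): using 100% of a 1,000 USDT balance moves 1,000 USDT of volume and deducts 0.75 USDT of fees from bnb_usdt_held.

# Made-up numbers, just to trace _take_action on a buy
balance = 1000          # USDT available
action = [1.0]          # use 100% of the balance
current_price = 7200    # assumed BTC price in USDT
MAKER_FEE = 0.00075     # assumed maker fee with the BNB discount

crypto_bought = balance * action[0] / current_price   # ~0.1389 BTC
fee = crypto_bought * current_price * MAKER_FEE       # 0.75 USDT, paid from bnb_usdt_held
volume = crypto_bought * current_price                # 1000 USDT of traded volume
print(crypto_bought, fee, volume)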

The method step()

def step(self, action, end=True):
    # Execute one time step within the environment
    self._take_action(action)
    self.current_step += 1

    # Calculation of the reward: compare our profit to a buy-and-hold benchmark
    profit = self.net_worth - (static.INITIAL_ACCOUNT_BALANCE + static.BNBUSDTHELD)
    profit_percent = profit / (static.INITIAL_ACCOUNT_BALANCE + static.BNBUSDTHELD) * 100
    benchmark_profit = (self.df.loc[self.current_step, 'Real open']
                        / self.df.loc[self.start_step, 'Real open'] - 1) * 100
    diff = profit_percent - benchmark_profit
    reward = np.sign(diff) * (diff)**2

    # A single episode can last a maximum of MAX_STEPS steps
    if self.current_step >= static.MAX_STEPS + self.start_step:
        end = True
    else:
        end = False

    done = self.net_worth <= 0 or self.bnb_usdt_held <= 0 or end

    if done and end:
        self.episode_reward = reward
        self._render_episode()
        self.graph_profit.append(profit_percent)
        self.graph_benchmark.append(benchmark_profit)
        self.graph_reward.append(reward)
        self.episode += 1

    obs = self._next_observation()

    # gym expects step() to return (observation, reward, done, info)
    return obs, reward, done, {}

This method calculates our reward. The choice of the formula is crucial for our bot: the reward depends on the profit we have made (easily understandable), but it also depends on the benchmark profit. It is built that way because it is easy for our agent to make +1% when BTC makes +10%, and it is hard to keep our money when BTC goes -10%. We have to reward the agent when it chooses the best action, not simply when it makes money.
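Two made-up scenarios illustrate the idea: the agent is punished for underperforming a rising market and rewarded for losing less than a falling one.

import numpy as np

def reward(profit_percent, benchmark_profit):
    diff = profit_percent - benchmark_profit
    return np.sign(diff) * diff**2

# BTC pumps +10% but the agent only makes +1%: punished
print(reward(1, 10))    # -81.0

# BTC dumps -10% but the agent only loses 2%: rewarded
print(reward(-2, -10))  # 64.0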

Render the choices of our agent

We have written two render methods: one renders a summary of our balance, crypto held and profit at each step, and one renders at the end of each episode. We also plot a graph for a better visualisation.
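The exact rendering code is on GitHub; as an idea of what the per-step summary could look like, here is a small sketch (not the exact method from the repository):

def render(self, mode='human'):
    # Print a short summary of the account at the current step
    profit = self.net_worth - (static.INITIAL_ACCOUNT_BALANCE + static.BNBUSDTHELD)
    print(f'Step: {self.current_step}')
    print(f'Balance: {self.balance:.2f} USDT')
    print(f'Crypto held: {self.crypto_held:.6f} BTC')
    print(f'Net worth: {self.net_worth:.2f} USDT (profit: {profit:.2f})')
    print(f'Total fees paid: {self.total_fees:.2f} USDT')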

Conclusion

We have built an environment close to the real Binance exchange, and we did not forget the fees, which can be changed in the static.py file. The reward function can be tested and replaced if we find a better one, but this one will do the job for the moment.
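To check that everything runs end to end, we can drive the environment with random actions before plugging in a real agent (assuming df is the dataframe built in the previous article):

env = CryptoEnv(df)

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()         # random actions, just to test the plumbing
    obs, reward, done, info = env.step(action)

print(env.net_worth, env.total_fees)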

In the next article we will learn to train our agent using the environment we created and the data we prepared earlier.
