Dynamic Fee Mechanism Simulation with Reinforcement Learning

Contents

Jeffrey Lim
DECON Simulation

--

Introduction

Simulation Steps 1, 2, 3

Conclusion

Further works & Limitations

Appendix

Exchange fees are known to have a significant impact on traders’ decisions. When fees are low, traders have an incentive to enter trades even when they expect only small profits, and vice versa.

Binance currently charges a 0.1% commission on all transactions. However, 0.1% might not be the optimal fee rate. We wondered what would happen if this fee could change flexibly: when trading activity is low, the commission rate is cut to stimulate trading; conversely, when the market is overheated, the flexible fee mechanism is expected to stabilize trading by raising the commission rate.

To be specific, we want to know how users will react when different fee-setting mechanisms are applied. In the simulation, a machine learning agent is put in place and trained to maximize its profit using reinforcement learning. The goal of this project is to study how the fee mechanism should be set by observing how each agent responds when exposed to different fee policies and analyzing its decisions.

All source code is available at https://github.com/deconlabs/Binanace_trading_simulation. There is also a two-minute description of the project at https://youtu.be/kBjv4KmkEHU.

Introduction

In this simulation, we study how total commission and trading volume differ when an RL agent that has learned to maximize its profits is exposed to different commission mechanisms.

First, agents learn when to buy and sell tokens to maximize profits, based on price and volume at each time step, in an environment without fees. The agents then revise their strategies by training for 500 more episodes in environments with different fee policies.

We then take a deep dive into how agents’ total trading volume and the total fees they pay change across environments.

The simulation consists of three steps, and this post discusses how each process works.

  1. RL agent learning in the basic fee environment
  2. Transfer learning in different fee environments
  3. Observation of the agent’s behavior in each environment

Key findings

  • A simple increase or decrease in a static fee rate does not significantly affect the RL agent’s behavior; the change from 0.003 to 0.005 made little difference.
  • A dynamic fee mechanism has a significant impact on the RL agent’s policy, increasing both total transaction volume and total fee income compared with static fee policies.

Step 1: Learning trading algorithm in basic fee environment

Trading Algorithm

In a reinforcement learning algorithm, the agent learns which action is best rewarded in a given state of the environment. The process is illustrated in the figure below.

Reinforcement Learning Illustration (https://i.stack.imgur.com/eoeSq.png)

In this simulation, the agent is a trader who aims to maximize profits by trading liquid assets, and the environment is the market. When the agent places buy and sell orders, the environment returns the new asset price and volume. The agent observes the current market situation and learns the best behavior.
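As a rough sketch of this loop (assuming a gym-style interface; `agent.act`, `agent.observe`, and the environment class names are placeholders, not the repository’s actual API):

```python
# Minimal sketch of the agent-environment interaction loop, assuming a
# gym-style API. The concrete classes in the repository may differ.

def run_episode(env, agent):
    state = env.reset()          # initial market observation (price/volume window)
    done = False
    total_reward = 0.0
    while not done:
        action = agent.act(state)                            # buy / sell / hold decision
        next_state, reward, done, info = env.step(action)    # market returns new prices and profit
        agent.observe(state, action, reward, next_state, done)  # learning update
        state = next_state
        total_reward += reward
    return total_reward
```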

Reinforcement Learning taxonomy as defined by OpenAI [Source]

RL algorithms are divided into model-free and model-based RL. Having a model means the agent has access to the probability of transitioning from the current state to the next state. In this simulation, we assume the price follows a random walk, so the transition probabilities cannot be known; therefore, we used model-free RL. Among model-free methods, we used the PPO algorithm from the policy-optimization family and DQN from the Q-learning family. More specifically, we used the Rainbow algorithm, which integrates several improvements on DQN. Furthermore, we combined the idea of the Transformer with Rainbow; the Transformer, which replaces RNNs and CNNs with so-called attention heads, has recently been used in a variety of fields.

In summary, the agent was trained using three learning algorithms.

1. PPO (https://arxiv.org/abs/1707.06347)

2. Rainbow (https://arxiv.org/abs/1710.02298)

3. Transformer (https://arxiv.org/abs/1706.03762)

Agent’s learning flow: different trading algorithms use different model structures

Model-free RL is divided into policy-gradient-based methods and value-based methods. PPO is the most popular policy-gradient algorithm, and Rainbow is one of the most popular value-based algorithms.

PPO directly learns the optimal distribution over actions, while DQN-based methods learn how much value each action returns and select the action with the highest expected value.
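The distinction can be sketched roughly as follows (illustrative PyTorch only; policy_net and q_net stand in for whatever networks the repository actually defines):

```python
import torch

def select_action_policy_gradient(policy_net, state):
    # PPO-style: the network outputs a distribution over actions and we sample from it.
    probs = torch.softmax(policy_net(state), dim=-1)
    return torch.distributions.Categorical(probs).sample().item()

def select_action_value_based(q_net, state):
    # DQN/Rainbow-style: the network estimates a value per action and we act greedily.
    q_values = q_net(state)
    return q_values.argmax(dim=-1).item()
```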

Performance comparison in default TradingEnv

The figure above shows that PPO performs worse than Rainbow and Attention. This happens because PPO is an on-policy method: it discards past experience and learns only from new experience collected with the current policy. DQN-based algorithms, on the other hand, learn off-policy: past experiences considered meaningful are stored in a memory buffer and reused for learning. The presence of such a replay memory helps the model fit the data well, which is why the performance differs as shown in the figure.

(Of course, fitting the data this well is overfitting, which is a very important issue when building a real-world trading bot. However, the agent’s rate of return is not itself a significant factor in this simulation: we want to see how fees affect agents, not how to maximize their trading performance. You can improve the trading algorithms to build your own trading bots if you like. It’s up to you!)
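For concreteness, the replay memory mentioned above can be sketched as below (a minimal uniform buffer; the actual Rainbow implementation uses prioritized replay and n-step returns):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store transitions, sample random mini-batches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Keep past experience so it can be reused for learning (off-policy).
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```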

The Attention model outperforms Rainbow even though both are DQN-based models. The reason is that the attention mechanism focuses on the important parts of the input by evaluating the relationships between all periods of the time-series data, which makes it more effective at extracting latent features.
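Conceptually, the attention layer scores every time step of the observation window against every other step, roughly like this simplified single-head scaled dot-product attention (the model in the repository adds multiple heads, learned projections, and positional information):

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Simplified single-head self-attention over a window of candle features.

    x: (seq_len, d_model) tensor, one row per time step of the observation window.
    w_q, w_k, w_v: (d_model, d_k) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(k.shape[-1])   # pairwise relevance between all time steps
    weights = torch.softmax(scores, dim=-1)     # which past candles to focus on
    return weights @ v                          # attention-weighted summary of the window
```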

Agents

Actions

The number of available actions depends on n_action_interval: smaller action indices mean buying and larger ones mean selling. In the code, n_action_interval is set to 5. Actions 0-4 purchase BNB using 20%, 40%, 60%, 80%, and 100% of the BTC owned, respectively, while actions 6-10 sell 20%, 40%, 60%, 80%, and 100% of the BNB held, respectively.
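Under that description, decoding an action index might look like the sketch below; treating the remaining index (5) as a hold, and the exact boundaries, are assumptions to verify against the repository’s environment code:

```python
N_ACTION_INTERVAL = 5  # as set in the code

def decode_action(action: int):
    """Hedged sketch of the action mapping described above."""
    if 0 <= action < N_ACTION_INTERVAL:
        # buy BNB with 20%, 40%, ..., 100% of the BTC balance
        return ("buy", 0.2 * (action + 1))
    if N_ACTION_INTERVAL < action <= 2 * N_ACTION_INTERVAL:
        # sell 20%, 40%, ..., 100% of the BNB held
        return ("sell", 0.2 * (action - N_ACTION_INTERVAL))
    return ("hold", 0.0)  # assumed no-op for the remaining index
```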

Difference between agents

In this simulation, 30 agents are created and trained, and their total fees and total transaction volume are observed. The agents have different risk preferences: some prefer to trade more aggressively and take on risk, while others learn more conservative trading to avoid it. This is implemented by penalizing negative rewards more heavily as risk_aversion_rate increases.
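A minimal sketch of that reward shaping, assuming risk_aversion_rate simply scales losses (the repository may weight or clip it differently):

```python
def shaped_reward(pnl: float, risk_aversion_rate: float) -> float:
    """Penalize losses more heavily for more risk-averse agents.

    Illustrative sketch: negative P&L is multiplied by (1 + risk_aversion_rate),
    so a high risk_aversion_rate makes losing trades hurt more and pushes the
    agent toward conservative trading.
    """
    if pnl < 0:
        return pnl * (1.0 + risk_aversion_rate)
    return pnl
```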

Trading Env

We customized https://github.com/Yvictor/TradingGym. You can create an environment from price data. At every step, the environment passes a window of data of length obs_data_len to the agent; the agent acts on this state and passes its decision back to the environment. This process repeats.

TradingEnv Structure
Making TradingEnv with data
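In effect, the observation the environment hands to the agent at step t is a sliding window of the most recent obs_data_len candles, roughly as in the sketch below (illustrative only, not the TradingGym code itself):

```python
import numpy as np

def make_observation(ohlcv: np.ndarray, t: int, obs_data_len: int) -> np.ndarray:
    """Return the obs_data_len most recent rows of OHLCV data ending at step t.

    ohlcv: array of shape (num_steps, num_features), e.g. open/high/low/close/volume.
    This mirrors how the environment passes a fixed-length window to the agent
    at every step; the actual TradingGym-based code may differ in details.
    """
    start = max(0, t - obs_data_len + 1)
    window = ohlcv[start:t + 1]
    # pad at the front if we are near the beginning of the series
    if len(window) < obs_data_len:
        pad = np.repeat(window[:1], obs_data_len - len(window), axis=0)
        window = np.concatenate([pad, window], axis=0)
    return window
```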

The data is BNB/BTC market data taken from https://api.binance.com/api, in the form of 15-minute OHLC candles as shown below. Since fees mainly affect short-term trading, we used 15-minute OHLC data, a relatively short unit of time.

Data about BNB/BTC. Agents are trained in this environment
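For reference, 15-minute BNB/BTC candles can be pulled from Binance’s public klines endpoint roughly as follows (a minimal sketch; the project’s own data-collection script may differ):

```python
import requests
import pandas as pd

def fetch_bnbbtc_15m(limit: int = 1000) -> pd.DataFrame:
    """Download recent 15-minute BNB/BTC klines from Binance's public API."""
    resp = requests.get(
        "https://api.binance.com/api/v3/klines",
        params={"symbol": "BNBBTC", "interval": "15m", "limit": limit},
        timeout=10,
    )
    resp.raise_for_status()
    cols = ["open_time", "open", "high", "low", "close", "volume",
            "close_time", "quote_volume", "trades",
            "taker_base_volume", "taker_quote_volume", "ignore"]
    df = pd.DataFrame(resp.json(), columns=cols)
    return df[["open_time", "open", "high", "low", "close", "volume"]].astype(float)
```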

The video below visualizes the Rainbow agent trading in the test environment. The yellow box is the range of price data (of length obs_data_len) that the agent observes, a red triangle marks a buy, and a green inverted triangle marks a sell. The red area at the bottom represents the rate of return; as the test progresses, it grows larger.

Step 2: Transfer learning with different fee mechanisms

Now, the agents trained with the TradingEnv and the Rainbow algorithm are trained again in environments with different fee-rate policies. The agents adjust their optimal policies to the new circumstances, resulting in different total fees and total transaction volumes. We expect that even the same agent can make a different decision under the same price state when the fee policies differ.

Fees would change depending on the environment

Trading Env list

  1. No fee
  2. 0.003
  3. 0.005
  4. Bollinger band bound Environment
  5. RSI bound Environment
  6. MACD bound Environment
  7. Stochastic slow bound Environment

Environments 1, 2, and 3 are traditional trading environments with static fees. Environments 4-7, on the other hand, are dynamic TradingEnvs whose fees change according to a technical indicator. (An explanation of each indicator and how the fee varies with it is given in the Appendix.)

The 30 agents are trained for 1,000 episodes in the default fee environment, then moved to each of the other fee environments and trained again for 500 episodes. Since the commission policy differs from the original environment, the agents’ optimal behavior changes.
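The transfer step itself amounts to reloading the stage-1 weights and continuing training, roughly as sketched below (the network class, dimensions, and checkpoint name are placeholders, not the repository’s actual names):

```python
import torch
import torch.nn as nn

# Placeholder network and file name, for illustration only.
class QNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        return self.net(x)

obs_data_len, n_features = 10, 5          # assumed observation window shape
model = QNetwork(obs_dim=obs_data_len * n_features, n_actions=11)

# Stage 1: weights learned in the default fee environment are saved, e.g.
# torch.save(model.state_dict(), "agent_default_fee.pt").
# Stage 2: the same network is reloaded and fine-tuned for 500 episodes
# in each new fee environment.
model.load_state_dict(torch.load("agent_default_fee.pt"))
# ... continue the usual training loop in the new fee environment ...
```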

Step 3: Observation

We observe how much transaction volume the agents trained in Step 2 generate and how much commission they pay during the test.

Total volume & Total Fee

Total Fee and Volume Per Trading Environment

The most notable point is that more transactions occur in the dynamic fee environments than in the static fee environments (0.003, 0.005). In the MACD environment in particular, both transaction volume and commissions were higher than in the static fee environments. This suggests that a dynamic fee policy could further stimulate the market and create a favorable situation for the exchange. The trend is strongest under the MACD policy because its fee changes are the steepest (refer to the fee rate change graph).

Agents’ decisions differ under the same state when fees differ

In addition, agents with the same risk aversion rate made different decisions under the same OHLC states when the fees were different, producing the varying trading volumes shown in the figure above. Transaction volume was highest in the MACD environment, where the fee fluctuation was largest.

Which features the agent focuses on

Let’s see whether the agents pay attention to fees. The figure below visualizes which input features the agent attends to, using the Integrated Gradients method. It shows that fee_rate has a significant effect on the agent’s decision making compared with the other features.

Integrated Gradients: shows which features the agent focuses on
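For reference, the core of Integrated Gradients can be written in a few lines, as in this simplified sketch (attributing the model’s top output to each input feature; the reference linked below gives the full treatment):

```python
import torch

def integrated_gradients(model, x, baseline=None, steps=50):
    """Simplified Integrated Gradients for a single input tensor x.

    Attribution_i ~= (x_i - baseline_i) * average gradient of the output
    w.r.t. x_i along the straight path from baseline to x.
    """
    if baseline is None:
        baseline = torch.zeros_like(x)
    total_grad = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        output = model(point).max()              # attribute the top-scoring output
        grad = torch.autograd.grad(output, point)[0]
        total_grad += grad
    return (x - baseline) * total_grad / steps
```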

Conclusions

In this simulation, we dived deep into what would happen if the existing static fees were changed dynamically based on technical indicators. First, reinforcement learning agents were trained in the default fee environment. Second, we applied transfer learning in environments with different fee mechanisms. Last, we studied which actions these transfer-learned agents chose in each environment.

The experimental results show that, under dynamic fees, both transaction volume and total fee income increased compared with the static fee (0.003, 0.005) environments. This could be a very favorable signal for exchanges: by changing the fee mechanism, an exchange can generate more transactions and collect more fee income.

In this simulation, the fee simply changes according to a technical indicator, but if various other dynamic-pricing methodologies were applied, fee dynamics optimized for both exchanges and users could likely be found. In particular, on a DEX (decentralized exchange), only a handful of popular tokens are actively traded, and if these fee dynamics are studied thoroughly, they may offer a way to significantly turn the current situation around.

Further works & Limitations

  1. Implement LOB market by modifying agent’s action

In the current environment, it is assumed that a buy/sell order is executed immediately. This limitation arises because actions only specify what percentage of the asset to trade. If the actions were extended to limit orders, the simulation would be more realistic. With such an environment, it would also be possible to experiment with fee mechanisms that, for example, change the fee according to the bid/ask spread.

2. Implementing DEX environment

On a DEX, transactions are sparse, whereas our simulation assumes a traditional exchange environment where continuous trading takes place. To improve a DEX, you would need to design the simulation environment to be more DEX-like. Since DEX environments are not fully open yet, it is difficult to implement them accurately. A TradingEnv that resembles a DEX environment would be really helpful for discovering insights to improve DEXs.

3. Multi agent trading

The current simulation does not include a multi-agent mechanism; we assumed a situation where only one agent interacts with the market. In reality this assumption is not unreasonable, because an individual has little effect on the market. However, as the simulation becomes more sophisticated, it may be possible to train multiple agents and have them trade in the same environment.

Appendix

A. How to add your environment

  1. Figure out a function that determines the fee rate, e.g. one based on the indicators in Appendix B.

2. Add your function to ‘fn’. This code snippet is located in ~/envs/trading_env_integrated.py. A hypothetical example is sketched below.
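As an illustration only (the exact signature expected by ‘fn’ is defined in trading_env_integrated.py and may differ), a fee function bounded by RSI could look like this:

```python
def rsi_bound_fee(rsi: float,
                  base_fee: float = 0.003,
                  min_fee: float = 0.001,
                  max_fee: float = 0.005) -> float:
    """Hypothetical example of a fee function you could register in `fn`.

    When RSI signals an overheated market (> 70) the fee is raised toward
    max_fee; when it signals an oversold market (< 30) the fee is lowered
    toward min_fee; otherwise the base fee is charged.
    """
    if rsi > 70:
        return max_fee
    if rsi < 30:
        return min_fee
    return base_fee
```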

B. Dynamic Fee Environment Explanation

  1. MACD (Moving Average Convergence & Divergence)
    - MACD (= short-term moving average minus long-term moving average) shows the trend of the price (a minimal pandas computation is sketched after this list).
    - Link: https://towardsdatascience.com/implementing-macd-in-python-cc9b2280126a

2. RSI (Relative Strength Index)
- The ratio of upward to downward closing-price moves over a certain period, which signals overbought or oversold conditions.
- Link: https://www.investopedia.com/terms/r/rsi.asp#calculation-of-the-rsi

3. Bollinger Bands
- Bollinger Bands are set two standard deviations above and below a simple moving average. The market is considered overbought or oversold when the price moves outside this band.
- Link: https://www.investopedia.com/terms/b/bollingerbands.asp

4. Stochastic Slow
- The position of the closing price relative to the highest and lowest prices over a certain period.
- Link: https://www.fidelity.com/learning-center/trading-investing/technical-analysis/technical-indicator-guide/slow-stochastic
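As noted above, here is a minimal pandas computation of MACD from closing prices; a dynamic-fee environment can then map the histogram (MACD line minus signal line) onto a fee rate:

```python
import pandas as pd

def macd(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9):
    """Standard MACD from closing prices using exponential moving averages.

    macd_line = EMA(fast) - EMA(slow); signal_line = EMA(macd_line, signal).
    """
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    histogram = macd_line - signal_line
    return macd_line, signal_line, histogram
```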

Acknowledgments

Thank you BinanceX (https://www.binance.com/en/blog/372982400902205440/Introducing-Binance-X)

Thank you https://github.com/Yvictor/TradingGym

References

TradingEnv : https://github.com/Yvictor/TradingGym

Transformer : http://nlp.seas.harvard.edu/2018/04/03/attention.html

Integrated Gradient : https://medium.com/@kartikeyabhardwaj98/integrated-gradients-for-deep-neural-networks-c114e3968eae
