Portfolio Allocation: Reinforcement Learning (PPO) model Part I

Julian Chang
6 min read · Dec 9, 2021


Tuning PPO parameters for investing

Portfolio management involves building and overseeing a selection of investments that will meet the long-term financial goals and risk tolerance of an investor. Coincidentally, Reinforcement Learning (RL) tends to mimic this process through the Markov Decision Process and Bellman's equations. And yet the limitation is not in the model but in its environment and inputs. This article describes the Proximal Policy Optimisation (PPO) model parameters used to train an agent to trade eight asset classes and beat an established investment strategy targeted at long-term, multi-generational wealth.

The model described here aims to beat the Dragon portfolio, a multi-asset long-term investment strategy, and was built and backtested against it. Github repository here and a basic applicability review of RL and PPO here.

Libraries

  • FinRL is the first open-source framework to demonstrate the great potential of applying deep reinforcement learning in quantitative finance. We help practitioners establish the development pipeline of trading strategies using deep reinforcement learning (DRL).
  • Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch. It is the next major version of Stable Baselines.

States, Actions, Rewards

what are the model states?

State: Covariance Matrix

The state consists of the covariance matrix of the assets (columns 2–9 in the figure below) and the technical indicators thereafter (columns 10–20).

State: Price Indicators

The Markov Decision Process is the basis of PPO and requires that the states are stationary. However, since price action is subject to long-term cycles and seasonality, raw prices are not. Indicators (a)–(d) below are oscillating indicators, (f) covariance is non-dimensional, and (e), (g) are the basic state-space conditions for a trajectory; a sketch of assembling these features follows the list.

a. Moving Average Convergence Divergence (MACD).
b. Relative Strength Index (RSI)
c. Commodity Channel Index (CCI)
d. Directional Movement Index (DMI)
e. 30 and 60 day Simple Moving Average (SMA)
f. Covariance matrix of the assets
g. Price Close, 30-day SMA and 60-day SMA deltas, i.e. the change in the closing price, 30-day SMA and 60-day SMA from the previous day
h. Month of the state
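To make the state concrete, here is a minimal sketch (not the project's exact FinRL preprocessing pipeline) of assembling one day's state from a table of closing prices: the asset covariance matrix over a lookback window plus the SMA-based features and the month. The DataFrame layout and the 60-day lookback are assumptions for illustration; oscillators such as MACD, RSI, CCI and DMI would normally come from a technical-analysis library (FinRL, for example, uses stockstats) and are omitted here.

```python
import numpy as np
import pandas as pd

def build_state(prices: pd.DataFrame, day: int, lookback: int = 60) -> np.ndarray:
    """Assemble one day's state: asset covariance matrix plus simple indicators.

    `prices` is a (date x asset) DataFrame of closing prices with a DatetimeIndex;
    `day` is an integer index into it. Illustrative sketch only, not FinRL's API.
    """
    window = prices.iloc[day - lookback:day]
    returns = window.pct_change().dropna()

    cov = returns.cov().values                    # covariance matrix of the assets
    sma30 = window.rolling(30).mean().iloc[-1]    # 30-day simple moving average
    sma60 = window.rolling(60).mean().iloc[-1]    # 60-day simple moving average
    close = window.iloc[-1]
    deltas = close - window.iloc[-2]              # change in close vs previous day
    month = window.index[-1].month                # calendar month of the state

    return np.concatenate([
        cov.flatten(),
        (close - sma30).values,                   # close vs 30-day SMA
        (close - sma60).values,                   # close vs 60-day SMA
        deltas.values,
        [month],
    ])
```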

what are the model actions?

Actions

The action space is the portfolio allocation each day. Each trading day, assets are allocated based on the actor network's softmax layer outputs, scaled so that the portfolio proportions always sum to 1.
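A minimal sketch of how raw actor outputs could be mapped to allocations: the softmax makes every weight positive and the weights sum to 1. Variable names are illustrative.

```python
import numpy as np

def softmax_allocation(actor_output: np.ndarray) -> np.ndarray:
    """Map raw actor-network outputs to portfolio weights that sum to 1."""
    z = actor_output - np.max(actor_output)      # stabilise the exponentials
    weights = np.exp(z) / np.sum(np.exp(z))
    return weights

# e.g. raw outputs for 8 asset classes -> allocation proportions
print(softmax_allocation(np.array([0.2, -1.0, 0.5, 0.0, 1.2, -0.3, 0.1, 0.4])))
```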

what are the model rewards?

The objective of the agent is to maximise the total return of an episode. One option is to use the daily return as the reward, but it may be ‘reward rich’, detrimental to long-term rewards, or cause the agent to prematurely settle on a local maximum. For this project the reward is set to the difference between the agent’s daily return and the Dragon portfolio’s daily return. Formula for the daily (step) reward:

reward = (% agent’s daily return) - (% dragon portfolio’s daily return)
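In code, the step reward could be computed as in the sketch below (returns expressed as fractions, with the Dragon portfolio's return for the same day assumed to be precomputed):

```python
def step_reward(agent_value_today: float, agent_value_yesterday: float,
                dragon_return_today: float) -> float:
    """Reward = agent's daily return minus the Dragon portfolio's daily return."""
    agent_return = agent_value_today / agent_value_yesterday - 1.0
    return agent_return - dragon_return_today
```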

PPO Agent

From the PPO paper [3]

Policy Gradient process

  1. Start learn():
    a. Initialise the networks with random weights
    b. Step forward, passing actions to the environment
    c. The agent retrieves the reward for the last actions and computes advantage estimates
    d. Repeat b–c to gather data until the rollout buffer is full (saved in a log object)
  2. When the rollout buffer is full, run train():
    a. Partition the rollout into mini-batches of samples
    b. Forward pass, then partially differentiate the objective function
    c. Backward pass scaled by the learning rate; repeat b–c over mini-batches until the rollout is used up (one epoch)
    d. Repeat for the set number of epochs, then discard the rollout buffer
  3. Repeat steps 1(b)–2 until the required total timesteps are reached

*See [4] parameter_descriptions sheet for PPO parameter descriptions
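A minimal Stable Baselines3 sketch of this loop: PPO gathers n_steps of experience into the rollout buffer, runs n_epochs of mini-batch updates in train(), discards the buffer, and repeats until total_timesteps is reached. The environment name and hyperparameter values below are placeholders, not the ones used in this project.

```python
from stable_baselines3 import PPO

# `portfolio_env` is assumed to be a Gym-style environment exposing the
# states, actions and reward described above.
model = PPO(
    "MlpPolicy",
    portfolio_env,
    n_steps=2048,        # rollout buffer size per update
    batch_size=64,       # mini-batch size used inside train()
    n_epochs=10,         # passes over the rollout before it is discarded
    learning_rate=3e-4,
    clip_range=0.2,      # PPO clip coefficient
    verbose=1,
)
model.learn(total_timesteps=500_000)
```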

Hyperparameter Tuning

Baseline References: The articles from AurelianTactics and unity-ml-agents were used as guides and were very helpful in understanding the ranges of PPO coefficients.

Regularisation: In some OpenAI Gym examples a learning rate schedule is used. However, the clip fraction (the fraction of samples whose probability ratio was clipped by the PPO clip function) decreased exponentially during training. I concluded that the clip fraction acts as another regulariser and should be sufficient to reduce large gradient steps. Therefore, instead of a learning rate schedule, a clip range schedule may be used, although it was not used in this model.
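Stable Baselines3 accepts a callable for learning_rate and clip_range that receives the remaining training progress (1.0 at the start, 0.0 at the end), so a clip range schedule could be wired up as sketched below; the initial value is illustrative.

```python
from typing import Callable

def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Return a function mapping remaining progress (1.0 -> 0.0) to a value."""
    def schedule(progress_remaining: float) -> float:
        return progress_remaining * initial_value
    return schedule

# e.g. decay the PPO clip range linearly over training instead of the learning rate
# model = PPO("MlpPolicy", portfolio_env, clip_range=linear_schedule(0.3))
```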

Distribution of asset allocations

Figures: distributions of the model’s asset allocations at learning rates 0.05 (model 11_5), 0.005 (model 11_1) and 0.0005 (model 11_6), top to bottom.

As an aside: the figures above illustrate three hyperparameter trials, where the learning rate decreases from the top to the bottom graph. The 25th-to-75th-percentile range and the tails of the asset allocations decreased as the learning rate decreased.

This shows how important the learning rate is as a parameter for exploration in on-policy algorithms.

Evaluation: In this case the years 2020 to 2021 are used. Unlike supervised learning, there is no commonly used metric. Although total return is a good gauge of performance, it is not used in the same way as accuracy or entropy scores. For example, the policy that achieves the maximum theoretical return may not be the best generalised policy, because daily price movements are non-deterministic.
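For reference, the total return over the evaluation window can be compounded from daily returns as in the short sketch below (daily returns expressed as fractions):

```python
import numpy as np

def total_return(daily_returns: np.ndarray) -> float:
    """Compound daily returns into a total return for the evaluation period."""
    return float(np.prod(1.0 + daily_returns) - 1.0)
```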

Additionally, the model is stochastic, which assures that every asset has some representation in the portfolio. This feature is useful for exploration and for avoiding overfitting. However, sampling from the action probability density function makes consistent results impossible. Stable Baselines3’s predict() method does have a deterministic mode, which is used for the model[5].
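Evaluation therefore uses predict() with deterministic=True; a minimal evaluation loop might look like the sketch below, assuming the environment follows the classic Gym step API:

```python
obs = portfolio_env.reset()
done = False
while not done:
    # deterministic=True takes the mode of the action distribution,
    # so repeated evaluation runs give consistent allocations
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = portfolio_env.step(action)
```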

Reward: Although not extensively tested, three variations of the stepwise reward were used: (a) portfolio value, (b) daily return, (c) daily return minus the Dragon portfolio’s return. (a) did not do very well, but (b) and (c) are comparable.

Total Timesteps: Since the state and action spaces are continuous (effectively infinite), it is important to have a large batch size and a large number of total timesteps to accommodate that many updates. It should be made clear that RL and neural network algorithms need a long training period to produce meaningful results, which is fair given the lack of a formal model.

Conclusion

A lot of time can be spent trying to tune Reinforcement Learning hyperparameters. If your model is not using an automated search, it is advisable to start from recommended parameters so that some meaningful models can be trained. These articles (AurelianTactics and unity-ml-agents) were a massive help. Otherwise, if you are ever in doubt, training longer is not going to do any harm, especially if there is constant exploration.

About the author: I am currently studying Data Science at General Assembly. I am interested in macro trading and want to learn more about Machine Learning applications in investment strategy. You can reach me at changjulian17@gmail.com or https://www.linkedin.com/in/julian-chang/

Disclaimer: I am not an investment advisor. This is not to be considered financial advice for buying or selling stocks, bonds or dealing in any other securities. Conduct your own due diligence, or consult a licensed financial advisor or broker before making any and all investment decisions.
