Portfolio Allocation: Reinforcement Learning (PPO) model Part II

Julian Chang
7 min read · Dec 9, 2021


Model’s daily Portfolio Allocations

A supervised learning system cannot be said to learn to control its environment because it follows, rather than influences, the instructive information it receives. Instead of trying to make its environment behave in a certain way, it tries to make itself behave as instructed by its environment. [Reinforcement Learning: An Introduction]

Problem Statement

Traditional investors are moving away from 60/40 equities/fixed-income strategies in search of long-term returns, and there are more asset classes to draw on than equities and fixed income alone. The trademarked Dragon portfolio strategy proposes a multi-generational investment built on diversified assets. It integrates popularly traded asset classes, i.e. equities, fixed income, commodities, gold, and long volatility. The main objective of the portfolio is to provide stable wealth accumulation over 100 years.

To understand the Dragon portfolio better, I implemented Modern Portfolio Theory (MPT) in Python to allocate a multi-asset portfolio. It even allows an investor to adjust the portfolio based on risk appetite. However, MPT is an aggregated metric: it uses a fixed timeframe and assumes returns are normally distributed. Ironically, that layers on more risk. There is still room to improve multi-asset trading.

This led me to think: “I can’t memorise, synthesise and weigh the performance of every asset when setting allocations, but a model could”. Combining my interest in portfolio management and data, this project assessed the performance of deep reinforcement learning (RL) in an investment environment. Model performance was compared against the Dragon portfolio, with 4.7% outperformance in simulations from 2020 to 2021. [Github Repository here]

Part I discussed the states, actions, rewards and policy gradient procedure for this model.

Portfolio Allocation Agent

Why PPO over other RL policies?

Choosing an agent that suits the environment is important. The portfolio allocation problem has the following settings (a code sketch follows the list):

a. State space: covariance matrices and indicators are continuous variables
b. Action space: portfolio allocations (weights) are continuous
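
As a rough illustration only (the variable names and asset/indicator counts below are my assumptions, not the project’s code), these two spaces could be declared as continuous Gym Box spaces:

```python
import numpy as np
import gym

n_assets = 5        # e.g. equities, fixed income, commodities, gold, long volatility
n_indicators = 4    # number of indicators per asset (illustrative)

# State: flattened covariance matrix plus per-asset indicators -- all continuous.
observation_space = gym.spaces.Box(
    low=-np.inf, high=np.inf,
    shape=(n_assets * n_assets + n_assets * n_indicators,),
    dtype=np.float32,
)

# Action: one continuous weight per asset; the environment normalises them to sum to 1.
action_space = gym.spaces.Box(low=0.0, high=1.0, shape=(n_assets,), dtype=np.float32)
```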

There are three RL model categories to choose from:

  • model-based: requires building a model of the environment, which is tedious here
  • model-free, off-policy: works best with a discrete action space (a continuous space can be discretised)
  • model-free, on-policy: uses policy gradients and is sample inefficient

Model-free methods are popular, with many resources available, in particular the FinRL library. As for off- versus on-policy: off-policy models would require discretising the continuous action space, a somewhat interventionist step, whereas a flexible on-policy function can develop its own policy over the continuous action space. A greedy approach is taken with the policy gradient. Finally, PPO is chosen for the following advantages: (1) it integrates deep learning concepts (actor-critic networks), (2) it iterates stably, and (3) it works ‘out of the box’.
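
For orientation, here is a minimal sketch of what setting up such an agent can look like with Stable-Baselines3 (the library FinRL builds on); `PortfolioAllocationEnv` and `train_data` are hypothetical stand-ins for the project’s trading environment and data:

```python
from stable_baselines3 import PPO

# Hypothetical Gym environment exposing the continuous state and action spaces above.
env = PortfolioAllocationEnv(train_data)

model = PPO(
    "MlpPolicy",      # actor-critic MLP networks
    env,
    clip_range=0.2,   # PPO clipping ratio (library default)
    ent_coef=0.01,    # entropy coefficient this project settled on (see Training)
    verbose=1,
)
model.learn(total_timesteps=10_240_000)   # training budget reported in the Training section

# Prediction phase: roll the trained policy through the (test) environment.
obs = env.reset()
action, _ = model.predict(obs, deterministic=True)
```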

How does PPO work?

Unlike off-policy algorithms such as Deep Q-learning (DQN), PPO does not keep a replay buffer of all past experiences and their associated rewards. Simply replicating the trades with the greatest short-term returns (based on previous memory) should not be the goal anyway, because it can overfit to historical transactions. In a trading context, returns are non-stationary (due to price discovery), so trades must balance expected return against potential loss. A good trader weighs not only historically similar conditions but also how they map onto the current time horizon and capital.

The PPO categorisation and basic assumptions for PPO and the investment environment are covered here. Succinctly, PPO is an on-policy algorithm, which means it assumes a single strategy that improves throughout training:

  1. When initialised, it sets a random policy
  2. It runs the policy for some episodes to gain experience
  3. It works out which aspects of the policy did very poorly or well, based on a separate value function
  4. It marginally updates the policy based on lessons from the previous episodes

In this way it provides estimates for both the policy (actor network) and the value function (critic network). Step 4 above performs stochastic gradient ascent on the combined objective from the PPO paper [2] (in practice, gradient descent on its negative):

L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t \big[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \big]

The objective function consists of three components (a schematic sketch of the combined loss follows this list):

  1. Value “L_t^{VF}” — critic network
  2. Policy “L_t^{CLIP}” — actor network
  3. Entropy “S” — uncertainty bonus that encourages exploration
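
To make the three terms concrete, below is a schematic of how the combined loss is typically assembled (a PyTorch-style sketch of the textbook formula, not Stable-Baselines3’s internal code); `ratio` is the probability ratio between the new and old policies inside L^CLIP:

```python
import torch

def ppo_loss(ratio, advantage, value_pred, value_target, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Combined PPO objective L^CLIP - c1*L^VF + c2*S, returned as a loss to minimise."""
    # 1. Value-function loss (critic network)
    l_vf = ((value_pred - value_target) ** 2).mean()

    # 2. Clipped surrogate policy objective (actor network)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    l_clip = torch.min(unclipped, clipped).mean()

    # 3. Entropy bonus encourages exploration
    s = entropy.mean()

    # Gradient descent on the negative == gradient ascent on the objective.
    return -(l_clip - c1 * l_vf + c2 * s)
```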

Training

Challenges

  1. Stationarity. As with most modelled time-series problems, stationarity matters because the model is not designed to perform time-series regression. Choosing stationary indicators and a relative price representation is necessary.
  2. Reward. In supervised learning, instructional feedback is provided to the agent, whereas reinforcement learning uses reward and punishment as signals for positive and negative behaviour (over a series of actions).
    - For this project the reward is the agent’s daily return net of the Dragon portfolio’s daily return (a sketch follows this list).
    - Transaction fees are not taken into account, either in the reward or in the computation of total returns.
  3. Exploration vs exploitation. Exploration is very important for a DRL agent because it has no context when it first acts on the environment. The model did not find the current solution until the entropy coefficient was raised to 0.01 (high) and total_timesteps to 10,240,000 (about 7 hours of training). Tensorboard is online here. Only after sufficient exploration can the experience be exploited during the prediction phase.
  4. Overfitting. PPO, like other stochastic policy-gradient methods, is notorious for overfitting. That is the case here: the model produced exceptional returns in the training period but disproportionately lower returns in the test environment. This may be because the chosen learning rate and clipping ratio shrank the iterative gradient steps, fitting the weights to trained states while underfitting unseen dimensions and states. For example, a solution for a single day’s trade in the training set may have to carry a full month of returns in the test set. A possible solution is to run the training for longer.
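
A minimal sketch of the benchmark-relative reward from point 2, assuming daily portfolio values for both the agent and the Dragon benchmark are available (function and argument names are illustrative):

```python
def daily_reward(agent_value, prev_agent_value, dragon_value, prev_dragon_value):
    """Reward = agent's daily return minus the Dragon portfolio's daily return.
    Transaction fees are ignored, matching the simplification described above."""
    agent_return = agent_value / prev_agent_value - 1.0
    dragon_return = dragon_value / prev_dragon_value - 1.0
    return agent_return - dragon_return
```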

Unlike supervised learning, there isn’t a correct answer. The nature of RL is for agents to study the environment extensively and develop an optimal policy. However, returns from asset markets are non-deterministic and irregularly timed. Even if the PPO model had achieved the theoretical maximum return over the test period, that would suggest the model is overfit rather than generalised to the market environment.

Results

The RL model can be compared to a ‘random trader’ agent, which allocates capital to assets at random. The initial PPO policy is random, so if the model consistently outperforms or is more stable than the random agent, we know it is learning from its past experience. However, the model would still be of little use if it were not at least a close substitute for prevailing strategies.
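
The random-trader baseline can be as simple as drawing fresh allocation weights from a Dirichlet distribution every day (an illustrative sketch, not necessarily the project’s exact implementation):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def random_allocation(n_assets: int = 5) -> np.ndarray:
    """Random portfolio weights: non-negative and summing to 1."""
    return rng.dirichlet(np.ones(n_assets))
```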

Summary table of returns (first three rows in percent) from 2020 to August 2021

As it turns out, the trained model was able to beat the Dragon portfolio over the test period from 2020 to August 2021. The PPO model simulated a 24.2% return compared with 19.5%, i.e. 4.7 percentage points of outperformance (2.6% p.a.).

Cumulative return (in percent) from 2020 to August 2021

Conclusion

Policy-based RL has one significant drawback: it converges to local rather than global maxima. In this case, the trained model’s returns were benchmarked to the Dragon portfolio, yet the S&P 500 index fund still outperforms the Dragon portfolio, so there is a theoretically better policy that could beat the S&P 500 index fund. The goal here, however, is to optimise a 100-year multigenerational strategy, and against that benchmark the RL agent outperformed the Dragon portfolio.

Improvements worth attempting

Since PPO relies on stochastic gradient optimisation, more optimisation samples are an advantage. Multiprocessing would make it possible to optimise multiple policies (or environment copies) at once, so that an aggregate can be bootstrapped, and would allow more samples to be processed, hopefully in a shorter time.
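
With Stable-Baselines3, one way to parallelise sample collection is to run several copies of the environment in subprocesses; a sketch, where `make_portfolio_env` is a hypothetical factory returning a fresh trading environment:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

# Eight copies of the trading environment collecting experience in parallel processes.
vec_env = SubprocVecEnv([make_portfolio_env for _ in range(8)])

model = PPO("MlpPolicy", vec_env, ent_coef=0.01, verbose=1)
model.learn(total_timesteps=10_240_000)
```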

The current implementation has a ‘vanishing gradient’ issue in which previous experiences are forgotten, so the model lacks meaningful context about the trade environment on a given day beyond the provided states. This setup is not helpful for short- to medium-term performance: for example, the model will forget a significant event even if it happened only a day or a week ago. Using a recurrent neural network (LSTM) would be more sample efficient and would even permit the model to plan short- to medium-term strategies.
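
One concrete option is the recurrent PPO variant from the sb3-contrib package, which swaps the MLP policy for an LSTM policy so past observations can inform today’s allocation (again a sketch, reusing the hypothetical environment from the earlier examples):

```python
from sb3_contrib import RecurrentPPO

model = RecurrentPPO(
    "MlpLstmPolicy",   # LSTM actor-critic: hidden state carries memory of recent market events
    env,
    ent_coef=0.01,
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
```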

Final remark: I started my Data Science journey in June 2021 and I didn’t think I’d work with anything as complex as reinforcement learning or neural network algorithms. What surprised me is that I can code the infrastructure for a value theory so that a model can shape its own value function: value functions that we as humans follow, struggle to formulate, or choose to ignore.

About the author

Currently studying Data Science at General Assembly. I am interested in macro trading and want to learn more about Machine Learning applications in investment strategy. You can reach out to me at changjulian17@gmail.com or https://www.linkedin.com/in/julian-chang/

Disclaimer: I am not an investment advisor. This is not to be considered financial advice for buying or selling stocks, bonds or any other securities. Conduct your own due diligence, or consult a licensed financial advisor or broker before making any investment decisions.
