Reinforcement Learning (PPO) in an investment environment

Julian Chang
Dec 7, 2021 · 7 min read



Getting foundations right

Can you code an investment strategy? The answer is probably yes, and many quant strategies attest to that. However, I consider myself a long-term investor: I weigh quantitative indicators to identify long-term trends and make strategic trades. While passive index funds have gained popularity recently, scares like the March 2020 crash caused by COVID-19 have shed light on the benefits of active investing. I would like to consider Reinforcement Learning (RL), in particular Proximal Policy Optimisation (PPO), as a possible way to build a long-term portfolio allocator. I built a PPO model that interacts with a trading environment, and there were some promising results. I wanted to see whether a naive model could formulate a long-term investment strategy. This article, though, focuses on the plausibility and manageability of the assumptions behind a PPO agent and trading environment.

Reinforcement learning uses a formal framework defining the interaction between a learning agent and its environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and nondeterminism, and the existence of explicit goals.[1]

Stock exchanges and brokers maintain large public archives of data, researched indicators and quantitative analysis, freely available in a structured format. Algorithms can compute decisions much faster and with accurate memory. However, models should be applied thoughtfully: there have been many cases where models applied without thorough understanding led to poor returns in the long run.

Investing

While the obvious goal of investing is to earn money, the savvy investor will not recklessly throw their money at every opportunity. Factors such as time horizon and risk appetite (the level of potential loss) vary across products and are inherent in assets. With that said, this article discusses portfolio allocation with long-term returns as the goal. This setting lets us focus on price action alone, ignoring other factors such as corporate reports, corporate actions or news. In effect, the model mimics a quantitative trader who trades on price action only.

The model will have less information to trade on, but it will also avoid the emotion that creeps into trades.

RL

… First, there is the emphasis on learning while interacting with an environment, in this case with an opponent player. Second, there is a clear goal, and correct behavior requires planning or foresight that takes into account delayed effects of one’s choices.[1]

RL is a subfield of Machine Learning with two main components: the environment and the agent. The agent receives a state and reward from the environment and returns an action. This sequence iterates until a stop condition is reached. Below is an illustration of one iteration at time t.

The agent-environment interaction in reinforcement learning

Example

As an example, suppose the environment is the weather outside your building and the agent is you.

Then the environment's inputs/outputs are

  • state (output): raining, sunny or snowing
  • action (input): go outside, stay home or read an article
  • reward (output): level of your satisfaction (e.g. 5)

and the agent's inputs/outputs are

  • state (input): raining, sunny or snowing
  • action (output): go outside, stay home or read an article
  • reward (input): level of your satisfaction (e.g. 1)

As an episode of the environment and agent interacting:

  1. The agent observes the state from the environment. The environment is sunny (state).
  2. The agent chooses (action) to go outside. The environment registers that the agent chose (action) to go outside.
  3. The environment tells the agent what level of satisfaction (reward) it has earned; the agent is rewarded by being very satisfied.
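To make this loop concrete, here is a minimal, self-contained sketch of the weather example in Python. The WeatherEnvironment class, its satisfaction scores and the random policy are all made up for illustration; they are not part of any RL library.

```python
import random

class WeatherEnvironment:
    """Toy environment for the weather example (illustrative values only)."""
    STATES = ["raining", "sunny", "snowing"]
    ACTIONS = ["go outside", "stay home", "read an article"]
    # Hypothetical satisfaction (reward) for a few (state, action) pairs.
    SATISFACTION = {
        ("sunny", "go outside"): 5,
        ("raining", "stay home"): 3,
        ("snowing", "read an article"): 4,
    }

    def reset(self):
        self.state = random.choice(self.STATES)   # environment emits a state
        return self.state

    def step(self, action):
        reward = self.SATISFACTION.get((self.state, action), 1)  # default: low satisfaction
        self.state = random.choice(self.STATES)   # the weather changes on its own
        return self.state, reward

def random_agent(state):
    """Stand-in agent: picks any action regardless of the state."""
    return random.choice(WeatherEnvironment.ACTIONS)

env = WeatherEnvironment()
state = env.reset()
for t in range(3):                      # a short episode of three time steps
    action = random_agent(state)        # agent maps state -> action
    state, reward = env.step(action)    # environment returns next state and reward
    print(t, action, reward)
```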

Environment

Maybe you are facing an actual slot machine that changes the color of its display as it changes its action values. Now you can learn a policy associating each task, signaled by the color you see, with the best action to take when facing that task — for instance, if red, play arm 1; if green, play arm 2. With the right policy you can usually do much better than you could in the absence of any information distinguishing one bandit task from another.

… associative search task, so called because it involves both trial-and-error learning in the form of search for the best actions and association of these actions with the situations in which they are best.[1]

Inputs:

A. portfolio weights for a day

Outputs:

B. profit/loss for that day

C. portfolio prices, changes, correlations and price indicators
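As a sketch of what such an environment could look like in code: the class below takes a day's portfolio weights (A) and returns that day's profit/loss (B) along with the next day's features (C). The class name, the pandas price table and the simple feature set are assumptions for illustration, not the exact environment behind the results at the end of this article.

```python
import numpy as np
import pandas as pd

class TradingEnvironment:
    """Toy daily-rebalancing environment: takes portfolio weights (A),
    returns that day's profit/loss (B) and the next day's features (C)."""

    def __init__(self, prices: pd.DataFrame):
        self.prices = prices                            # rows = trading days, columns = assets
        self.returns = prices.pct_change().fillna(0.0)  # daily changes per asset
        self.day = 0

    def reset(self):
        self.day = 0
        return self._features()

    def step(self, weights: np.ndarray):
        self.day += 1
        day_returns = self.returns.iloc[self.day].values
        pnl = float(np.dot(weights, day_returns))       # B: profit/loss for that day
        done = self.day >= len(self.prices) - 1         # stop condition: data exhausted
        return self._features(), pnl, done              # C, B, stop flag

    def _features(self):
        # C: a simple feature set; richer indicators (correlations, MACD, ...) would go here
        return self.returns.iloc[self.day].values
```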

Agent

The concepts of value and value functions are the key features of the reinforcement learning methods that we consider in this book. We take the position that value functions are essential for efficient search in the space of policies. Their use of value functions distinguishes reinforcement learning methods from evolutionary methods that search directly in policy space guided by scalar evaluations of entire policies.[1]

Inputs:

B. profit/loss for that day

C. portfolio prices, changes, correlations and price indicators

Outputs:

A’. portfolio weights for tomorrow
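The agent's interface can be sketched just as simply: it maps the features it receives (B and C) to tomorrow's portfolio weights (A'), which must be non-negative and sum to one. The softmax mapping below is one common way to enforce that constraint; the function and its parameter matrix are illustrative, not the trained policy itself.

```python
import numpy as np

def allocate(features: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Toy agent: map today's features to tomorrow's portfolio weights (A').

    theta is a (n_assets, n_features) parameter matrix standing in for
    whatever a trained policy network would have learned.
    """
    scores = theta @ features                         # one score per asset
    scores -= scores.max()                            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax: weights >= 0, sum to 1
    return weights
```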

PPO

PPO builds on a branch of reinforcement learning called Temporal-Difference (TD) learning. TD learning is a combination of Monte Carlo* ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics.[1]
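For contrast with Monte Carlo, a one-step TD(0) update bootstraps from the value of the next state, so learning can happen at every step instead of waiting for the episode to end. The sketch below is the textbook update rule, with illustrative learning-rate and discount values.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """One TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).

    V is a dict mapping states to value estimates; alpha and gamma are
    illustrative choices of learning rate and discount factor.
    """
    td_target = reward + gamma * V.get(next_state, 0.0)   # bootstrapped target
    td_error = td_target - V.get(state, 0.0)               # how wrong the estimate was
    V[state] = V.get(state, 0.0) + alpha * td_error         # move estimate toward target
    return V
```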

Some terms to categorise PPO:

  1. model-free — no model of the environment’s dynamics is given to the algorithm, so it cannot simulate its own training rollouts; it must learn from experience.[2]
  2. on-policy — the experience used to update the policy is collected by that same policy; only one policy is tracked and followed at a time.
  3. Actor-Critic — two functions are learned, each approximated with a neural network. The actor, or policy network, acts as a “control function”, mapping a state to a likelihood for each action. The critic, or value network, estimates the value of a given state.
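At the heart of PPO is its clipped surrogate objective, which keeps each update close to the policy that collected the data (the on-policy property above) while the critic is trained to predict returns. Below is a minimal sketch of the per-batch losses, assuming log-probabilities, advantages, value estimates and returns have already been computed elsewhere.

```python
import torch

def ppo_losses(new_log_probs, old_log_probs, advantages,
               values, returns, clip_eps=0.2):
    """Clipped surrogate policy loss (actor) and squared-error value loss (critic)."""
    ratio = torch.exp(new_log_probs - old_log_probs)               # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()            # actor objective
    value_loss = (returns - values).pow(2).mean()                  # critic objective
    return policy_loss, value_loss
```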

Why PPO would be applicable

Here is a list of how PPO fits:

Agent (PPO)

  • Policy-gradient methods do well on high-dimensional problems
  • Continuous action space as an output
  • Scales to millions of data points in a reasonable time
  • Built-in optimisation for cumulative long-term reward over short-term gain
  • The Markov property requires a sufficient description of the state to choose a valid action. There is still randomness in the market, as prices float on demand and supply, but PPO uses a stochastic approach that estimates expected returns.

Environment

  • A large dataset of every trading day since 2005 is available
  • The task is continuing rather than episodic, but a converging (discounted) cumulative reward can still be computed, as sketched below
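One standard way to keep the cumulative reward of a continuing task finite is to discount future rewards by a factor gamma < 1, as in the sketch below (the gamma value shown is illustrative).

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative reward: G = r_0 + gamma*r_1 + gamma^2*r_2 + ...

    With gamma < 1 and bounded rewards, the sum converges even for very long
    (effectively continuing) tasks.
    """
    G = 0.0
    for r in reversed(rewards):   # accumulate from the last reward backwards
        G = r + gamma * G
    return G
```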

Why PPO would not be applicable

Here is a list of how PPO doesn’t fit:

Agent (PPO)

  • It may take an intractable amount of time to find a reasonable solution
  • The strategy would only be optimised for the data provided (e.g. daily, monthly price movements), so the solution would not be robust

Environment

  • Historical performance does not assure future returns
  • Does not consider corporate events, trending trades (e.g. Tesla) or news
  • The environment is modelled as a Markov Decision Process, which assumes stationary dynamics. Because of seasonality and inflation, the trading environment is not stationary. A tailored set of indicators (e.g. daily change, MACD) is assumed to make the portfolio-allocation problem sufficiently stationary to solve.
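For illustration, here is one way such indicators might be computed with pandas. The column names and the standard 12/26-day MACD windows are assumptions, not necessarily the exact features used in my environment.

```python
import pandas as pd

def make_features(close: pd.Series) -> pd.DataFrame:
    """Toy stationarity-friendly features from a closing-price series:
    daily percentage change and MACD (12/26-day EMAs)."""
    daily_change = close.pct_change()                    # daily change
    ema_fast = close.ewm(span=12, adjust=False).mean()   # 12-day EMA
    ema_slow = close.ewm(span=26, adjust=False).mean()   # 26-day EMA
    macd = ema_fast - ema_slow                           # MACD line
    return pd.DataFrame({"daily_change": daily_change, "macd": macd})
```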

Conclusion

The RL model is very similar to how an ordinary trader interacts with their online broker. A trader can review the day’s indicators at their choosing and decide how to allocate their resources to maximise returns by the end of the day. The PPO implementation mimics this closely, with the agent as the trader and the environment as the online broker. Ease of access to large amounts of trading data and the stochastic approach to modelling trades make the model and trading environment a great fit.

Here are the results of the AI portfolio allocator I trained with PPO.

Disclaimer: I am not an investment advisor. This is not to be considered financial advice for buying or selling stocks, bonds or dealing in any other securities. Conduct your own due diligence, or consult a licensed financial advisor or broker before making any and all investment decisions.

About the author

Currently studying Data Science at General Assembly. I am interested in macro trading and want to learn more about Machine Learning applications in investment strategy. You can reach out to me at changjulian17@gmail.com or https://www.linkedin.com/in/julian-chang/

References

[1] Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction
[2] Reinforcement Learning algorithms — an intuitive overview

*Monte Carlo means that an entire episode must be completed before a return is assigned to the episode; only then can an adjustment to the value function be calculated. This approach has drawbacks, as it does not assign value to the intermediate steps within the episode.
