Portfolio Optimization using Reinforcement Learning
Experimenting with RL to build an optimal portfolio of 3 stocks and comparing it with portfolio-theory-based approaches
Reinforcement learning is arguably the coolest branch of artificial intelligence. It has already proven its prowess: stunning the world, beating the world champions in games of Chess, Go, and even DotA 2.
Using RL for stock trading has always been a holy grail among data scientists. Stock trading has drawn our imaginations because of its ease of access and to misquote Cardi B, we like diamond and we like dollars 😛.
There are several ways of using Machine Learning for stock trading. One approach is to use forecasting techniques to predict the movement of the stock and build some heuristic based bot that uses the prediction to make decisions. Another approach is to build a bot that can look at the stock movement and directly recommend the actions — buy/sell/hold. This is a perfect use-case for reinforcement learning as we will generally know the accumulated results of our actions only at the end of the trading episode.
I will be formulating this as a portfolio optimization problem:
Given the price histories of 3 different stocks, how should we allocate a fixed amount of money between these stocks every day so that we maximize the likelihood of returns?
The objective is to develop a policy (strategy) for building a portfolio. The portfolio is essentially an allocation of available resources across various stocks. The policy then needs to restructure the portfolio over time as new information becomes available.
Here the policy should be able to pick the optimal portfolio (allocation).
Our solution is to develop a reinforcement learning model — an agent that allocates stocks at every time step by observing the indicators for each stock. We then compare this RL policy with Markowitz’ efficient frontier approach — which along with “gut feel” is perhaps the approach employed by most asset managers.
Quick note on Reinforcement Learning:
Reinforcement Learning deals with designing “agents” that interact with an “environment” and learn by themselves how to “solve” the environment through systematic trial and error. An environment could be a game like chess or racing, or it could even be a task like solving a maze or achieving an objective. The agent is the bot that performs the activity.
An agent receives “rewards” by interacting with the environment. The agent learns to perform the “actions” required to maximize the reward it receives from the environment. An environment is considered solved if the agent's accumulated reward reaches some predefined threshold. This nerd talk is how we teach bots to play superhuman chess or bipedal androids to walk.
We will be designing an agent that uses some strategy to interact with a trading environment to maximize the value of the portfolio. Here, the action is the agent's decision on what portfolio to maintain (e.g. 30% stock A, 30% stock B, 30% stock C, 10% cash). The agent then receives a positive or negative reward for that action (portfolio allocation). The agent iteratively modifies its strategy until it figures out the best action for a given state of the environment.
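To make the idea of "an action is an allocation" concrete, here is a minimal sketch of one way to turn raw agent outputs into valid portfolio weights. The softmax mapping and the name `to_allocation` are my assumptions for illustration; the actual agent may produce allocations differently.

```python
import math

def to_allocation(raw_action):
    """Map raw scores (cash, stock A, stock B, stock C) to weights that sum to 1.

    Softmax is an assumed choice here, used only to guarantee non-negative
    weights that add up to 100% of the portfolio.
    """
    # Subtract the max before exponentiating for numerical stability.
    peak = max(raw_action)
    exps = [math.exp(x - peak) for x in raw_action]
    total = sum(exps)
    return [e / total for e in exps]

# Equal scores for the three stocks and a lower score for cash.
weights = to_allocation([0.0, 1.1, 1.1, 1.1])
```

Whatever mapping is used, the important property is that the weights are non-negative and sum to 1, so every action is a portfolio the environment can actually hold.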
Experiment Set Up:
I designed a custom environment to simulate the actual trading process. The agent can interact with the environment in the following manner:
- The environment provides observations of its current state — indicators for the 3 stocks
- The agent passes an action to the environment. The action is the proposed portfolio allocation — e.g. 10% of total value in cash, 30% in stock 1, 30% in stock 2 and 30% in stock 3
- The environment changes state by one time step and returns the new state and the reward (change in value) associated with the previous portfolio
These three steps repeat until the episode is completed. The sum of the rewards obtained at the end of each step is the total reward. The objective is to maximize the total reward at the end of the episode.
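The observe, act, reward loop described above can be sketched roughly as follows. `ToyTradingEnv` is a hypothetical stand-in, not the class from my repo: it uses raw prices as observations instead of indicators, normalises the starting portfolio value to 1, and ignores transaction costs.

```python
import random

class ToyTradingEnv:
    """Hypothetical, simplified trading environment for illustration only."""

    def __init__(self, prices, episode_length):
        # prices: list of rows, one price per stock per time step.
        self.prices = prices
        self.episode_length = episode_length

    def reset(self):
        self.t = 0
        self.value = 1.0  # portfolio value, normalised to 1 at the start
        return self.prices[self.t]

    def step(self, allocation):
        # allocation = [cash, stock 1, stock 2, stock 3], summing to 1.
        old, new = self.prices[self.t], self.prices[self.t + 1]
        growth = allocation[0]  # cash keeps its value
        for w, p0, p1 in zip(allocation[1:], old, new):
            growth += w * (p1 / p0)
        reward = self.value * (growth - 1.0)  # change in portfolio value
        self.value += reward
        self.t += 1
        done = self.t >= self.episode_length
        return self.prices[self.t], reward, done

# Synthetic random-walk prices, purely to exercise the loop.
random.seed(0)
prices = [[100.0, 50.0, 2.0]]
for _ in range(20):
    prices.append([p * (1 + random.uniform(-0.01, 0.01)) for p in prices[-1]])

env = ToyTradingEnv(prices, episode_length=20)
observation = env.reset()
total_reward, done = 0.0, False
while not done:
    action = [0.1, 0.3, 0.3, 0.3]  # fixed allocation; an agent would choose this
    observation, reward, done = env.step(action)
    total_reward += reward
```

Because each reward is the change in portfolio value, the total reward over an episode equals the final value minus the starting value, which is why maximizing total reward and maximizing final portfolio value are the same objective here.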
The size of an episode is set to 500 time steps, randomly sliced from a dataset of 650,000+ time steps. Every time the environment is initialized, a different section of the full dataset is selected. This prevents the agent from memorizing the environment, since every run is different. Further, the agents are trained and evaluated on different environments: the agent learns a policy from one slice of the data, and the policy is then evaluated on a different slice of the dataset.
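A rough sketch of how such random episode slicing could work is shown below. The function names, the 80/20 chronological split, and the round figure of 650,000 rows are assumptions for illustration; only the 500-step episode length comes from the setup above.

```python
import random

EPISODE_LENGTH = 500  # episode size used in this experiment

def sample_episode_bounds(n_steps, episode_length=EPISODE_LENGTH, rng=random):
    """Pick a random contiguous window [start, end) inside n_steps rows."""
    start = rng.randrange(0, n_steps - episode_length + 1)
    return start, start + episode_length

def split_train_eval(n_steps, train_fraction=0.8):
    """Assumed chronological split, so evaluation windows never overlap training data."""
    cut = int(n_steps * train_fraction)
    return (0, cut), (cut, n_steps)

rng = random.Random(42)
(train_lo, train_hi), (eval_lo, eval_hi) = split_train_eval(650_000)

# A fresh window is drawn on every environment reset.
start, end = sample_episode_bounds(train_hi - train_lo, rng=rng)
train_window = (train_lo + start, train_lo + end)

start, end = sample_episode_bounds(eval_hi - eval_lo, rng=rng)
eval_window = (eval_lo + start, eval_lo + end)
```

Keeping the training and evaluation ranges disjoint is what makes the evaluation a test of the learned policy rather than of the agent's memory.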
We now evaluate the RL algorithm and Markowitz’ model using this setup.
Here we use an off-the-shelf, untuned, lazy implementation of an Actor-Critic model, built with the TF-Agents framework published by the TensorFlow team. Please check out my GitHub repo for the complete code and details of training.
Evaluating over 100 runs of the environment:
Average returns: +20%
Markowitz’ Efficient Frontier
This approach proposes a framework to evaluate the risk and returns of a portfolio.
The return of a portfolio is the mean return per time step we can expect from that portfolio.
Risk is the standard deviation of the daily return. This gives a measure of volatility of the stock.
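The two definitions above translate almost directly into code. This is a minimal sketch on made-up portfolio values; the function names are mine, and I use the sample standard deviation, which is one reasonable choice among several.

```python
import statistics

def step_returns(values):
    """Fractional change between consecutive portfolio values."""
    return [(b - a) / a for a, b in zip(values, values[1:])]

def portfolio_return_and_risk(values):
    returns = step_returns(values)
    mean_return = statistics.mean(returns)  # "return": mean per-step return
    risk = statistics.stdev(returns)        # "risk": std dev of per-step returns
    return mean_return, risk

# Illustrative values only: up 10%, down 10%, up 10%.
values = [100.0, 110.0, 99.0, 108.9]
mean_return, risk = portfolio_return_and_risk(values)
```

Note that this toy portfolio has a positive mean return but a risk several times larger than that return, which is exactly the kind of trade-off the risk-return plot is meant to expose.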
By plotting each portfolio in terms of its risk and return, asset managers can make informed decisions on the investment.
The efficient frontier line shows the portfolios with the highest return for a given level of risk.
For our evaluation, we designed an agent that picks a moderate-risk, high-reward portfolio from an efficient frontier graph, recalculated at every time step from the previous 30 time steps’ performance.
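As a rough illustration of this kind of agent, the sketch below samples random long-only portfolios, scores each on the last 30 steps of returns, and picks the best-returning one among the less risky half. The random sampling (instead of solving for the exact frontier), the median risk cap as the meaning of "moderate risk", and all function names are assumptions of mine, not necessarily what the repo does.

```python
import random
import statistics

WINDOW = 30  # look-back length used in this experiment

def random_weights(n_assets, rng):
    """Sample long-only weights that sum to 1."""
    raw = [rng.random() for _ in range(n_assets)]
    total = sum(raw)
    return [x / total for x in raw]

def evaluate(weights, asset_returns):
    """Mean and std dev of the portfolio's per-step returns over the window."""
    series = [sum(w * r for w, r in zip(weights, row)) for row in asset_returns]
    return statistics.mean(series), statistics.stdev(series)

def pick_portfolio(asset_returns, n_candidates=500, rng=None):
    rng = rng or random.Random(0)
    window = asset_returns[-WINDOW:]
    candidates = [random_weights(len(window[0]), rng) for _ in range(n_candidates)]
    scored = [(evaluate(w, window), w) for w in candidates]
    # Assumed reading of "moderate risk": no riskier than the median candidate.
    risk_cap = statistics.median([risk for (_, risk), _ in scored])
    eligible = [(ret, w) for (ret, risk), w in scored if risk <= risk_cap]
    # Among eligible portfolios, take the one with the highest mean return.
    return max(eligible, key=lambda pair: pair[0])[1]

# Synthetic per-step returns for 3 stocks, purely for demonstration.
rng = random.Random(1)
history = [
    [rng.gauss(0.001, 0.02), rng.gauss(0.0005, 0.01), rng.gauss(0.0, 0.005)]
    for _ in range(60)
]
weights = pick_portfolio(history, rng=rng)
```

Calling `pick_portfolio` again at each time step with the updated history gives a policy that rebalances as the trailing window moves, which mirrors the per-step recalculation described above.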
Average returns: -1%
We can see that the efficient frontier approach does not seem to be effective for the stocks we picked, probably because of their high volatility.
Below is a side-by-side comparison of the 2 policies on the same environment:
- RL grows the portfolio to 160%, while Markowitz’ shrinks it to 96%
- Both algorithms allocate a significant amount to Stock 3, because its value is very low and stable, so a small gain in value can result in a large percentage return without much volatility risk.
- During times of increased volatility, or when all stocks are going down, the RL agent decides to hedge its losses by selling the stocks and increasing the cash in hand: a very smart strategy when we haven’t enabled a short-sell option.
- In general, the RL strategy seems to be to identify bursts of small surges in price and capitalize on that immediately.
We see that RL consistently outperforms Markowitz’ approach in our experiments.
Needless to say, these results are to be considered anecdotal. These experiments were performed with unrealistic assumptions and on a small, carefully selected sample space, which does not properly approximate real-world trading. There are several considerations we have ignored, like the time lags involved in transactions, transaction costs, short selling, hedging losses and many more.
Github Repo: https://github.com/kvsnoufal/portfolio-optimization
Shoulders of giants
- Amir - for guidance and advice
Disclaimer: I am not an investment advisor. This is not to be considered financial advice for buying or selling stocks or bonds, or for dealing in any other securities. Conduct your own due diligence, or consult a licensed financial advisor or broker, before making any and all investment decisions.
About The Author
I work at Dubai Holding, UAE as a data scientist. You can reach out to me at email@example.com or https://www.linkedin.com/in/kvsnoufal/