Deep Reinforcement Learning for Algorithmic Trading

In my previous post, I trained a simple Neural Network to approximate a Bond Price-Yield function. As we saw, given a fairly large data set, a Neural Network can find the underlying statistical relationship between the inputs and the outputs by adjusting the weights and biases in its neurons. In this post, I will go a step further by training an Agent to make automated trading decisions in a simulated stochastic market environment using Deep Q-Learning, a form of Reinforcement Learning in which the agent learns from reward signals rather than from labeled examples. The technique was popularized by DeepMind, a UK firm, whose agent learned to play a diverse range of Atari 2600 games, many of them at a superhuman level. You can watch it play by itself here.

Why use AI for algorithmic trading? The vast majority of algorithmic trading consists of Statistical Arbitrage / Relative Value strategies, which are mostly based on convergence to a mean, where the mean is derived from an arbitrarily chosen sample of historical data. Algorithmic trading primarily has two components: Policy and Mechanism. The policy is chosen by the traders and the mechanism is implemented by the machines. It has always been a huge challenge to pick the right data sample for that universal spread measure through regression. The issue here is that “Statistical Arbitrage is not as much of a Regression problem, as it is a behavioral design problem”, which has been well understood but quite poorly implemented. With all the advancement in Artificial Intelligence and Machine Learning, the next wave of algorithmic trading will have the machines choose both the policy and the mechanism. Using advanced concepts such as Deep Reinforcement Learning and Neural Networks, it is possible to build a trading/portfolio management system that can discover a long-term strategy through training in various stochastic environments.


Based on the investment thesis of the mean reversion of spreads, I will simulate 500 episodes of two mean-reverting stochastic processes and train the agent to run a long/short strategy. Think of it as two instruments (stocks or bonds) belonging to the same industry sector which more or less move together. The agent, i.e. the Neural Net, is the trader who will exploit the aberrations in their behavior due to news, earnings reports, weather or other macro-economic events by going long on the cheaper instrument and short on the expensive one until the spread reverts to its mean. In fact, the Neural Net won’t even know about the mean reversion behavior or whether to do a statistical arbitrage strategy or not. Instead, it will discover this pattern by itself, through trial and error, in its pursuit to maximize the rewards/gains in every episode. Once trained in this environment, the agent should be able to trade any two instruments which have a similar co-integration behavior and volatility range. We can safely assume that the trading volume is small enough to have no impact whatsoever on the market. I would like to re-emphasize the importance of generating unbiased data as opposed to using historical market data, a concept I defined as ‘Smart Data’ in my previous post.


The first and most important part is to design the environment. The environment class should implement the following attributes / methods based on the OpenAI Gym convention:

Init : For initialization of the environment at the beginning of the episode.

State: Holds the price of A and B at any given time = t.

Step: The change in environment after one time step. With each call to this method, the environment returns 4 values described below:

a) next_state: The state as a result of the action performed by the agent. In our case, it will always be the Price of A and B at t = t + 1

b) reward: Gives the reward associated with the action performed by the Agent.

c) done: whether we have reached the end of the episode.

d) info: Contains diagnostic information.

Reset: To reset the environment after every episode of training. In this case, it restores the prices of both A and B to their respective means and simulates a new price path.

It’s good practice to keep the environment code separate from that of the agent. Doing so makes it easier to modify the environment’s behavior and retrain the agent on the fly. I wrote a Python class called market_env to implement its behavior.
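As a rough sketch of what such an environment could look like (this is not the actual market_env class), the snippet below simulates two Ornstein–Uhlenbeck price paths and exposes reset and step in the style described above. The class name `PairMarketEnv`, the mean-reversion speed `kappa`, the time step `dt`, the way volatility is scaled, and in particular the reward definition (the change in the spread, signed by the position) are all assumptions made for illustration.

```python
import numpy as np


class PairMarketEnv:
    """Hypothetical two-asset environment; an illustration, not the post's market_env."""

    def __init__(self, mean=(100.0, 100.0), vol=(0.10, 0.20),
                 kappa=10.0, n_steps=500, seed=None):
        self.mean = np.asarray(mean, dtype=float)  # long-run price of A and B
        self.vol = np.asarray(vol, dtype=float)    # volatility as a fraction of the mean
        self.kappa = kappa                         # assumed mean-reversion speed
        self.n_steps = n_steps                     # steps per episode
        self.dt = 1.0 / n_steps                    # assumed time increment
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restore both prices to their means and start a new episode."""
        self.t = 0
        self.state = self.mean.copy()
        return self.state.copy()

    def step(self, action):
        """action: 0 = long A / short B, 1 = short A / long B, 2 = do nothing."""
        prev = self.state
        # Euler step of dX = kappa * (mean - X) dt + sigma dW
        drift = self.kappa * (self.mean - prev) * self.dt
        shock = self.vol * self.mean * np.sqrt(self.dt) * self.rng.standard_normal(2)
        self.state = prev + drift + shock
        self.t += 1

        # Assumed reward: profit of the long/short position over this step
        d_a, d_b = self.state - prev
        direction = {0: 1.0, 1: -1.0, 2: 0.0}[action]
        reward = float(direction * (d_a - d_b))

        done = self.t >= self.n_steps
        info = {"t": self.t}
        return self.state.copy(), reward, done, info
```

Calling `env = PairMarketEnv(seed=0)` followed by repeated `env.step(action)` calls yields the four values listed above until `done` becomes true after 500 steps.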

A sample path of 500 time steps for the two assets generated by the environment, with A (blue): mean = 100.0, vol = 10% and B (green): mean = 100.0, vol = 20%, using the Ornstein–Uhlenbeck process (plotted using Python/matplotlib), is shown below. As you can see, the two processes cross each other many times, exhibiting a co-integration property, which makes this an ideal ground to train the agent for a long-short strategy.

Image for post


The agent is an MLP (Multi Layer Perceptron) multi-class classifier neural network taking in two inputs from the environment, the prices of A and B, and producing one of three actions: (0) Long A, Short B; (1) Short A, Long B; (2) Do nothing. It chooses the action so as to maximize the overall reward at every step. After every action, it receives the next observation (state) and the reward associated with its previous action. Since the environment is stochastic in nature, the agent operates through an MDP (Markov Decision Process), i.e. the next action is based entirely on the current state and not on the history of prices/states/actions, and it discounts future rewards by a discount factor (gamma). The score is calculated at every step and saved in the Agent’s memory along with the action, the current state and the next state. The cumulative reward per episode is the sum of all the individual scores over the lifetime of an episode and is ultimately what judges the performance of the agent over its training. The complete workflow diagram is shown below:

Image for post
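To make the two reward measures mentioned above concrete, here is a small sketch of the per-episode cumulative reward and of a gamma-discounted return. The function names and the default gamma of 0.95 are assumptions chosen for illustration; the post does not state the value used.

```python
def episode_score(rewards):
    """Cumulative reward of an episode: the plain sum of the step rewards."""
    return sum(rewards)


def discounted_return(rewards, gamma=0.95):
    """Sum of gamma**k * reward_k, accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With gamma below 1, a reward received later in the episode counts for less than the same reward received now, which is how the agent trades off immediate against future gains.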

Why should this approach even work? The spread of two co-integrated processes is stationary, i.e. it has a constant mean and variance over time, and can be thought of as roughly normally distributed. The agent can identify this statistical behavior by buying and selling A and B simultaneously based on their price spread (= Price_A - Price_B). For example, if the spread is negative, it implies that A is cheap and B is expensive, and the agent will figure out that going long A and short B attains the higher reward. The agent will try to approximate this through the Q(s, a) function, where ‘s’ is the state and ‘a’ is an action available in that state; the optimal action is the one with the highest Q value, which maximizes its returns over the lifetime of the episode. The policy for the next action will be determined using the Bellman equation, as described by the equation below:

Image for post
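The equation image did not survive in this copy. The description above matches the standard Q-learning form of the Bellman equation, so under that assumption it reads as follows, where r is the reward for taking action a in state s, s' is the next state, a' ranges over the actions available in s', and γ is the discount factor:

```latex
Q(s, a) = r + \gamma \max_{a'} Q(s', a')
```

As a worked example with made-up numbers: if r = 1, γ = 0.95 and the best Q value in the next state is 2, then Q(s, a) = 1 + 0.95 × 2 = 2.9.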

Through this mechanism, the agent will also come to value long-term prospects over just immediate rewards by assigning different Q values to each action. This is the crux of Reinforcement Learning. Since the state space can be very large, we will use a Deep Neural Network to approximate the Q(s, a) function, trained through backpropagation. Over multiple iterations, the Q(s, a) function will converge toward the optimal action in every state it has explored.

Speaking of the internal details, the agent has two major components:

  1. Memory: It’s a list of events. The Agent stores information here through iterations of exploration and exploitation. Each entry has the format: (state, action, reward, next_state, message)
  2. Brain: This is the Fully Connected, Feed-Forward Neural Net, which trains from the memory, i.e. past experiences. Given the current state as input, it predicts the next optimal action.
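The Memory component described above could be sketched roughly as follows. The class name `ReplayMemory`, the bounded capacity, and uniform random sampling are assumptions for illustration; the tuple layout follows the (state, action, reward, next_state, message) format, and what the message field carries is not specified in this post.

```python
import random
from collections import deque


class ReplayMemory:
    """Hypothetical event store; keeps only the most recent `capacity` events."""

    def __init__(self, capacity=10_000):
        # deque with maxlen silently drops the oldest event when full
        self.events = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state, message):
        """Append one event in the (state, action, reward, next_state, message) format."""
        self.events.append((state, action, reward, next_state, message))

    def sample(self, batch_size):
        """Return up to batch_size events drawn uniformly without replacement."""
        k = min(batch_size, len(self.events))
        return random.sample(list(self.events), k)

    def __len__(self):
        return len(self.events)
```

Sampling at random rather than replaying events in order is a common choice because consecutive steps of the same price path are highly correlated.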

To train the agent, we need to build our Neural Network, which will learn to classify actions based on the inputs it receives. (A simplified image is shown below; of course, the real neural net will be more complicated than this.)

In the above image,

Inputs(2): Price of A and B in green.

Hidden(2 layers): Denoted by ‘H’ nodes in blue.

Output(3): classes of actions in red.

For the implementation, I am using Keras and TensorFlow, both of which are free and open source Python libraries.
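To illustrate the shape of the network described above (2 inputs, 2 hidden layers, 3 outputs), here is a forward pass written in plain NumPy rather than Keras, so that it runs without extra dependencies. The hidden width of 24, the ReLU activation, and the weight initialization are assumptions; the post does not state them. In Keras the same structure would typically be a stack of Dense layers.

```python
import numpy as np


def init_params(n_in=2, n_hidden=24, n_out=3, seed=0):
    """Random weights and zero biases for a 2 -> hidden -> hidden -> 3 network."""
    rng = np.random.default_rng(seed)
    sizes = [n_in, n_hidden, n_hidden, n_out]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def q_values(params, state):
    """Forward pass: one Q value per action for each input state."""
    x = np.atleast_2d(np.asarray(state, dtype=float))
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on the hidden layers only
    return x


params = init_params()
q = q_values(params, [100.0, 101.5])  # prices of A and B
action = int(np.argmax(q[0]))         # index of the action with the highest Q value
```

The output layer is left linear because Q values are unbounded estimates of future reward rather than class probabilities.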

Image for post

The neural net is trained on an arbitrarily chosen sample from its memory at the end of every episode, in real time. After every episode, the network therefore has more data to train on. As a result, the Q(s, a) function should converge with more iterations, and we will see the agent’s performance increase over time until it reaches a saturation point. The returns/rewards are scaled in the image below.
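The training targets for such a sampled batch are usually built from the Bellman equation. The sketch below shows one common way to do it, assuming the network's predictions for the current and next states are already available as arrays. The function name `q_targets`, the gamma value, and the use of an end-of-episode flag are assumptions, not details taken from the code behind this post.

```python
import numpy as np


def q_targets(q_current, q_next, actions, rewards, dones, gamma=0.95):
    """Build regression targets for a batch of remembered events.

    q_current, q_next: arrays of shape (batch, n_actions) predicted by the network
    actions, rewards, dones: one entry per event in the batch
    """
    targets = np.array(q_current, dtype=float, copy=True)
    best_next = np.max(q_next, axis=1)
    not_done = 1.0 - np.asarray(dones, dtype=float)
    # Bellman target: reward plus discounted best future value (zero after the last step)
    updated = np.asarray(rewards, dtype=float) + gamma * best_next * not_done
    # Only the action actually taken is corrected; other outputs keep their predictions
    targets[np.arange(len(actions)), np.asarray(actions)] = updated
    return targets
```

The network is then fitted to map the sampled states to these targets, so that only the Q value of the action actually taken moves toward its Bellman estimate.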

Image for post

In the above graph, you can see 3 different plots, each representing an entire training scenario of 500 episodes with 500 steps per episode. With every step, the agent performs an action and gets its reward. In the beginning, since the agent has no preconception of the consequences of its actions, it takes randomized actions to observe the rewards associated with them. Hence the cumulative reward per episode fluctuates a lot over the first 300 episodes. Beyond 300 episodes, the agent starts learning from its training, and by the 400th episode it has almost converged in each of the training scenarios as it discovers the long-short pattern and starts to fully exploit it.
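The shift from random actions to exploiting what has been learned is commonly implemented with an epsilon-greedy rule, where the probability of a random action decays over time. The sketch below assumes that scheme; the function names, the decay rate and the floor are illustrative values, not the ones used for the plots above.

```python
import random


def choose_action(q_row, epsilon, n_actions=3, rng=random):
    """With probability epsilon explore at random, otherwise take the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_row[a])


def decay(epsilon, rate=0.99, floor=0.01):
    """Shrink epsilon after each episode, never going below the floor."""
    return max(floor, epsilon * rate)
```

Starting with epsilon at 1.0 makes early episodes almost entirely random, which is consistent with the noisy rewards at the start of training, while a small floor keeps a little exploration going later on.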

There are still many challenges, and engineering both the agent and the environment remains part of ongoing research. My aim here was not to show a ‘backtested profitable trading strategy’ but to describe how to apply advanced Machine Learning concepts such as Deep Q-Learning and Neural Networks to the field of Algorithmic Trading. It is an extremely complicated process and pretty hard to explain in a single blog post; however, I have tried my best to simplify things. Check out the dl-algo-trader link for the code.

Furthermore, this approach can be extended to a large portfolio of stocks and bonds, and the agent can be trained under a diverse range of stochastic environments. Additionally, the agent’s behavior can be constrained by various risk parameters such as sizing, hedging etc. One can also have multiple agents training under different suitability criteria given the desired risk/return profiles. These types of approximations can be made more accurate using large data sets and distributed computing power.

Eventually, the question is, can AI do everything? Probably not. Can we effectively train it to do anything? Possibly yes, i.e. with real intelligence, artificial intelligence can surely thrive. Thanks for reading. Please feel free to share your ideas in the comment section below or connect with me on LinkedIn.

Hope you enjoyed the post!


1. Opinions expressed are solely my own and do not express the views or opinions of any of my employers.

2. The information from the Site is based on financial models, and trading signals are generated mathematically. All of the calculations, signals, timing systems, and forecasts are the result of back testing, and are therefore merely hypothetical. Trading signals or forecasts used to produce our results were derived from equations which were developed through hypothetical reasoning based on a variety of factors. Theoretical buy and sell methods were tested against the past to prove the profitability of those methods in the past. Performance generated through back testing has many and possibly serious limitations. We do not claim that the historical performance, signals or forecasts will be indicative of future results. There will be substantial and possibly extreme differences between historical performance and future performance. Past performance is no guarantee of future performance. There is no guarantee that out-of-sample performance will match that of prior in-sample performance.
