Option Pricing Using Reinforcement Learning

Letian Wang · The Startup · Aug 16, 2020

This post demonstrates how to use reinforcement learning to price an American option. An option is a derivative contract that gives its owner the right, but not the obligation, to buy or sell an underlying asset. Unlike its European-style counterpart, an American-style option may be exercised at any time before expiry.

The American option is known to be an optimal control problem, cast as an MDP (Markov Decision Process) in which the underlying price follows a geometric Brownian motion [1]. The Markovian state is a price-time tuple, and the control is a binary action that decides, on each day, whether or not to exercise the option.
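Concretely, for a put with strike K on an underlying price S, the value at time t is that of an optimal stopping problem under the risk-neutral measure (in the notation of [1]):

V(t, S_t) = \sup_{\tau \in [t, T]} \mathbb{E}\left[ e^{-r(\tau - t)} \, (K - S_\tau)^+ \,\middle|\, S_t \right],
\qquad dS_t = r S_t \, dt + \sigma S_t \, dW_t ,

where the supremum runs over stopping times \tau taking values in [t, T]. The reinforcement learning agent below searches for this optimal stopping rule by trial and error rather than by solving the associated free-boundary problem directly.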

The optimal stopping policy looks like the figure below, where the x-axis is time and the y-axis is the stock price. The curve in red is commonly called the optimal exercise boundary. On each day, if the stock price falls in the exercise region, which lies above the boundary for a call or below the boundary for a put, it is optimal to exercise the option and collect the payoff, namely the difference between the stock price and the strike price.

Optimal Exercise Boundary

One can imagine it as a discretized Q-table, as illustrated by the dotted grid. Every day the agent, or the trader, looks up the table and takes an action according to today's price. The Q-table is monotone in that all the cells on one side of the boundary yield an exercise decision and all the cells on the other side yield a hold decision. Therefore Q-learning is well suited to finding the optimal strategy defined by this boundary.
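For reference, tabular Q-learning would update this table with the standard rule from [2], where the state s is the (price, time) pair, the action a is hold or exercise, and the reward is the discounted payoff upon exercise and zero otherwise:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] .

The DQN used later in this post replaces the table with a neural network but keeps the same temporal-difference target.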

The remainder contains three sections. In the first section, a baseline price is computed using classical models. In the second section, an OpenAI gym environment is constructed, similar to building an Atari game. In the third section, an agent is trained with DQN (Deep Q-Network) to play the American option, similar to training computers to play Atari games. The full Python notebook is located here on GitHub.

Section One — Baseline

There are many ways to price an American option, from the binomial tree to the Longstaff-Schwartz Monte Carlo method. Here I use the QuantLib package to price a one-year American put option.
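The snippet below references bsm_process, european_option and american_option, which are created earlier in the notebook. A minimal sketch of that setup, with purely illustrative parameters (they will not reproduce the exact prices printed below), could look like:

import QuantLib as ql

# illustrative parameters only; the actual values live in the notebook
spot, strike, r, sigma = 100.0, 100.0, 0.03, 0.25

today = ql.Date.todaysDate()
ql.Settings.instance().evaluationDate = today
expiry = today + ql.Period(1, ql.Years)
day_count = ql.Actual365Fixed()

# flat risk-free curve, zero dividends, constant volatility
spot_handle = ql.QuoteHandle(ql.SimpleQuote(spot))
rate_curve = ql.YieldTermStructureHandle(ql.FlatForward(today, r, day_count))
dividend_curve = ql.YieldTermStructureHandle(ql.FlatForward(today, 0.0, day_count))
vol_surface = ql.BlackVolTermStructureHandle(
    ql.BlackConstantVol(today, ql.NullCalendar(), sigma, day_count))
bsm_process = ql.BlackScholesMertonProcess(
    spot_handle, dividend_curve, rate_curve, vol_surface)

# the same put payoff with European and American exercise styles
payoff = ql.PlainVanillaPayoff(ql.Option.Put, strike)
european_option = ql.VanillaOption(payoff, ql.EuropeanExercise(expiry))
american_option = ql.VanillaOption(payoff, ql.AmericanExercise(today, expiry))

With these objects in hand, three pricing engines are attached in turn: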

pricing_dict = {}

# European price under Black-Scholes, as a reference point
bsm73 = ql.AnalyticEuropeanEngine(bsm_process)
european_option.setPricingEngine(bsm73)
pricing_dict['BlackScholesEuropean'] = european_option.NPV()

# Barone-Adesi and Whaley analytical approximation for the American put
analytical_engine = ql.BaroneAdesiWhaleyEngine(bsm_process)
american_option.setPricingEngine(analytical_engine)
pricing_dict['BawApproximation'] = american_option.NPV()

# Cox-Ross-Rubinstein binomial tree with 100 steps
binomial_engine = ql.BinomialVanillaEngine(bsm_process, "crr", 100)
american_option.setPricingEngine(binomial_engine)
pricing_dict['BinomialTree'] = american_option.NPV()

print(pricing_dict)
# {'BlackScholesEuropean': 6.92786901829998, 'BawApproximation': 7.091254636695334, 'BinomialTree': 7.090924645858217}

The last line shows the output: this American option is worth $7.091, while its European counterpart is worth $6.928, implying an early exercise premium of $0.163.

Section Two — OpenAI Gym Environment

It is standard to derive from the OpenAI gym environment class. This makes our work extensible to further studies such as exotic options and stochastic volatility. The underlying theory is the famous Black-Scholes framework, in which the underlying asset follows a geometric Brownian motion in the risk-neutral world. This is realized in the step function below:

def step(self, action):
    if action == 1:  # exercise
        # immediate payoff, discounted from the exercise date back to time zero
        reward = max(self.K - self.S1, 0.0) * np.exp(-self.r * self.T * (self.day_step / self.N))
        done = True
    else:  # hold
        if self.day_step == self.N:  # at maturity
            reward = max(self.K - self.S1, 0.0) * np.exp(-self.r * self.T)
            done = True
        else:  # move to tomorrow
            reward = 0
            # lnS1 - lnS0 = (r - 0.5*sigma^2)*t + sigma * Wt
            self.S1 = self.S1 * np.exp((self.r - 0.5 * self.sigma**2) * (self.T/self.N) + self.sigma * np.sqrt(self.T/self.N) * np.random.normal())
            self.day_step += 1
            done = False
    tau = 1.0 - self.day_step/self.N  # time to maturity, in units of years
    return np.array([self.S1, tau]), reward, done, {}

In AmeriOptionEnv, action 0 means hold (do not exercise) and action 1 means exercise. If we stick to the never-exercise policy until expiry, the environment reduces to a stock price simulator, whose paths can be used to price the European option as a control variate.
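The reset method is not shown in the post; a minimal sketch, assuming __init__ stores the initial price S0 along with r, sigma, T and N as attributes, could be:

def reset(self):
    # sketch only: restart an episode at the assumed initial spot price S0
    self.S1 = self.S0
    self.day_step = 0
    return np.array([self.S1, 1.0])  # observation = (price, time to maturity in years)

Holding for 365 days then produces one simulated price path: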

import matplotlib.pyplot as plt

env = AmeriOptionEnv()
s = env.reset()
sim_prices = []
sim_prices.append(s[0])
for i in range(365):
    action = 0  # hold until expiry
    s_next, reward, done, info = env.step(action)
    sim_prices.append(s_next[0])

plt.xlabel('Date')
plt.ylabel('Stock Price')
plt.plot(sim_prices)
Stock Price Gym Simulation

Section Three — Pricing with DQN

Once the gym environment is constructed, we are ready to price the American option using reinforcement learning, specifically DQN (Deep Q-Network) in this post. Here I use the TensorFlow TF-Agents library. Alternatives are other OpenAI Gym-compatible libraries, such as PyTorch-based frameworks and OpenAI Baselines.

The code follows the TF-Agents API documentation precisely. The only changes I made are importing the customized AmeriOptionEnv environment and adjusting the hyper-parameters to values more pertinent to a one-year option than to the CartPole game.

As labelled in the Jupyter notebook, the RL model is constructed in the following steps:

  1. Import the user-defined AmeriOptionEnv environment
  2. Define the deep Q-network
  3. Create a DQN agent
  4. Construct the experience replay buffer
  5. Train the agent over episodes
  6. Use the trained policy to price the option
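Steps 1 and 2 produce the objects referenced in the agent definition below (train_env, q_net, optimizer, train_step_counter). A rough sketch along the lines of the TF-Agents DQN tutorial, with placeholder layer sizes and learning rate rather than the exact values from the notebook, is:

import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import gym_wrapper, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.utils import common

# wrap the custom gym environment for TF-Agents (separate train and eval copies)
train_env = tf_py_environment.TFPyEnvironment(gym_wrapper.GymWrapper(AmeriOptionEnv()))
eval_env = tf_py_environment.TFPyEnvironment(gym_wrapper.GymWrapper(AmeriOptionEnv()))

# a small fully connected Q-network mapping (price, time) to the two action values
q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=(100,))

optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3)
train_step_counter = tf.Variable(0)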

According to the API, a TF-Agents DQN agent is defined as

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

agent.initialize()

This makes the agent aware of the environment's state and action specs, the deep Q-network used for policy evaluation, and the optimizer and temporal-difference loss function used for TD optimization.
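Steps 4 and 5, the experience replay buffer and the training loop, are not reproduced here; a condensed sketch following the TF-Agents DQN tutorial, with placeholder hyper-parameters, is:

from tf_agents.drivers import dynamic_step_driver
from tf_agents.replay_buffers import tf_uniform_replay_buffer

# replay buffer matching the agent's collected transition spec
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=100000)

# driver that steps the environment with the collect policy and records transitions
collect_driver = dynamic_step_driver.DynamicStepDriver(
    train_env,
    agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=1)

# sample mini-batches of consecutive transition pairs for TD learning
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3, sample_batch_size=64, num_steps=2).prefetch(3)
iterator = iter(dataset)

num_iterations = 20000  # placeholder; tune for the one-year option
for _ in range(num_iterations):
    collect_driver.run()
    experience, _ = next(iterator)
    train_loss = agent.train(experience).loss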

Policy Training Performance

The training performance is shown above. It is rather noisy because the evaluation step uses only 10 simulation paths and is therefore subject to Monte Carlo randomness. For example, we know the option price is around $7, yet the average evaluation return can swing as high as $12. Therefore, after learning the optimal stopping policy, it is essential to run a full-blown Monte Carlo with the trained policy to find the actual price, as shown below.
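The helper compute_avg_return is the standard evaluation function from the TF-Agents DQN tutorial; it simply averages the (already discounted) episode reward over many simulated paths, roughly:

def compute_avg_return(environment, policy, num_episodes=10):
    # run the policy for num_episodes episodes and average the total episode reward
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0
        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return
    return (total_return / num_episodes).numpy()[0]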

import pandas as pd

npv = compute_avg_return(eval_env, agent.policy, num_episodes=2_000)
pricing_dict['ReinforcementAgent'] = npv
pricing_df = pd.DataFrame.from_dict(pricing_dict, orient='index')
pricing_df.columns = ['Price']
print(pricing_df)

The reinforcement learning agent values the option at $7.057, implying an early exercise premium of $0.129. This result is in line with the classical baseline models.

Conclusion

In this post, we prepare a gym environment and then train a DQN TF-Agent to price an American option. The result is encouraging, with a reasonably good price that is in line with the classical baseline models. Some possible improvements follow.

For practitioners,

  1. Use a mirror AmeriOptionEnv gym environment to provide antithetic variates.
  2. In the compute_avg_return function, continue each simulation path to expiry to price the European option as a control variate.

For researchers,

  1. Add another stochastic process to the MDP state to capture stochastic volatility.
  2. Instead of using the default network structure, design a specialized multi-layer network to enable transfer learning into other maturities as well as options on rates, futures, FX, and exotic products.

Reference

  1. Shreve, Steven E. Stochastic Calculus for Finance II: Continuous-Time Models. Vol. 11. Springer Science & Business Media, 2004.
  2. Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
  3. Silver, David. UCL Course on Reinforcement Learning.
