Building A Monte Carlo Method Stock Price Simulator With Geometric Brownian Motion And Bootstrap Sampling
Summary
I built a web app using Python Flask that lets you simulate future stock price movements using a method called Monte Carlo simulation, with a choice of two ‘flavours’: Geometric Brownian Motion (GBM) and Bootstrap Sampling.
While I include some programming code below, this article is not a ‘code-along’ tutorial and focuses more on the underlying theory behind GBM and Bootstrap Sampling.
However, the native *.py files and a Jupyter Notebook version of the same code are available on the GitHub link below if you want to dig deeper.
Warning! I am not an investment guru of any sort, so I strongly suggest you do NOT use this article as the (sole) basis for your investment decisions.
Outline
- Geometric Brownian Motion (GBM)
- Monte Carlo Simulations
- Simulating A Single Stock Using GBM
- Simulating Multiple Stocks In A Portfolio Using GBM
- An Alternative Approach Using Bootstrap Sampling
- Bootstrap vs GBM: Which Works Better?
- Conclusion
Geometric Brownian Motion
You may be familiar with Brownian Motion from high school physics: the seemingly random motion of air particles as they collide with each other. While we might not be able to characterize the behavior of a single particle, physics and statistics give us the ability to describe the likelihoods of where the overall system will end up.
In much the same way, Geometric Brownian Motion is a model of an asset's returns in which the price of the asset (shares, investments, etc.) is modelled as a random walk, i.e. a process where changes in stock prices have the same distribution and are independent of each other. It therefore assumes that the past movement or trend of a stock price or market cannot be used to predict its future movement (source: Investopedia).
Rather than dive headlong into some reasonably complex stochastic differential equations, here is an intuition of how it works. The model is usually written as:

dSₜ = μSₜdt + σSₜdWₜ
where:
- Sₜ=the stock price
- dSₜ=the change in stock price
- μ=the expected return
- σ=the standard deviation of returns
- Wₜ=the Wiener Process random variable, where increments over time t have a normal distribution N centered at zero (i.e. Wₜ − W₀ ~ N(0, t))
- dt=the elapsed time period
Essentially, the GBM model allows us to model future prices as a combination of a “drift” that is driven by the average (i.e. the mean of the log returns) and a “shock” which is random but can still be characterized by the volatility (i.e. the standard deviation of the log returns).
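For completeness, the SDE above has a well-known closed-form solution, and this is the form the simulation code later in this article implements step by step:

Sₜ = S₀ exp((μ − σ²/2)t + σWₜ)

The (μ − σ²/2)t term is the “drift” and σWₜ is the “shock”; both reappear as the drift and diffusion variables inside the code.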
However it should be noted the model makes a few KEY assumptions:
- The logarithmic change of the stock price is normally distributed *
- Volatility (i.e Standard Deviation of the Log Returns) is constant
- Expected return (i.e. the Mean of the Log Returns) is independent of stock performance
(*You may be wondering why we are using returns instead of prices. It's because returns are scale free (being in % terms rather than absolute values) and often have more ‘stable’ statistical properties (e.g. constant mean and variance). Furthermore, log returns have an advantage over simple returns because the log of a log-normally distributed random variable is normally distributed. These three posts (1), (2) and (3) from StatsExchange go into a lot more detail.)
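As a quick worked example of why log returns are convenient: they add across periods. If a stock moves from S₀ to S₁ to S₂, then log(S₂/S₀) = log(S₂/S₁) + log(S₁/S₀), so a multi-day log return is simply the sum of the daily log returns, whereas simple percentage returns compound multiplicatively and cannot just be summed.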
However, as we shall see later, these conditions may not always hold true, so the validity of any predictions needs to be evaluated very carefully.
Monte Carlo Simulations
As mentioned before, I already have a separate article on this topic, but in a nutshell: Monte Carlo simulations are a method of making predictions by repeatedly running models that have random variables and aggregating all the results to understand the range of outcomes.
Therefore, although each iteration of the GBM forecast will be slightly different, we can make multiple forecasts and aggregate all the results to see the overall range of potential price changes within the desired time frame.
As is to be expected, the further into the future we look, the wider the range (i.e. the P10 to P90 spread) of possible prices. (As a rule of thumb, there is an academic paper suggesting that GBM works best for forecasting when limited to a lookahead of at most 2 weeks.)
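To make the aggregation step concrete, below is a minimal sketch of collapsing many GBM runs into percentile bands (the function name and the 1,000-run count are my own choices, not the app's):

import numpy as np

def simulate_percentile_bands(S0, mu, sigma, T, N, n_sims=1000):
    # Run n_sims independent GBM paths and summarise them as percentile bands
    dt = T / (N - 1)
    paths = np.zeros((n_sims, N))
    for k in range(n_sims):
        Z = np.random.normal(0., 1., N - 1)  # one random shock per time step
        log_increments = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * Z
        paths[k] = S0 * np.exp(np.concatenate(([0.], np.cumsum(log_increments))))
    # The spread between P10 and P90 widens the further out we forecast
    p10, p50, p90 = np.percentile(paths, [10, 50, 90], axis=0)
    return p10, p50, p90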
Simulating A Single Stock Using GBM
The web app was built by extending code from another author (see the article below), which I combined with Python Flask to allow users to select the stock counter, the desired historical data range, and the forecast duration via a website.
The code looks a bit like the below, where a function extracts the stock prices from Yahoo Finance using pandas-datareader:
import pandas as pd
import numpy as np
import numpy.random
import random
import matplotlib.pyplot as plt
from datetime import datetime
from pandas_datareader import data
from numpy import mean
from scipy import stats
from scipy.stats import norm, kstest, skew, kurtosis

def extract_prices(start_date, end_date, symbols, portfolioValPerSymbol, backtestduration=0):
    # Pull adjusted close prices for all counters from Yahoo Finance
    dfprices = data.DataReader(symbols, start=start_date, end=end_date, data_source='yahoo')
    dfprices = dfprices[['Adj Close']]
    dfprices.columns = [' '.join(col).strip() for col in dfprices.columns.values]
    # Convert the dollar value allocated to each counter into a number of shares
    # (portfolioValPerSymbol and priceAtEndDate were undefined in the original
    # snippet; this is a reconstruction of the intended logic)
    priceAtEndDate = dfprices.iloc[-1].values.tolist()
    noOfShares = []
    for i in range(0, len(symbols)):
        noOfShares.append(portfolioValPerSymbol[i] / priceAtEndDate[i])
    noOfShares = [round(element, 5) for element in noOfShares]
    # Sum the per-counter holdings into a combined "portfolio" price series
    listOfColumns = dfprices.columns.tolist()
    dfprices["Adj Close Portfolio"] = dfprices[listOfColumns].mul(noOfShares).sum(1)
    print(f"Extracted {len(dfprices)} days worth of data for {len(symbols)} counters "
          f"with {dfprices.isnull().sum().sum()} missing data")
    return dfprices
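As a hypothetical example of calling it (using the CRM/NFLX counters from later in this article; the 50/50 dollar split and the portfolioValPerSymbol argument follow my reconstruction above):

dfprices = extract_prices("2019-01-01", "2020-07-31",
                          ["CRM", "NFLX"],
                          portfolioValPerSymbol=[5000, 5000])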
After which, another function converts the prices into log returns:
def calc_returns(dfprices, symbols):
    dfreturns = pd.DataFrame()
    columns = list(dfprices)
    mean = []
    stdev = []
    for column in columns:
        # Log daily return = difference of consecutive log prices
        dfreturns[f'Log Daily Returns {column}'] = np.log(dfprices[column]).diff()
        mean.append(dfreturns[f'Log Daily Returns {column}'][1:].mean())
        stdev.append(dfreturns[f'Log Daily Returns {column}'][1:].std())
    dfreturns = dfreturns.dropna()
    if len(dfreturns.columns) == 1:
        df_mean_stdev = pd.DataFrame(list(zip(symbols, mean, stdev)),
                                     columns=['Stock', 'Mean Log Daily Return', 'StdDev Log Daily Return'])
    else:
        df_mean_stdev = pd.DataFrame(list(zip(symbols + ["Portfolio"], mean, stdev)),
                                     columns=['Stock', 'Mean Log Daily Return', 'StdDev Log Daily Return'])
    return dfreturns, df_mean_stdev
These log returns are fed as input into another function that estimates the mean and standard deviation, which are then pushed, along with the initial price, to a function that fits these inputs into the equation discussed in the previous section:
def GBMsimulatorUniVar(So, mu, sigma, T, N):
    """
    Parameters
    So: initial stock price
    mu: expected return
    sigma: volatility
    T: time period
    N: number of increments
    """
    dim = np.size(So)
    t = np.linspace(0., T, int(N))
    S = np.zeros([dim, int(N)])
    S[:, 0] = So
    for i in range(1, int(N)):
        # Deterministic "drift" from the mean log return
        drift = (mu - 0.5 * sigma**2) * (t[i] - t[i-1])
        # Random "shock" scaled by the volatility
        Z = np.random.normal(0., 1., dim)
        diffusion = sigma * Z * (np.sqrt(t[i] - t[i-1]))
        S[:, i] = S[:, i-1] * np.exp(drift + diffusion)
    return S, t
There is an important caveat to the validity of the results, though. Within the code, a best-fit normal distribution is estimated from the historical data, but what is more relevant is how “normal” the returns actually are.
Therefore I've bolted on a number of statistical tests for normality that I will not cover in detail in this article. Suffice to say, these tests generate a statistic and a p-value that can be tested against a chosen significance level to check the ‘normality’(*) of the historical returns data. Specifically, I've implemented two tests, a Kolmogorov-Smirnov test and a Shapiro-Wilk test, with a significance level of 5%. (Why 5%? Read this.)
(*Strictly speaking, it's not exactly ‘checking’ per se, but I'd rather not get into the mechanics of how hypothesis testing works for now.)
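As a rough sketch of what such a check can look like (check_normality is a hypothetical helper, not the app's actual function):

import numpy as np
from scipy.stats import kstest, shapiro

def check_normality(logreturns, alpha=0.05):
    # Fit a normal distribution to the historical log returns
    mu, sigma = np.mean(logreturns), np.std(logreturns, ddof=1)
    # Kolmogorov-Smirnov test of the sample against the fitted normal
    ks_stat, ks_p = kstest(logreturns, 'norm', args=(mu, sigma))
    # Shapiro-Wilk test for normality
    sw_stat, sw_p = shapiro(logreturns)
    # A p-value below alpha means we reject normality at that significance level
    return {"Kolmogorov-Smirnov": ks_p > alpha, "Shapiro-Wilk": sw_p > alpha}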
Unfortunately, running the numbers on a few different shares (using anywhere from 6 months to a year's worth of historical data) shows that the returns are not always “normal”, as per the arbitrary example below for the stocks CRM (Salesforce) and NFLX (Netflix) over Jan 2019 to Jul 2020.
However for now, let’s just assume that the stocks we will use this method on will behave ‘nicely’.
Simulating Multiple Stocks In A Portfolio Using GBM
A more realistic problem is modelling a stock portfolio that consists of multiple counters.
Unfortunately, you cannot just run separate GBM simulations for 2 different stocks and then combine them, because although the movements in returns for each stock are random, the returns(*) across stocks are correlated.
(*As mentioned in the previous section, it's the correlation of Returns that is estimated and NOT Prices. This article explains why.)
<WARNING: SOME MATH INCOMING!>
Therefore, to incorporate this correlation effect back into the GBM model, we use a modified form of the earlier equation that now includes a new term, Aᵢⱼ, the Cholesky factor of the covariance matrix between the stock returns:

dSᵢ,ₜ = μᵢSᵢ,ₜdt + Sᵢ,ₜ Σⱼ Aᵢⱼ dWⱼ,ₜ (where A satisfies AAᵀ = Cov)
This Cholesky decomposition is applied to the returns covariance matrix, and the resulting factor is multiplied by the vector of uncorrelated random variables to ‘induce’ correlation between them.
As an example, below is the covariance matrix for the same example earlier, made up of 3 elements (the stock counters CRM and NFLX plus the combined portfolio) over the same period. Covariance matrices are typically a bit harder to interpret because they reflect the “absolute” joint variability, so another way to visualize the relationship between the variables is to use a correlation matrix, a “normalized” version of the covariance matrix where each value lies between -1 (completely negatively correlated) through 0 (no correlation) to +1 (perfectly positively correlated).
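To see how the Cholesky factor ‘induces’ correlation, here is a toy demonstration with made-up numbers (NOT the real CRM/NFLX figures):

import numpy as np

# A made-up 2x2 covariance matrix of daily log returns
Cov = np.array([[0.0004, 0.0002],
                [0.0002, 0.0009]])
A = np.linalg.cholesky(Cov)               # Cholesky factor, A @ A.T == Cov
Z = np.random.normal(0., 1., (2, 10000))  # uncorrelated standard normal shocks
correlated = A @ Z                        # the shocks are now correlated
print(np.cov(correlated))                 # recovers something close to Cov
print(np.corrcoef(correlated))            # the "normalized" correlation matrix view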
Unfortunately, with this new term in the equation, in addition to whether the returns are “normally distributed”, there is another check required: whether the covariance matrix is symmetric positive definite.
Especially for larger portfolios with dozens of different stocks, this gets more complex, because we have to worry not just about the correlation between 2 different stocks but about all the correlations between MULTIPLE stocks, such that the relationships stay “consistent”.
As a consequence, occasionally the algorithm may spit out an error message: LinAlgError: Matrix is not positive definite — Cholesky decomposition cannot be computed. (Positive definiteness is a property of a matrix M whereby zᵀMz > 0 for every non-zero vector z; the Cholesky factorisation only exists for matrices that satisfy it.)
This happens because, in practice, share returns data may contain noise and, especially for large portfolios with many shares (i.e. high dimensionality), some shares may be multicollinear (i.e. there may be interdependencies between share returns).
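If you want to check up front whether a covariance matrix will pass the Cholesky step, one simple approach is to just attempt the factorisation (a sketch; is_positive_definite is my own helper name):

import numpy as np

def is_positive_definite(Cov):
    # np.linalg.cholesky succeeds only for (symmetric) positive definite input,
    # so a try/except doubles as a positive definiteness check
    try:
        np.linalg.cholesky(Cov)
        return True
    except np.linalg.LinAlgError:
        return False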
Note that the multivariate form of GBM assumes that the covariance is ALSO constant over time. (Unfortunately, in practice, just like the mean and variance, correlation too can change over time.)
<MATH WARNING ENDS>
So, putting that all together in code form looks a bit like the below, where there is a new function to calculate the covariance of the log returns between the different stocks, and the GBM function takes an additional covariance term as input:
from numpy.linalg import LinAlgError

def create_covar(dfreturns):
    try:
        arrOfReturns = []
        columns = list(dfreturns)
        for column in columns:
            returns = dfreturns[column].values.tolist()
            arrOfReturns.append(returns)
        # Covariance matrix of the log returns across all counters
        Cov = np.cov(np.array(arrOfReturns))
        return Cov
    except LinAlgError:
        # nearPD (helper not shown here) nudges the matrix to the nearest
        # positive semi-definite one so that Cholesky decomposition can proceed
        Cov = nearPD(np.array(arrOfReturns), nit=10)
        print("WARNING - Original covariance matrix is NOT positive semi-definite "
              "and has been adjusted to allow for Cholesky decomposition")
        return Cov

def GBMsimulatorMultiVar(So, mu, sigma, Cov, T, N):
    """
    Parameters
    So: initial stocks' price
    mu: expected return
    sigma: volatility
    Cov: covariance matrix
    T: time period
    N: number of increments
    """
    dim = np.size(So)
    t = np.linspace(0., T, int(N))
    A = np.linalg.cholesky(Cov)  # Cholesky factor of the covariance matrix
    S = np.zeros([dim, int(N)])
    S[:, 0] = So
    for i in range(1, int(N)):
        drift = (mu - 0.5 * sigma**2) * (t[i] - t[i-1])
        Z = np.random.normal(0., 1., dim)                        # uncorrelated shocks
        diffusion = np.matmul(A, Z) * (np.sqrt(t[i] - t[i-1]))   # correlated shocks
        S[:, i] = S[:, i-1] * np.exp(drift + diffusion)
    return S, t
An Alternative To GBM — Bootstrap Sampling
Unfortunately, the GBM method needs to make a lot of assumptions about the shape of the underlying distribution in order for it to work. Therefore, I wanted to share an alternative approach called Bootstrap Sampling with Replacement.
This method takes random samples from the historical data to generate new ‘synthetic’ datasets to predict future prices. In this way, bootstrapping bypasses the need for an explicit theoretical model to generate the forecasts. (Hence why it's called bootstrapping, after the expression “pull yourself up by your bootstraps”, which means to improve your situation without outside help.)
Within the code, I have simulated this sampling-with-replacement behaviour using numpy's random.randint to ‘choose’ a random timestamp from which to extract the historical log returns. This is then done repeatedly to create a sequence of random log returns that builds up the forecast:
def bootstrap_w_replc_singleval(dfreturns):
    # Pick one random row (a single day's log returns) from the historical data
    columns = dfreturns.columns
    singlesample = pd.DataFrame(dfreturns.values[np.random.randint(len(dfreturns), size=1)],
                                columns=columns)
    return singlesample

def bootstrapforecast(dfreturns, T):
    columnlist = dfreturns.columns
    X = []
    for i in range(0, T):
        # Sample with replacement, one day at a time, for T forecast days
        X.append(bootstrap_w_replc_singleval(dfreturns).values.tolist()[0])
    Y = pd.DataFrame(X)
    Y.columns = columnlist
    Y.loc[-1] = [0] * len(columnlist)  # adding a row of zeros as the starting point
    Y.index = Y.index + 1              # shifting index
    Y = Y.sort_index()                 # sorting by index
    return Y
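Note that bootstrapforecast returns a sequence of log returns rather than prices, so to get a price path you compound the returns from the last observed price. A minimal sketch (logreturns_to_prices is my own helper name, taking the Y dataframe from above):

def logreturns_to_prices(S0, df_logreturns):
    # Price path = starting price times the exponential of the cumulative log returns
    return S0 * np.exp(df_logreturns.cumsum())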
However, this approach still relies on the extracted historical returns data being an unbiased representation of the underlying behaviour of the stock returns.
Also, even though the data no longer needs to be normally distributed, bootstrapping assumes that the mean and variance are homoskedastic (i.e. unchanged over time). In other words, we are assuming any observation (from the historical returns dataset) is equally likely to be selected and that its selection is independent.
This method also works for portfolios of multiple stocks, but unlike GBM, with Bootstrap Sampling there is no need to estimate covariances: we instead sample “an entire row” of returns from the historical data, thereby implicitly capturing any correlations between the counters.
Bootstrap vs GBM: Which Works Better?
So which of the two methods works better? Well, it depends…
The first basic check should be to test the normality and skew of the returns distribution (as GBM needs this assumption for the results to be valid).
In the example above, the shape of the log returns distribution seemed to indicate that the stock returns met the normality assumptions and had a fairly even, symmetrical shape (which implied that GBM should have been a valid approach).
However, compared with Bootstrap Sampling for the same backtest period of 30 days, the GBM method gave a larger Root Mean Square Error. Therefore my advice would be to play around with both at varying backtesting durations and compare the RMSEs. (Or, if you are using the Jupyter Notebook version of this file directly, you could also write your own script to test for MAE, MAPE or whatever forecasting accuracy metric you prefer.)
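If you do want to roll your own metrics, a minimal sketch might look like the below (rmse and mape are my own helper names; actual and forecast are assumed to be aligned arrays of backtest and simulated prices):

import numpy as np

def rmse(actual, forecast):
    # Root Mean Square Error: penalizes large misses more heavily
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.sqrt(np.mean((actual - forecast) ** 2))

def mape(actual, forecast):
    # Mean Absolute Percentage Error: scale-free, expressed in %
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.mean(np.abs((actual - forecast) / actual)) * 100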
Conclusion
So, after all that, the big question is… can these methods be used to predict future stock prices and make a profit?
Short answer: …not really (sadly). Most real-life stock returns have fat-tailed distributions and exhibit volatility clustering behaviour (i.e. the standard deviation and variance do not stay nice and fixed over time), which breaks the assumptions we made earlier.
However, for certain stocks within specific time frames where the assumptions still hold (or come close to being valid), these methods do provide a structured way to estimate the relative worst-to-best case range of short-term (~1–2 week) returns.
If you've visited the website, you may have noticed that aside from GBM and Bootstrap Sampling, the web app also allows predictions to be made using other ‘traditional’ statistical time series forecasting approaches like ARIMA, Holt-Winters and Vector Auto Regression. Let me know in the comments below if you'd like me to cover any of these other methods.
In the meantime, if you’ve enjoyed this article — check out my other articles below: