Building A Probabilistic Risk Estimate Using Monte Carlo Simulations

Zhijing Eu
Published in Analytics Vidhya · 16 min read · Nov 12, 2020

A Structured Approach To Making A Data Driven Best Guess With XLRisk

Image Source : Geek Dad https://geekdad.com/2018/02/part-2-build-perfect-dungeon-master-dice-set/

In this installment, I explain how to use Monte Carlo simulation to build a probabilistic estimate, even when you don't have all the relevant information, have to deal with uncertain input variables, and are exposed to potential risks, using an Excel add-in called XLRisk.

Outline

  1. Introduction
  2. Just Give Me The Number Already !
  3. The Problem With Single Point Estimates & Best-Worst Case Scenarios
  4. A Primer On Probability Distributions
  5. Monte Carlo Simulation Tools
  6. Modelling Uncertain Ranges
  7. Modelling Discrete Risk Events
  8. Modelling Dependencies / Correlations Between Uncertain Variables
  9. Understanding Key Factors Driving The Results
  10. Conclusion

1.Introduction

Life is full of uncertainties, and it can be hard to make good decisions when you don't have all the relevant information or have to deal with the possibility of events that may impact your objectives.

This post outlines a technique called Monte Carlo Simulation* that involves building a model with a set of pre-defined random variables as inputs, running the model repeatedly, and analyzing the aggregated outcomes across all the runs. (*Random trivia: the term was coined by John von Neumann after the district in Monaco famous for its casinos. Aside from his other achievements in nuclear physics, game theory and computing, von Neumann also suggested the idea of neural networks way back in the 1950s!)

Although this topic is not exactly data-science related, I've been meaning to write this article for some time: it's a subject close to my heart, as I've spent a good chunk of my professional career as a Project Risk Analyst doing precisely this sort of work. I have also tangentially covered this method in some of my other posts,

e.g. where I used Monte Carlo simulations to model the spread of COVID-19

Or… where I used Monte Carlo simulations to select the optimal project portfolio using Linear Programming, even under uncertain input conditions.

2.Just Give Me The Number Already !!!

When faced with questions like:

  • How much will building a house cost?
  • How much will this holiday set me back?
  • How many COVID-19 infection cases do we expect within the next 6 months?

…it is tempting to provide a single-figure estimate as the answer.

Unfortunately, a single-figure estimate ignores the fact that the 'true answer' will likely fall within a range of figures that depends on:

  • The level of definition of the question and the underlying assumptions (e.g. what sort of house did you mean exactly? Is this the construction of a brand-new house or the renovation of an existing one?)
  • The availability and nature of the supporting information used to make a meaningful estimate (e.g. consider the difference between only having the historical cost per sqft of properties sold in a particular city, with no other granularity, versus having a comprehensive database of actual transacted property sales grouped by property type)
  • The tolerance for variability, or the degree of certainty we want in the estimate (e.g. the price of a cup of coffee, where the consequence of inaccuracy is trivial, versus a forecast of COVID-19 case numbers used to plan for sufficient hospital beds, where we would need to be much more conservative and err on the side of caution)

For the rest of this article, we'll use a simple fictional example: a budget-planning exercise for a vacation to a fictional winter island country, answering the question "How Much Will My Holiday To Fantasy Island Cost?"

Behold! Fantasy Island, the holiday destination of your dreams (although you might want to give white-water rafting a pass)

3.The Problem With Single Point Estimates & Best-Worst Case Scenario Estimates

As a first attempt, a simple 'deterministic*' approach would be to list the relevant sub-elements that make up the total holiday cost, make an estimate for each one, and add up the resulting total.

(*A deterministic estimate is one where no randomness is involved and will thus always produce the same output from a given initial input )

This yields a straightforward answer of about 4.2k USD.
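For readers who prefer code over spreadsheets, here is a minimal sketch of what this deterministic estimate amounts to. The line-item values below are purely illustrative assumptions (the actual figures sit in the article's Excel table), chosen so the total lands near the 4.2k USD mentioned above.

```python
# Deterministic estimate: list the sub-elements, estimate each one, sum them up.
# All line-item values are illustrative assumptions, in Fantasy Bucks (FB).
FX_RATE = 4.0  # fixed exchange rate: 4 Fantasy Bucks per 1 USD

line_items_fb = {
    "Plane Fare": 6_000,
    "Accommodation": 4_000,
    "Meals": 2_800,
    "Activities/Tours": 3_000,
    "Shopping": 1_000,
}

total_fb = sum(line_items_fb.values())
total_usd = total_fb / FX_RATE
print(f"Deterministic total: {total_fb:,.0f} FB = {total_usd:,.0f} USD")  # ~4,200 USD
```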

But wait! We may have more information than what's in the above table: for each of the elements, we have a sense of a high and a low figure, because we compared the pricing of a few hotels against some AirBnB/hostel rates, and we also had feedback from two or three different friends on their own families' experiences with holiday tour costs, meal costs and so on.

Perhaps a slightly better way of doing this is to use a range of figures. The results are still deterministic, but at least now there is a 'max' and a 'min' figure.

Now, with this 'best-worst scenario' approach, we have a range of outcomes from 3.0k USD to 5.4k USD. This is an improvement, but the problem now is: if we wanted to set a budget aside, should we just assume a number in the middle of the best and worst cases, i.e. 4.2k USD?

4.A Primer On Probability Distributions

Well… the answer to the previous question is: it really depends. If you have more information than just a best-worst case estimate (for example, if you somehow had the results of a comprehensive cost survey of hotel rates within Fantasy Island), OR if you want a conservative holiday budget figure (i.e. one with some contingency, so that there is high confidence the budget will not be exceeded), a probabilistic estimate* may be the way to go.

(*A probabilistic estimate includes elements of randomness, so every time you run the model you are likely to get different results)

However, before building a probabilistic risk estimate, we first need to cover some basic theory about probability distributions:

A probability distribution describes the likelihood of obtaining each of the possible values of a random variable. A trite but useful example is the outcome of a six-sided dice roll.

Probability Density Functions

The chart below is a probability density function (PDF), where the X axis represents the value of the random variable and the Y axis represents the likelihood of 'seeing' the corresponding X value occur in a single trial. (Strictly speaking, for a discrete variable like a dice roll this is a probability mass function, but the idea is the same: assuming a fair die, the chance of seeing each number is 1/6, i.e. about 0.1667.)

You may also be familiar with the normal distribution, which has a distinctive 'bell-shaped' curve. Unlike the dice example, it describes a continuous variable: not just a whole number between 1 and 6, but any value within the min-max range.

A PDF plot for a normal distribution centered around 0. Image Source : https://www.itl.nist.gov/div898/handbook/eda/section3/eda362.htm

Cumulative Distribution Functions

There is also another way of visualizing probability distributions, called a cumulative distribution function (CDF), where the Y axis is NOT the probability of the occurrence of the X axis value, but instead the probability that the variable takes a value less than or equal to the corresponding X axis value. So, for the chart below, the 75th percentile sits at an X value of roughly 0.67.

A CDF for the same normal distribution shown earlier. Image adapted from: https://www.itl.nist.gov/div898/handbook/eda/section3/eda362.htm

There are MANY other types of probability distributions (and technically, if you have sufficient reliable raw data, e.g. a detailed price survey of hotel costs, you could fit the histogram of price ranges directly into the model), but for now I'll introduce a trusty workhorse called the Beta-PERT distribution:

You can read the details in the link, but the TL;DR version is that it is a very convenient continuous distribution, as it only needs three 'intuitive' values to define it: a worst case, a most likely, and a best case estimate.
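The article builds the model in Excel with XLRisk, but as a rough illustration of how few inputs a Beta-PERT needs, here is a small Python sketch that constructs one from a low / most likely / high triple via scipy's beta distribution. The 300/450/900 spread is a made-up example, and lam=4 is the conventional PERT shape parameter.

```python
import numpy as np
from scipy import stats

def pert_sample(low, mode, high, size=1, lam=4.0, rng=None):
    """Sample from a Beta-PERT distribution defined by a best case (low),
    most likely (mode) and worst case (high) value."""
    rng = np.random.default_rng(rng)
    alpha = 1 + lam * (mode - low) / (high - low)
    beta = 1 + lam * (high - mode) / (high - low)
    return low + (high - low) * stats.beta.rvs(alpha, beta, size=size, random_state=rng)

# e.g. a cost element with a 300 / 450 / 900 (low / most likely / high) spread
samples = pert_sample(300, 450, 900, size=10_000, rng=42)
print(samples.mean(), np.percentile(samples, [10, 50, 90]))
```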

So, in summary, putting all of this together, a visual explanation looks a bit like this:

So, having dealt with some of the basic theory, we can move on to how to actually build a probabilistic risk estimate.

5.Monte Carlo Simulation Tools

There are a number of spreadsheet-based tools that go beyond what plain-vanilla Excel can do in terms of probability sampling and dealing with correlated variables (more on that later).

The leading packages in the market are Palisade's @Risk, Oracle Crystal Ball and Vose Software's ModelRisk.

(Side note: part of the benefit of running the simulations in a spreadsheet is the ease of defining and adjusting the inputs for the model. However, if you are a Python enthusiast and prefer writing code, you can achieve similar results using the scipy.stats library, which I have done in another post.)

Unfortunately, all three of the leading tools above are available only via paid subscription, but someone has built an open-source alternative called XLRisk that has the same basic features as these commercial tools, and it is what we will be using for the rest of the example.

XLRisk adds a custom menu on the Excel Ribbon

If you would like to follow along, the Excel model file is available in the link below

6.Modelling Uncertain Ranges

In this scenario, we go beyond just high-low estimates and also make 'most likely' estimates for certain line items.

The exception is the estimate for Holiday Activities/Tours, where we can't really make this 'middle' figure estimate, so we just use a high-low range rather than the earlier 3k Fantasy Bucks single estimate. Another exclusion is the Shopping expense, which we will keep as a 'fixed' figure where we ALWAYS spend exactly 1k Fantasy Bucks (in a sense you could say it sits outside the model, since it has no variability).

Running 1,000 simulations leads to the output below, where the mean budget estimate is 3.7k USD. However, take note: the point is NOT about getting a single value; it's the output distribution that matters.

Using a Monte Carlo simulation, we now have the range of outcomes from the 0th to the 100th percentile, which can be useful if you want to set a budget. If you are conservative and want a budget estimate that is 'rarely' ever exceeded, you might opt to use the 90th percentile figure, meaning that 90% of the time the actual cost will be BELOW this estimate.
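For anyone following along in Python rather than XLRisk, a minimal sketch of this 'uncertain ranges' run might look like the snippet below, reusing the pert_sample() helper from earlier. The low / most likely / high spreads are illustrative assumptions, so the printed figures will not exactly match the spreadsheet's 3.7k USD mean.

```python
import numpy as np

N = 1_000
rng = np.random.default_rng(7)

plane = pert_sample(5_000, 6_000, 8_000, size=N, rng=rng)   # FB, assumed spread
accom = pert_sample(3_000, 4_000, 6_000, size=N, rng=rng)
meals = pert_sample(2_000, 2_800, 4_000, size=N, rng=rng)
tours = rng.uniform(2_000, 4_500, size=N)   # high-low only, no "most likely"
shopping = 1_000                            # fixed, no variability

total_usd = (plane + accom + meals + tours + shopping) / 4.0  # fixed FX rate of 4:1

print("Mean        :", round(total_usd.mean()))
print("P10/P50/P90 :", np.percentile(total_usd, [10, 50, 90]).round())
```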

7.Modelling Risk Events

In addition to the variability within the various cost elements, there are also events that may only sometimes occur, but if they do, they generate a lot of inconvenience and additional cost for the holiday.

This can be modelled using the Bernoulli distribution, which is another discrete distribution (like the dice-roll example earlier). You can think of it as a coin flip with a defined likelihood of heads/tails that turns a risk on or off in the simulation.

Example of a Bernoulli Distribution with a 60% chance of the variable being 1 (Source https://mathworld.wolfram.com/BernoulliDistribution.html)

We incorporate this approach into the model as per below, adding two 'risk events' to the total.

The resulting output behaves as shown below: at the tail there is a bump that reflects what happens when the 10% or 5% chance risk events 'kick in'. Although the 'middle shape' does not change much (i.e. the mean and median are about the same), the max value has now gone up to 5.9k USD, compared to the earlier 4.9k USD without the risk events.
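Continuing the Python sketch, the two risk events can be bolted on with a Bernoulli draw per iteration. The 10% and 5% likelihoods follow the article; the impact ranges are illustrative assumptions.

```python
# Discrete risk events: a Bernoulli draw switches each risk on or off per iteration.
medical_hit = rng.random(N) < 0.10   # 10% chance of a medical emergency
theft_hit = rng.random(N) < 0.05     # 5% chance of losing our belongings

medical_cost = medical_hit * pert_sample(2_000, 3_000, 5_000, size=N, rng=rng)  # FB
theft_cost = theft_hit * pert_sample(1_000, 1_500, 3_000, size=N, rng=rng)      # FB

total_with_risks_usd = total_usd + (medical_cost + theft_cost) / 4.0
print("Max :", round(total_with_risks_usd.max()))  # the 'bump' in the tail comes from these events
```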

We didn’t use this in the example but in theory, we could also include ‘2nd’ order risks where the second event only occurs IF the first event triggers.

For example, maybe there is a risk that not only do we lose our belongings, but a secondary risk that we also end up with exorbitant charges because our stolen credit card is used by an unscrupulous fraudster to fund a crazed shopping spree. The challenge with this sort of 2nd order risk modelling is the need to 'calibrate' the likelihood of the second event so that it truly reflects the conditional likelihood. All of which is a convenient segue into the next subject…
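A second-order risk of this kind is only a small extension of the sketch above: the fraud event is drawn only in iterations where the theft event has already fired, and the 40% figure below is a made-up conditional likelihood P(fraud | theft).

```python
# Second-order risk: credit-card fraud can only occur IF the theft event triggered.
fraud_hit = theft_hit & (rng.random(N) < 0.40)                            # conditional draw
fraud_cost = fraud_hit * pert_sample(500, 1_000, 4_000, size=N, rng=rng)  # FB, assumed impact

total_2nd_order_usd = total_with_risks_usd + fraud_cost / 4.0
# Unconditionally, fraud fires in roughly 0.05 * 0.40 = 2% of iterations.
```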

8.Modelling Dependencies Between Uncertain Variables

Right now, the model is set up such that the random sampling of values from the probability distributions is assumed to be independent.

Due to something called the Central Limit Theorem, when you have a lot of independent random inputs, the individual randomness of each input tends to 'cancel out' and you end up with a nice, clean, narrow normal distribution. This does not align with reality, where in most cases budgets are overrun rather than underrun, so something else must be going on…

Put another way, if we treat all the variables as independent, we are assuming that if accommodation costs are high, food costs can still be low, and that there is NO relationship between these variables.

Now, in practice we know that there is likely to be some sort of relationship between these cost elements; e.g. if we were to stay in a swanky, expensive hotel in a fancy part of town, we would probably be more likely to take our meals in the more upmarket and pricier restaurants surrounding the hotel.

This step of 'teasing out' the relationships between variables can get quite complicated, but since this is a toy example we are going to keep it simple and just assume a completely arbitrary 85% correlation between the three items: Accommodation, Meals and Activities/Tours. The strong positive correlation means that the sampled values will tend to move in tandem (i.e. if accommodation is expensive, meals will be too).

Also, to round this off: in all the other examples we've assumed a simple FIXED exchange rate of 4 Fantasy Bucks to 1 USD. Let's update this to a range instead. This 'FOREX uncertainty' variable is notable in that it 'amplifies' the uncertainty of the overall result, because the other line items, which themselves are uncertain, are multiplied by this variable FOREX rate.

When the simulation is now run, notice that the shape of the resulting curve has become more 'spread out' compared to the earlier run without correlation effects (e.g. the standard deviation/variance has increased slightly). If we had a more complex model with more correlated variables, the effect would be more pronounced: the general principle is that the more strongly correlated the variables, the 'wider' the spread of the resulting curve.

A Technical Side Note On Correlation Modelling

The trick used to ensure that the 'sampling' of the random variables is 'correlated' is the Iman-Conover method, which basically involves generating 'paired random numbers' that are used to sample the uncertain variables such that, in each iteration of the simulation, the resulting sampled values match a defined rank-order correlation coefficient.

If that sounds like a lot of math gobbledygook, maybe these two articles explain it better. (The first article is for @Risk, but the general approach is similar for XLRisk. The second article goes into more specifics and is also the basis of how XLRisk does it.)

However, in practice, modelling dependencies between random variables is a complex topic, and correlations are often difficult to estimate accurately. For example, in 2007-2010, a flaw in how correlations were modelled between different tranches of Collateralized Debt Obligations was one of the key factors that precipitated the Subprime Mortgage Crisis (see this article from Wired Magazine).

Although the Iman-Conover method is useful given how easy it is to define, it only simulates fixed correlation patterns, i.e. it can't handle situations where the correlation may differ depending on which part of the random variable's range is sampled. For those of you keen to dive deeper, there is a more advanced way to combine separate marginal distributions into multivariate distributions using copulas. (However, not all packages have this feature: ModelRisk and @Risk do, but XLRisk and Crystal Ball do not.)
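XLRisk handles this via the Iman-Conover re-ordering described above, but for a rough feel of how correlated sampling works in code, here is a sketch of the closely related Gaussian-copula approach: draw correlated standard normals, map them to uniforms, and push each uniform through its own marginal distribution. The 85% correlation follows the article; the PERT spreads are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def pert_ppf(u, low, mode, high, lam=4.0):
    """Inverse CDF of a Beta-PERT distribution, used to transform copula uniforms."""
    alpha = 1 + lam * (mode - low) / (high - low)
    beta = 1 + lam * (high - mode) / (high - low)
    return low + (high - low) * stats.beta.ppf(u, alpha, beta)

N = 1_000
rng = np.random.default_rng(7)

# Target correlation between Accommodation, Meals and Activities/Tours
corr = np.array([[1.00, 0.85, 0.85],
                 [0.85, 1.00, 0.85],
                 [0.85, 0.85, 1.00]])

z = rng.multivariate_normal(mean=np.zeros(3), cov=corr, size=N)  # correlated normals
u = stats.norm.cdf(z)                                            # map to uniforms

accom = pert_ppf(u[:, 0], 3_000, 4_000, 6_000)   # illustrative FB spreads
meals = pert_ppf(u[:, 1], 2_000, 2_800, 4_000)
tours = pert_ppf(u[:, 2], 2_000, 3_000, 4_500)

rho, _ = stats.spearmanr(accom, meals)
print("Sampled rank correlation (accom vs meals):", round(rho, 2))
```

The uncertain FOREX rate from the previous section would then divide each iteration's Fantasy-Buck total, which is the multiplicative effect that further amplifies the overall spread.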

9.Understanding Key Factors Driving The Results

So what? (you might be asking). Simply having the resulting distribution of the estimated total holiday cost is interesting but not very useful. The challenge now is to turn this into actionable insight.

XLRisk auto-generates a 'tornado chart': a horizontal bar graph ranking the correlations between each random input variable (i.e. uncertain ranges, risk-event likelihoods or impact ranges) and the outcome, where the bar length reflects the strength of the correlation.

Basically, the longer the bar, the more sensitive the variability in the overall result is to the uncertainty in that particular variable. In the holiday budget example it's a bit of an obvious answer, but the plane fare uncertainty (being the element with the biggest contribution to the total budget) is the top driver.

The slightly more interesting insight is that the Holiday Activities/Tours uncertainty is ranked higher than Accommodation Costs, even though both variables have a similar 'mean' value individually. This is probably due to the wider range of uncertainty on Holiday Activities/Tours costs compared to Accommodation Costs.
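If you were building the model in Python instead, a bare-bones version of this tornado ranking is just a rank correlation of each input's samples against the simulated total, sorted by absolute strength (reusing the arrays from the Section 6 sketch):

```python
from scipy import stats

inputs = {"Plane Fare": plane, "Accommodation": accom,
          "Meals": meals, "Activities/Tours": tours}

# Spearman rank correlation of each input against the simulated total
sensitivity = {name: stats.spearmanr(x, total_usd)[0] for name, x in inputs.items()}

for name, rho in sorted(sensitivity.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:<18} {rho:+.2f}")   # longer 'bar' = stronger driver of the total
```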

Therefore the (somewhat trivial) advice would be: if the objective is to reduce variability in the budget, book your flights early to 'lock in' the plane fare (which means flight costs are no longer a range but a fixed value). On the medical emergency risk, you could lower the 10% likelihood by avoiding skydiving, bungee jumping or other high-risk activities.

(Side note: there are other ways to rank the relative 'impact' of an uncertain variable on the overall result. Palisade's @Risk offers a tornado diagram showing regression coefficients instead of correlation coefficients, while VoseSoftware offers tornado diagrams showing the conditional mean, i.e. the mean of the output for the lowest and highest tranches of the input variable with all other variables held at their 'mean' value. You could also run the simulation multiple times, removing each risk event one at a time, to determine the relative impact of each.)

In practice, the analysis/modelling process tends to be iterative, as it is common to re-run the Monte Carlo simulation with refined inputs as the planning becomes better defined, decisions are committed to, or the values of sub-elements become 'locked in' and are no longer uncertain (e.g. the selection and booking of a specific hotel).

10.Conclusion

To re-cap:

1.Monte Carlo simulations work by repeatedly sampling from a set of pre-defined random variables within a model and aggregating the results across all runs to understand the overall range of outcomes.

2.Inputs to Probabilistic Estimates are typically made up of:

  • Uncertainty in existing components/sub-elements that are 'continuous' in nature (where a number may go higher or lower)
  • Discrete events that have a likelihood of occurrence and, once triggered, tend to create a significant impact that may be 'discontinuous'
  • Dependencies between these variables in the form of correlations, 2nd order discrete events or 'multiplicative' effects (i.e. a random variable that is multiplied by several other random variables)

3. Monte Carlo simulations can be used to gain valuable insight into the interactions between different uncertain inputs and to understand the key drivers of variability in an estimate (and hence guide better decisions or the development of response/mitigation plans).

4. However, the accuracy of Monte Carlo simulations relies on the robustness of the underlying assumptions and exclusions used in the model, and on the reliability of the information used to estimate the uncertainty of the model's sub-elements.

Hopefully this article has given you a good introduction to using Monte Carlo simulations to build probabilistic risk estimates. Although the holiday budget example was a trivial, fictional one, the principles shown are used across a wide range of fields, from Project Management, Finance/Investment and Insurance to Health & Safety.

Leave me a note below in the comments if you enjoyed this article or if you have any other feedback !
