Understanding Probability When Data is Scarce

Iman Karimi
GAMMA — Part of BCG X
9 min read · Aug 22, 2018

Iman Karimi is a Project Leader and Lead Data Scientist at BCG Gamma, based in London. He has a PhD in risk management as well as an MBA in Strategy from Cambridge University.

Big Data currently dominates most discussions on advanced analytics, no surprise given the explosion of data from the digital revolution. But companies are still struggling with problems for which data is scarce. These usually involve events that are both rare and high impact. Insurance companies try to understand the likelihood of an earthquake, for example, or banks factor in the risk of another global financial crisis.

In those situations, it’s better to understand an event’s probability not as a single number, say a 2% probability, but as a range of, say, 1–5% — or, even better, as a distribution of probability values. By making probability a bit fuzzy, as a distribution rather than a point, we can help decision-makers treat these events more intelligently.

This fuzzy probability cuts both ways. It can show that a scary event might be more likely than leaders think, so they need to proceed more cautiously. But it also reduces their uncertainty, so they can proceed more confidently overall. Data may be sparse, but advanced analytics still enable a better understanding of the outcome’s likelihood.

When Probability Gets Fuzzy

Suppose we have a normal six-sided dice. We see that the dice is structurally symmetrical, so we expect each side to have equal chance of appearing face up — this is our theoretical understanding. We also know of many instances of rolling such a dice, where each side comes up as often as the others — this is our empirical observation. From either of these sources we can confidently estimate the probability of rolling, say, a 🎲 at 1/6, or 16.7%.

But suppose the dice is visibly damaged and no longer symmetrical. We now have no data on past rolls, and assessing the effects of the damage by simulation would be cumbersome. What can we say now about the probability of rolling a 🎲?

Here we have two types of uncertainty:

1. The “aleatory” uncertainty (from the Latin word alea, meaning dice) about the occurrence of an event with a clear probability, such as rolling a 🎲 (one) with a normal dice.

2. The “epistemic” uncertainty (from the Greek word episteme, for knowledge) that captures the lack of full confidence in the probability value for the aleatory uncertainty.

If we could roll the damaged dice enough times, we could estimate the probability of a certain outcome based on its relative frequency. If in 600 throws we roll 🎲 120 times, then we can confidently assess the probability at 20%, rather than 16.7%. If we have only, say, 15 throws, then we have little confidence in whatever probability value we generate.
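To make that loss of confidence concrete, here is a minimal Python sketch that computes a 95% Clopper-Pearson interval for the probability of rolling a one, using scipy’s beta distribution. The counts are illustrative: the point estimate is 20% in both cases, but the interval from 15 throws is several times wider.

```python
import scipy.stats as stats

def freq_interval(ones, throws, conf=0.95):
    """Clopper-Pearson confidence interval for the probability of rolling a one."""
    alpha = 1 - conf
    lower = stats.beta.ppf(alpha / 2, ones, throws - ones + 1) if ones > 0 else 0.0
    upper = stats.beta.ppf(1 - alpha / 2, ones + 1, throws - ones) if ones < throws else 1.0
    return lower, upper

print(freq_interval(120, 600))  # roughly (0.17, 0.23): a tight band around 20%
print(freq_interval(3, 15))     # roughly (0.04, 0.48): the same 20% estimate, little confidence
```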

This is a real problem for insurers estimating potential losses from earthquakes, floods and storms. We still have big gaps in our understanding of the physics of these phenomena. And (fortunately) the most destructive of these events are still rare, so we have little empirical data to work with.

One solution is to merge our theoretical and empirical sources of data, using Bayes’ theorem. But the resulting probability value implies full confidence in that value, even though it comes from two incomplete sources of knowledge.
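As a rough illustration of that merging step, here is a sketch that encodes the “fair die” theory as a Beta prior and updates it with a handful of observed rolls. The prior strength and the roll counts are assumptions for the example, not data from the article.

```python
# Assumption: we trust the "fair die" theory about as much as 30 imaginary rolls.
prior_strength = 30
a0 = prior_strength * (1 / 6)            # prior pseudo-count of ones
b0 = prior_strength * (5 / 6)            # prior pseudo-count of non-ones

ones, throws = 5, 15                     # assumed sparse observations of the damaged die
a, b = a0 + ones, b0 + (throws - ones)   # conjugate Beta-Binomial update

merged_probability = a / (a + b)         # posterior mean, the single "merged" value
print(round(merged_probability, 3))      # about 0.222, between the 1/6 theory and the 1/3 data
# The catch: this one number is reported as if we were fully confident in it,
# even though it hinges on an arbitrary prior strength and only 15 throws.
```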

A better approach, one that’s been around for almost half a century, is the concept of imprecise probability — expressed as an interval between lower and higher probabilities. Take, for example, the Dempster-Shafer theory of evidence [Shafer, 1976]. The theory stipulates a lower-bound probability, “belief,” and an upper-bound one, “plausibility,” both drawn from the “evidence,” i.e. the observed data. Somewhere in that range resides the true unknown probability. This range is different from a simple probability distribution arising from multiple forms of aleatory uncertainty.
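A toy example may help. The mass values below are made up purely to illustrate how belief and plausibility bracket the unknown probability of rolling a one; they are not derived from any real evidence.

```python
# Basic probability masses assigned to subsets of die faces (illustrative numbers).
masses = {
    frozenset({1}): 0.10,                  # evidence pointing directly at a one
    frozenset({1, 2, 3}): 0.30,            # evidence that only narrows it to the low faces
    frozenset({1, 2, 3, 4, 5, 6}): 0.60,   # mass left on "we don't know" (total ignorance)
}

def belief(event):
    """Lower bound: total mass of the subsets fully contained in the event."""
    return sum(m for s, m in masses.items() if s <= event)

def plausibility(event):
    """Upper bound: total mass of the subsets that at least intersect the event."""
    return sum(m for s, m in masses.items() if s & event)

one = frozenset({1})
print(belief(one), plausibility(one))  # 0.1 and 1.0: the true probability lies somewhere in between
```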

A Better Probability Distribution

Going further, we can draw on fuzzy set theory, as developed by Lotfi Zadeh [Zadeh, 1984]. Here we go deeper than a simple interval and express likelihood as a distribution. Chongfu Huang presented this “fuzzy probability” as a “possibility-probability distribution,” or PPD [Huang, 1995].

I’ve built on Huang’s work to develop a broader, more consistent and more practical version of the PPD. I’ve also addressed some shortcomings, such as the dependency of the PPD’s resolution on the number of observations and the inability to incorporate background or expert knowledge. Readers interested in the mathematical details may refer to my journal papers listed in the references below.

The simplest explanation of the possibility of an event is one minus the degree of surprise we would feel if that event occurred. With the symmetrical dice, no one would be surprised to see any number between one and six, so we say each side has a 100% possibility and a 16.7% probability. With a damaged dice, while we have much less confidence about the probability, the possibility of any side is still 100%, unless the damage makes one side much less likely to appear.
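In code, that reading of possibility is almost trivial; the surprise scores below are assumed for illustration only.

```python
# Degree of surprise (0 = not surprised at all, 1 = maximally surprised) for each face.
surprise = {face: 0.0 for face in range(1, 7)}     # assumption: no face would surprise us
possibility = {face: 1 - s for face, s in surprise.items()}
probability = {face: 1 / 6 for face in range(1, 7)}

print(possibility[1], round(probability[1], 3))    # 1.0 possibility, 0.167 probability
```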

The PPD framework combines partial prior knowledge and scarce observed data to assess the risk of a rare, hard-to-understand event through a distribution.

Here’s a real-world case. In the past five centuries, Istanbul has been rocked by 16 powerful earthquakes with magnitudes between 6.8 and 7.4. Figure 1a presents the empirical analysis: a point estimation of probability based on the relative frequency of each magnitude. Figure 1b presents the theoretical analysis: the probability from the Gutenberg–Richter relation, the standard approach in seismology, where frequency declines with magnitude.

Figure 1: The magnitude probabilities: (a) based on observed data; (b) based on the Gutenberg–Richter relation
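For readers unfamiliar with it, the Gutenberg–Richter relation says the rate of earthquakes with magnitude at least M falls off as log10 N(≥M) = a − bM. The sketch below turns that into a per-magnitude-bin probability over the 6.8–7.4 range, analogous to Figure 1b; the a and b values are placeholders, not a calibration for Istanbul.

```python
import numpy as np

a, b = 4.0, 1.0                                  # assumed Gutenberg-Richter parameters
magnitudes = np.linspace(6.8, 7.4, 7)            # the 6.8-7.4 range, in 0.1 steps

def annual_rate(m):
    """Annual rate of events with magnitude >= m: N(>=m) = 10**(a - b*m)."""
    return 10 ** (a - b * m)

# Rate of events falling inside each 0.1-magnitude bin, normalised into a
# "theoretical" probability per bin.
bin_rates = annual_rate(magnitudes) - annual_rate(magnitudes + 0.1)
probs = bin_rates / bin_rates.sum()
for m, p in zip(magnitudes, probs):
    print(f"M{m:.1f}: {p:.2f}")
```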

We can clearly see the contradiction between the probabilities derived from the sparse observed data and those from the more regular prior knowledge. Yet we are not confident in the probabilities from either source. Figures 2a and 2b show the possibility-probability distribution of magnitude if we take each of these sources independently, using my formalization. We can combine these empirical and theoretical results through a Bayesian formulation, resulting in Figure 2c. Here we visibly capture the information from both sources and resolve the contradictions, especially for magnitudes 6.9 and 7.3, where prior knowledge and observed data indicate vastly different probabilities.

Figure 2: Possibility-probability distributions of magnitude: (a, b) from each source taken independently; (c) combined through the Bayesian formulation

Moreover, this improved PPD is a more informative and reliable way of expressing risk, as we can see by applying it to the insurance industry. Consider the industry’s common method of expressing the risk of loss, the exceedance probability curve. As shown in Figure 3, for each amount X this curve gives the probability of a loss equal to or greater than X. It always has a probability of 1 at zero loss, and there is a certain loss amount that is impossible to exceed, often dictated by the maximum hazard or the maximum exposed assets.

Figure 3: A sample exceedance probability curve
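For readers who prefer code to curves, here is a minimal sketch of how such a curve can be computed from a set of loss scenarios; the lognormal losses are simulated placeholders, not insurance data.

```python
import numpy as np

rng = np.random.default_rng(0)
losses = rng.lognormal(mean=10, sigma=1.5, size=10_000)   # hypothetical annual losses

thresholds = np.linspace(0, losses.max(), 200)
exceedance = [(losses >= x).mean() for x in thresholds]   # P(Loss >= X) for each amount X

print(exceedance[0])    # 1.0: the probability of exceeding zero loss, as noted above
print(exceedance[-1])   # essentially 0: losses beyond the maximum scenario cannot be exceeded
```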

Let’s express this curve as a PPD. The fuzzy probability of exceeding the loss l_k is given by:

P(L ≥ l_k) = {(θ, π(l_k, θ)) : θ ∈ [0, 1]}

where P is the fuzzy exceedance probability and π(l, θ) is the possibility value of l for each probability θ, as seen in Figure 4.

Figure 4: Loss fuzzy probability of exceedance

This fuzzy probability distribution of loss can be applied to give a more realistic picture of the actual risk. If we want to find the fuzzy probability of a loss greater than or equal to a certain loss l_k, as in Figure 5a, that is equivalent to cutting the 3D surface of Figure 4 at that loss level. We can also see the possibility distribution of loss corresponding to a certain probability of exceedance θ_j, as in Figure 5b.

Figure 5: Schematic possibility distribution of (a) the probability of exceedance of loss l_k; (b) the loss corresponding to a probability of exceedance of θ_j
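A short sketch of what that “cutting” means computationally: if the PPD is stored as a 2D array of possibility values indexed by loss and probability, the slices of Figure 5 are just a row and a column. The surface below is a made-up placeholder, not an output of the actual PPD construction.

```python
import numpy as np

losses = np.linspace(0, 500, 51)          # hypothetical loss grid (say, $m)
thetas = np.linspace(0, 1, 101)           # candidate exceedance probabilities
L, T = np.meshgrid(losses, thetas, indexing="ij")

# Toy surface: possibility peaks where theta follows a decaying exceedance curve.
pi = np.clip(1 - 5 * np.abs(T - np.exp(-L / 100)), 0, 1)

k = 20                                    # index of the loss level l_k of interest
pi_theta_at_lk = pi[k, :]                 # Figure 5a-style slice: possibility over theta at loss l_k

j = 50                                    # index of the exceedance probability theta_j
pi_loss_at_thetaj = pi[:, j]              # Figure 5b-style slice: possibility over loss at theta_j
```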

Moreover, representing the PPD as a fuzzy relation allows us to go beyond computing the possibility distribution that corresponds to a precise, or “crisp,” probability or loss. We can employ fuzzy set theory to formalize linguistic variables such as “improbable,” “young,” and “tall.” For example, we can seek the loss distribution corresponding to a “high” probability, or the probability distribution of a “catastrophic” loss, instead of a probability of 90% or a loss of $500m. To do so, a so-called “membership function” must be defined for each linguistic variable. In classic set theory, membership of a set is binary: 1 if a point belongs to the set, 0 if it doesn’t. In fuzzy sets, the membership can be any value between 0 and 1.

Figure 6: Membership function of a classic (crisp) set and a fuzzy set

The fuzzy membership function of “high probability” or a “catastrophic loss” should be defined in the application context by experts. Fuzzy relational operations then allow us to derive the corresponding PPD.
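Here is a minimal sketch of such an expert-defined membership function; the thresholds are chosen purely for illustration.

```python
import numpy as np

def ramp_up(x, a, b):
    """Shoulder-shaped membership: 0 below a, rising linearly to 1 at b, 1 above b."""
    return np.clip((x - a) / (b - a), 0.0, 1.0)

theta = np.linspace(0, 1, 101)

# Assumed expert definition of "high probability": fully high above 0.8, partially high from 0.6.
mu_high = ramp_up(theta, 0.6, 0.8)

# A crisp set ("probability >= 0.8") only allows membership 0 or 1, by contrast (cf. Figure 6).
mu_crisp = (theta >= 0.8).astype(float)

print(mu_high[70], mu_crisp[70])   # at theta = 0.7: membership ~0.5 in the fuzzy set, 0 in the crisp one
```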

These possibility distributions can be defuzzified to become crisp values of probability or loss. The most common defuzzification method is taking the maximum of the membership function as the defuzzified value. But blind defuzzification would deprive decision-makers of valuable information about the distribution, and hence about confidence levels, which can help them avoid misjudging the risk.
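The two most cited defuzzification rules are easy to sketch. The triangular possibility distribution below is an arbitrary example, chosen only to show that the crisp output throws away the shape of the distribution, and with it the confidence information.

```python
import numpy as np

theta = np.linspace(0, 1, 1001)
# Asymmetric triangular possibility distribution: core at 0.25, support roughly [0.10, 0.60].
pi = np.clip(np.minimum((theta - 0.10) / 0.15, (0.60 - theta) / 0.35), 0, 1)

max_defuzz = theta[np.argmax(pi)]                    # maximum-membership rule: the least surprising value
centroid_defuzz = np.sum(theta * pi) / np.sum(pi)    # centroid rule: the "centre of gravity"

print(round(max_defuzz, 2), round(centroid_defuzz, 2))   # 0.25 vs roughly 0.32, and the shape is gone
```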

Consider that the likelihood of a severe loss is expressed by the fuzzy probabilities shown in Figure 7a and Figure 7b. The defuzzified values of both fuzzy probabilities are equal to θ_i, regardless of the defuzzification method. According to Figure 7a, we are certain that the true probability lies within the interval [θ_l(1), θ_u(1)]. If we mark the same probability interval in Figure 7b, we see that the vertical line cuts the possibility axis at a value around 0.4. Due to how we compute the PPD in this method (see my papers in the references), such an “alpha-cut” (α ≈ 0.4) has a confidence level of 1 − α, or ~0.6 in this case. While the defuzzified probability of the two distributions is identical, in one case we are 100% confident that the true probability is confined to the [θ_l(1), θ_u(1)] interval, and in the other case only 60% confident.
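One way to make that confidence-level reading concrete is the necessity measure from possibility theory: the confidence that the true probability lies in an interval is one minus the highest possibility found outside it. The two shapes below are illustrative stand-ins for Figures 7a and 7b, not the actual distributions.

```python
import numpy as np

theta = np.linspace(0, 1, 1001)

pi_a = ((theta >= 0.2) & (theta <= 0.3)).astype(float)   # crisp block over [0.2, 0.3], a "7a"-like case
pi_b = np.clip(1 - 12 * np.abs(theta - 0.25), 0, 1)      # wider triangle around 0.25, a "7b"-like case

def confidence_in(theta, pi, lo, hi):
    """Necessity of [lo, hi]: 1 minus the largest possibility outside the interval."""
    outside = pi[(theta < lo) | (theta > hi)]
    return 1 - outside.max()

print(confidence_in(theta, pi_a, 0.2, 0.3))   # 1.0: fully confident the probability is in [0.2, 0.3]
print(confidence_in(theta, pi_b, 0.2, 0.3))   # ~0.6: the interval is roughly the alpha = 0.4 cut
```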

Yet, if these two probability intervals are presented as the likelihood of a risk, the second one actually reassures many people, because the greater uncertainty gives them a kind of pseudo-optimism that things probably won’t go terribly wrong. For a big loss, one fuzzy probability might span 20–30%, while the fuzzier one spans 1–49%. The two ranges have the same average risk of 25%. But the natural human tendency toward optimism might lead decision-makers given the latter range to mistakenly see the actual probability as closer to 1%.

And just to show that knowing the widest probability interval is not as valuable as knowing the entire fuzzy PPD, compare Figure 7b with Figure 7d, which is “less fuzzy.” Both have the same support (the probability interval with a confidence of 100%), and yet the same probability interval has a confidence level of only ~0.8 in Figure 7d.

Figure 7a–d: Schematic possibility distributions of probability

The possibility distribution of Figure 7c has the same defuzzified probability as 7d. However, the centroid of the third distribution lies to the right of the core, which signals that the least surprising probability might be an underestimation of the risk, yet another manifestation of pseudo-optimism.

Conclusion

We implemented this risk assessment framework, with tools and techniques from fuzzy sets and probability theory, to address multiple types of uncertainty. In cases with sparse observed data and limited physical knowledge of a phenomenon, we can augment the usual probability distribution with another dimension of uncertainty. Thus we can model the imprecision of the actual probability. This hybrid framework of fuzzy probabilistic risk incorporates the reliability of sparse data and of incomplete expert or background knowledge. It is therefore suitable for combining data-driven and knowledge-driven modelling of risk and decision making.

References (also given as links in the text)

G. Shafer, 1976. A mathematical theory of evidence. Princeton University Press.
C. Huang, 1995. Fuzzy risk assessment of urban natural hazards. Fuzzy Sets and Systems, 83:271–282.
L. A. Zadeh, 1984. Fuzzy probabilities. Information Processing and Management, 20:363–372.
I. Karimi, E. Hüllermeier, and K. Meskouris, 2007. A Fuzzy-Probabilistic Earthquake Risk Assessment System. Soft Computing, 11:3.
I. Karimi and E. Hüllermeier, 2007. Risk Assessment System of Natural Hazards: A new approach based on fuzzy probability. Fuzzy Sets and Systems, 158: 987–999.
