Sampling, probability and certainty

troy.magennis
Forecasting using data
Jul 7, 2017

Chapter 5

This chapter continues to explore statistical sampling. We look into the probability that the answer given by sampling matches reality. We mention (again) that when using sampling you can NEVER be “certain,” but you can be sure enough to tip the balance, so that a decision made with data is more informed than one made with no data at all.

Goals of this chapter –

  • Define probability and certainty
  • Learn to calculate probabilities for unknown fixed and range values
  • Look at how the number of samples relates to probability of an estimate being “right” on average

Probability and (un)certainty

“Uncertainty indicates we have limited knowledge about the future and can only represent our understanding with possibilities, and the probability of those possibilities”
(Spetzler, Winter, & Meyer, 2016)

Probability puts a number on how likely one possible future outcome is versus all the other possible outcomes. It is often stated as a percentage[1], with 0% meaning a possible outcome has the lowest chance of occurring and 100% meaning it has the highest. This book helps you understand the grey area between certain and never — where the outcome is uncertain[2] at this time, but will be known at some future time.

When talking about probability, it’s important to remember just how hard it is to predict the future. If it were easy, I’d suspect the people who perform prediction on stage (or on Wall Street) would all be retired and own small islands, and all faith healers would be doctors. No matter how certain something seems, guesses, even by well-informed experts, are still guesses, no matter how vibrantly presented. Nothing is certain except death and taxes, as they say.

While on the subject of death, here is my prediction: I will die one day. The longest documented human lifespan is 122 years and 164 days[3], and that’s enough evidence for me to believe I’m done for at some time before that age, in some exotic and hopefully rapid fashion. That’s not to say immortality can’t happen to me, but I’ve left my run a little late to eat more vegetables, exercise, and limit drinking before 5pm. So, I give the highest probability, 100%, to me eventually dying, but that still leaves the questions of when and, self-indulgently, how unanswered.

To predict the method and time of my death, I need to apply probabilities. I look at the historical age of death for people similar to me (male, white, no cigarettes or recent skydiving activity) to give a probability of what my age and method of departure could be, based on historical frequency. Even with no data about my death, the data about similar others is enough to get a set of possibilities, and a probability for each of those possibilities. This gives a good enough outcome that insurance companies have worked out what my life insurance premiums should be to make insuring people similar to me a profitable activity on average. Some of us will die earlier than planned, and some well after, but in the long run insurance companies will make a tidy profit. This book is about how we can use similar techniques for forecasting software projects using similar historical data. We don’t know when your project will finish, but we do know how others similar to it have delivered, and we can use that to make a more informed guess.

The well-meaning quote often given, “absence of evidence isn’t evidence of absence,” says immortality can’t be ruled out just because we haven’t seen it occur yet. Although we can’t rule out this possibility, I’ve seen enough obituary entries in the local newspapers to consider my death so close to inevitable that semantic nuance won’t save me (if you are under the age of 18, google “newspaper”). John Cook, a colleague with a PhD in mathematics and applied statistics, rarely gets rattled. He is calm and precise. Except one day on the phone when we discussed the absence-of-evidence quote. “Dinosaurs; if they still existed I think we would have seen one. Every day that goes by, we have to be feeling more confident that extinction occurred even though we have no strict evidence.” John is saying that even without conclusive observational evidence, which would require simultaneously looking at every part of the planet at the exact same moment, every day we don’t see a Tyrannosaurus playing joyfully with our children increases our confidence that dinosaurs don’t still exist in living form. Certainty grows the more samples we have, and the longer we reliably observe. But we can never be certain; one sample could change everything.

I’m going to stick my neck out and assume some of you who read the previous paragraph are thinking about swans. Especially the color of them. The Black Swan Theory describes how common knowledge, the facts and wisdom known as true to everyone, can sometimes be wrong. The story goes that in Europe it was common knowledge, even formalized in a Latin proverb, that through observation all swans are white. A “black swan” was used as a metaphor for something that doesn’t exist; so certain was everyone that an alternative-colored swan was thought ridiculous. That was until 1697, when the Dutchman Willem de Vlamingh observed black swans frolicking happily off the coast of what is now known as Australia (New Holland at that time). One boat trip meant it was now known for certain that not all swans are white. Even with millions of observed samples, one black swan was enough to render those millions of prior observations moot. If there was one black swan, how many other colors are there? Pastel purple, pink or blue? None of these colors are out of the question now that we know not all swans are white. The mind boggles that we can’t be certain of anything anymore.

Before pouring a stiff drink to calm your nerves, consider: does this really matter? The chance of a swan seen alive in Europe being white is still almost 100%, then and now, except in zoos and apparently a few escapees in the wild. If you are asked to take a bet on what color a swan in Europe is, bet on white. There are considerably more white swans, so the overwhelming chance is still white. Context matters, and even though we know some swans can be black, it’s not relevant to my probability in my local context, sitting in a park somewhere in Paris.

Nassim Taleb picked up the Black Swan theme in his books on life and uncertainty. He uses the term Black Swans for events that were unforeseeable in advance, obvious in hindsight, and extremely rare, verging on unprecedented. They are also highly, highly impactful (Taleb N. N., 2007). Black Swan events break the standard tools of probability and prediction using historically observed data. They also account for breaking many banks, economies and civilizations. We need to be aware of them, but we also need to acknowledge their rarity and the futility of being paralyzed in analysis attempting to account for them. Being unforeseeable in advance makes them challenging to account for. It’s worth reading Fooled by Randomness (Taleb N. N., 2005) and The Black Swan (Taleb N. N., 2007) to understand their definition and impact. These books are comical in places, alarming in others, but always eye-opening about how to apply uncertainty in the real world. For now, overly worrying about Black Swan events gets in the way of learning how probability applies on average and how it beats gut instinct in decision making. We will seriously consider them later in this book, and arm you with thinking techniques to account for them, to the degree that is possible, when forecasting. For now, all swans are white unless you are in Perth.

Returning to our original definition, a probability is a measure of how certain we are that one possibility is more likely than any other possibility for something occurring in the future. To calculate probabilities, we need a good list of possibilities, and a way of sharing the 0% to 100% probability pie among those possibilities.

Possibilities and Probabilities

How many possibilities are there when tossing a traditional coin? Many of you said two: heads or tails. My wife is a lawyer, and she would say (and did) that there isn’t enough information to answer that question. The coin could land against a wall and be ambiguous, or it could roll down a drain or off a cliff and be un-observable. Her wording of the question would be “What is the chance of tossing a head?” Strictly speaking, the answer is 1 in 3: heads, tails and un-observable are all possibilities. Many of you are now shaking your heads and asking, what is the chance of un-observable happening? And that’s the point of this chapter, and of this book. The un-observed is important to consider and manage. If heads or tails is played on a flat surface with lots of space, there is a very low chance of un-observable. If the coin is tossed near a cliff edge (and I don’t want any of you to go out and attempt to prove this), there is a much higher chance of un-observable.

When building a set of possibilities, you need to consider even the rare ones, so you can determine whether context or location might increase or decrease the odds of them occurring. Small changes in context can change the probabilities in meaningful ways. Asking what color a swan is in Europe gives a higher chance of white; in Western Australia, an equal or higher chance of black (because of their coolness). Context matters. And to stress again, the possibilities of swan color aren’t just black and white. The possibilities are black, white, and other colors we haven’t seen yet.

Generating possibilities takes imagination, and sometimes experience or a sinister mind (as in my wife’s case, hi darling!). When returning planes were analyzed for holes (remember the previous chapter), it was the un-observable planes and the places with no holes that were important. Only by considering all of the possibilities can true answers emerge, no matter how many samples we observe. Absence matters.

For the rest of this book, when I say coin toss, I mean the observable possibilities when tossing a coin. Now the odds are 50% heads, 50% tails. We revisit observable and un-observable possibilities throughout this book to make sure they are considered, and only dismissed when properly accounted for.

Observable possibilities of coin toss:
Heads — 50%, Tails — 50%

How many observable possibilities are there rolling a six-sided dice? Many of you said six. And you’re right. A standard six-sided dice has the numbers 1 to 6 printed on the six sides of a cube. What is the chance of rolling any one of the numbers? 1 in 6. Standard dice are designed to give an equal chance for each side. Rolling a dice hundreds and thousands of times should give an equal count of each value. Except that’s not guaranteed. Each dice roll is independent. The value of the previous roll has no influence on the odds of the next. There is no physical force that makes each side show up at an equal rate. Every roll could be a 1. Or a 6. Or a 2, 3, 4 or 5. We often expect randomness to mean no pattern, but it really means there is no guarantee of any identifiable pattern over the long run. Rolling six ones in a row has the same odds as rolling the sequence 1, 2, 3, 4, 5, 6 or any other set of numbers.

Observable possibilities of a six-sided dice roll:
1–16.7%, 2–16.7%, 3–16.7%, 4–16.7%, 5–16.7%, 6–16.7%
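The point about independence is easy to see in a quick simulation. This is just an illustration of mine, not the book’s code; the seed and roll count are arbitrary:

```python
import random

random.seed(7)  # fixed seed so the illustration is repeatable
rolls = [random.randint(1, 6) for _ in range(600)]  # 600 independent rolls
counts = {face: rolls.count(face) for face in range(1, 7)}
print(counts)  # roughly 100 per face, but rarely exactly 100
```

Run it a few times with different seeds: the counts wander around 100 per face, and runs of repeated values appear, yet no roll ever depends on the one before it.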

How many observable six-sided dice values are less than 4? The correct answer is 3: 1, 2 and 3 are the only observable values less than 4. What are the odds of rolling a value less than 4? It is the sum of the probability of each “right” value, the chance of a 1, 2 or 3 summed, which gives 50% after accounting for a little rounding error.

16.7% + 16.7% + 16.7% = 50%

An alternative, and often easier, way to calculate this: take the number of “right” possible values (3) and divide by the number of all possible values (6).

To calculate the probability of a single possible value or a set of them, we divide how many possibilities are in the group by the total number of possibilities. How many observable six-sided dice values are at least 2? Five (2 through 6), giving 5 ÷ 6, or roughly 83%.

This is as complex as probability gets. To compute a probability –

1. Gather a set of possible outcomes. Take care to look for un-observable outcomes that may be missed, and tighten the definitions so you know what you are measuring.

2. Count the number of possible outcomes that match the desirable criteria.

3. Divide the count of desired outcomes by the total possible outcomes.
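The three steps above can be sketched directly in code. This is a minimal illustration of mine; the `probability` helper is hypothetical, not from the book:

```python
# Probability by counting: outcomes that match, divided by all outcomes.
from fractions import Fraction

def probability(outcomes, matches):
    """Steps 2 and 3: count the matching outcomes, divide by the total."""
    outcomes = list(outcomes)
    favourable = sum(1 for o in outcomes if matches(o))
    return Fraction(favourable, len(outcomes))

dice = range(1, 7)  # step 1: all observable outcomes of one roll
print(probability(dice, lambda v: v < 4))   # 1/2, i.e. 50%
print(probability(dice, lambda v: v >= 2))  # 5/6, about 83%
```

Using exact fractions sidesteps the rounding error mentioned above (16.7% × 3 is only approximately 50%).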

Range probabilities — probability without knowing all the possibilities

With dice and coin problems, the observable possibilities are finite and known. When calculating probabilities from sample observations of an unknown range of values, there is no way of knowing all of the possibilities in advance. We just know the sample values we have seen so far; the next one could be above, below or between the ones we have seen. Calculating probabilities for range problems requires knowing how many possible interval positions a value could fall into. Rather than calculating the probability of an exact value, we compute the probability, on average, that the next sample falls into a possible interval.

The following scenario is for samples taken from a non-repeating set of possible values. Just like tank serial numbers! You could also consider elapsed time a non-repeating sequence of values, and I often use these same formulas when dealing with lead-time and cycle-time values, as you will learn.

Let’s look at the tank serial number problem again. The serial numbers start from 1, are sequential, and ascend to an unknown maximum value. When we get the first sample, we don’t learn a lot, except that the next sample has a 50% chance of being above and a 50% chance of being below the first. The second sample gives us a lot more information: we get a range, and any future sample has three distinct possibilities. From one up to the lowest sample, between the two samples, or above the highest sample. These partitions are called intervals. Given no other information about likelihood, each possible interval needs to be given the same probability, 33%. With a third sample, there are now four possible intervals the next sample could fall into: below the lowest seen, between the lowest and the middle sample, between the middle and the highest, or above the highest. Four equal possibilities, each with a 25% chance. If we need to know the chance, on average, that the next sample serial number is higher than the highest we have seen, the answer would be 25%. There is a 75% chance the next sample is lower than the highest we have seen. By three samples, the probability that the next sample is higher than anything previously seen has dropped to 25%. Figure 8 shows range probabilities for equal-chance intervals.

Figure 8 — Probability next sample is within an interval on a non-repeating sequential range for 2 and 3 samples
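The interval reasoning above can be checked by simulation. The sketch below is my own (population size, trial count and function name are all assumptions): it draws n samples without replacement from a sequential range, then measures how often a fresh sample exceeds the highest seen. The result hovers near 33% after two samples and 25% after three, matching the reasoning above:

```python
import random

def chance_next_exceeds_max(n_samples, population=10_000, trials=20_000, seed=1):
    """Estimate the chance a fresh sample beats the highest of n prior samples.

    Samples are drawn without replacement from the sequential range
    1..population, mirroring non-repeating serial numbers.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        draws = rng.sample(range(1, population + 1), n_samples + 1)
        prior, nxt = draws[:-1], draws[-1]
        if nxt > max(prior):
            hits += 1
    return hits / trials

print(chance_next_exceeds_max(2))  # near 1/3
print(chance_next_exceeds_max(3))  # near 1/4
```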

Applying the logic shown yields Equation 5 and Equation 6. These probabilities are the average chance. No certainties, just that the result is more likely than not to fall as the probability says if the exercise is run enough times. The problem is, you are usually running the calculation just once.

Equation 5 — Probability on average for each possible interval above, below or between the n samples seen so far: 1 / (n + 1)
Equation 6 — Probability on average the next sample falls within the range seen so far: (n − 1) / (n + 1)

Table 3 shows the results of Equation 5 and Equation 6 for 1 to 30 prior samples. Notice how quickly the certainty grows that the next sample is within the previously seen range. A common use of these calculations is to understand the chance that the next sample is above the highest seen so far. Since there is just one interval above the highest seen, that chance equals the probability of one interval (Equation 5).

Table 3 — Probabilities after n prior samples

This rapid increase in certainty about where, on average, the next sample falls often surprises people. Less data than you might think can help make informed decisions. In fact, after nine samples, each additional sample changes the probability of each interval by less than one percentage point. Yes, I am telling you that if samples are reliably taken, there is a diminishing return on how much is learnt after nine samples or so. For some decisions, this is enough to make an informed decision. It may not be certain enough for medical decisions about me, but it’s often enough to shine a bright light on a bias or an errant gut instinct.
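The diminishing-returns claim can be reproduced directly. Assuming, consistent with the worked figures in this chapter, that each of the n + 1 intervals gets probability 1/(n + 1) and the chance of landing within the seen range is (n − 1)/(n + 1):

```python
def p_one_interval(n):
    """Assumed form of Equation 5: chance the next sample lands in any one interval."""
    return 1 / (n + 1)

def p_within_range(n):
    """Assumed form of Equation 6: chance the next sample lands inside the seen range."""
    return (n - 1) / (n + 1)

for n in (1, 2, 3, 5, 9, 10, 30):
    print(n, f"{p_one_interval(n):.1%}", f"{p_within_range(n):.1%}")

# After nine samples, the gain per extra sample drops below one percentage point:
print(f"{p_one_interval(9) - p_one_interval(10):.2%}")  # 0.91%
```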

We have looked at sampling generally here. Many assumptions have been made that may not hold in all circumstances. For example, not all values may be sequential or unique, and the chance across the entire range may not be evenly distributed. Context always matters, and later chapters continue to explore how to sample reliably to answer specific questions.

Tank Problem Redux

In the last chapter we looked at how to estimate the number of tanks produced. Sure, this problem of tank production doesn’t come up every day. The general problem, though, does: estimating the likely range of values, and how likely another sample is to fall above or below the extremes seen. How certain should you be after how many samples? The Taxi Formula follows similar logic, estimating the probability of samples falling in different parts of the actual range. Assuming the samples are reliably taken and the actual serial numbers are non-repeating and sequential, the average interval emerges quickly. The taxi formula takes the highest serial number seen and adds one average interval. It takes the highest seen and adds a bit.

If the story recounted is accurate, the researchers had secured two tanks. They had two serial numbers from the transmissions. That sounds tiny, but it means there is a 66% chance they had already seen the highest. There is no way of knowing for certain, but the odds were in their favor. The researchers boosted their chances by using the serial numbers from the tank-track rubber dolly wheels. Each tank has 48 of them, which brings the sample count to 96. This means the chance of having seen a high serial number was increased, more of the range of numbers was explored, and the computed average interval becomes more accurate.

Knowing (even approximately) the highest and lowest serial numbers seen makes it pretty easy to estimate how many serial numbers were allocated: subtract the lowest from the highest. Divide by how many of that component are used on each tank, and you have an estimate of total tanks produced. Double-check by doubling the average of the samples, and triple-check by doubling the median of the samples. This gives three different methods to compare and contrast. Averaging all of the answers balances the risk of any one of them being way off.
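As a sketch, the cross-checks might look like this in Python. The serial numbers are made up for illustration, and the taxi formula is written as the highest seen plus one average interval (m + m/n − 1), assuming serials start at 1:

```python
import statistics

def taxi_estimate(samples):
    """Highest seen plus one average interval: m + m/n - 1."""
    m, n = max(samples), len(samples)
    return m + m / n - 1

def estimates(samples):
    """Three estimates of the range maximum, to compare and contrast."""
    return {
        "taxi": taxi_estimate(samples),
        "double_mean": 2 * statistics.mean(samples),      # double the average
        "double_median": 2 * statistics.median(samples),  # double the median
    }

serials = [19, 40, 42, 60]  # hypothetical observed serial numbers
est = estimates(serials)
print(est)                            # taxi 74.0, double_mean 80.5, double_median 82.0
print(statistics.mean(est.values()))  # average the three to balance the risk
```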

How certain? Plugging 96 into Equation 5 gives an average interval probability of 1.03%. With only one interval above the highest sampled, we should feel 98.97% confident we had seen the highest serial number. There is a reason the estimates were so close to the eventual actuals determined through post-war records. Sure, they were lucky. They couldn’t know for sure the serial numbers were sequential, or whether the two tanks they procured came from the same manufacturing plant and were given close-proximity serial numbers. But probability was on their side. I think we can agree, better than the 1,000 estimate!

With sampling you can never be sure, but you can’t be any surer without sampling.

Summary

Given our definition of uncertainty:

“Uncertainty indicates we have limited knowledge about the future and can only represent our understanding with possibilities, and the probability of those possibilities”

This chapter has shown that to calculate probability, we generate a list of all possible values, or all possible range intervals, and allocate each a proportion of 100%. To assess how likely one possibility is relative to all the others, we divide the number of possibilities that match our desired outcome by the total number of possible outcomes.

This chapter also demonstrated how quickly sampling can reduce uncertainty. After nine samples, each additional sample reduces the uncertainty of any one interval by less than 1%. This is often a surprise; it’s assumed that hundreds or thousands of samples are required for any significance. For many decisions, a few reliably taken samples may prove better than gut instinct or an unintentional cognitive bias.

With sampling you can never be sure, but you can’t be any surer without sampling.

Key points and tips discussed in this chapter:

  • Uncertainty is limited knowledge about how a future event may play out.
  • Probability is the number of possibilities that match what we need divided by the total number of all possibilities.
  • Just because you haven’t observed it yet doesn’t mean you can get away without counting it as a possibility.
  • Nine to eleven samples give a good indication of the total likely range when sequential uniform numbers are involved.
  • Although black swans exist, there are a lot more white ones.
  • We should read Nassim Taleb’s work on the limits of traditional probability and statistics in the face of massive impact, low probability, unforeseeable events.

[1] In statistics, a percentage isn’t represented as 0% to 100%; it’s represented as 0 to 1 by the coolest kids. Expect to see me fall into 0-to-1 ranges later in this book, and expect most software that computes percentages to want a number from 0 to 1.

[2] All outcomes are uncertain, even the sure things and the no-hopers. We just express our understanding of the probabilities.

[3] According to the Guinness Book of World Records, the greatest fully authenticated age to which any human has ever lived is 122 years 164 days, by Jeanne Louise Calment (France). Born on 21 February 1875 to Nicolas (1837–1931) and Marguerite (née Gilles, 1838–1924), Jeanne died at a nursing home in Arles, southern France on 4 August 1997.
