Sampling, an Introduction

troy.magennis
Forecasting using data
17 min read · Jul 7, 2017

Chapter 4

This chapter introduces statistical sampling and shows how it has been used in the real world to answer critical questions. Sampling improves decisions when it’s impossible or expensive to get access to every piece of data, as the real-world examples demonstrate.

Goals of this chapter –

  • Define statistical sampling — what it is, why it is used, and what makes it reliable
  • Look at three real-world uses of statistical sampling answering real questions: How many tanks or taxis are there? How many fish are in the pond? Where should I protect my airplane?

Sampling Defined

“Sampling: 1. (Statistics) the process of selecting a random sample”
(Harper Collins Publishers, 2014)

An old statistics joke goes something like this: “If you don’t believe in random sampling, next time you go to the doctor for a blood test, have her take it all.” Doctors use blood testing to make informed medical decisions. Chemical properties in a small sample of blood taken from one part of the body are likely representative of the entire bloodstream. The alternative is to extract all of the blood in your body and test it all, but you might find that uncomfortable and a little risky just to get your cholesterol checked. If we can get an accurate cholesterol value from a small sample of blood (and we do), why is it hard to believe that sampling is a powerful predictor of other numerical values or ranges? Whether it be liquid or data, well-applied sampling techniques will give a better result than the alternatives, the most common alternative being a flat-out guess. Have your guesses ever been wrong? If you are like me and have perfect intuition in all circumstances, sampling can be justified as a sanity check. Not that we need it; we inherited our genetic immunity to cognitive biases and wishful thinking from our parents. But if you insist, and if sometimes your guesses are wrong, sampling helps you see and correct course earlier.

In a perfect world we would have an accurate measure of every piece of data we need, or an honest answer from every person we ask an opinion question. I’d also be retired and own a small café in Paris. Dreams are important, but we often have to make decisions in the real world without access to all the data, or all the people. We need to make more informed decisions with incomplete and inexact data. This is sampling in a nutshell: we use sampling when it’s too intrusive, costly, or difficult to gather every piece of data needed to understand something or make a future decision. Can we be certain without every piece of data? No, but every sample gives us increasing confidence, and this quickly exceeds the (unknowable) certainty of intuition alone.

We use sampling when it’s too intrusive, costly or difficult to gather every piece of data.

When introducing sampling, the first question to answer is: “How many samples before I’m sure?” The only honest answer with sampling is that you can NEVER be absolutely certain. Luck may mean that the sample you never get is the sample that would have changed your decision. Acknowledge this up-front and then guard against making rash decisions when life or death matters. In any case, the question of “perfect certainty” is a red herring; a more appropriate question is “Will a decision made with some uncertainty reduced be better than a decision based on no real data at all?” and to that the answer is a resounding YES, with the qualification that the sampling be performed reliably (as this book teaches).

The three rules for reliable sampling are –

1. Take samples at random (avoid cherry-picking, on purpose or accidentally).

2. Avoid sampling that may exclude or include some part of the population (censoring).

3. Understand when the results of sampling are likely more correct than other methods (for example, knowing when expert range estimates might be better).

To continue with our blood testing analogy (just a bit longer, I swear), an important part of a reliable sampling process is that the sample of blood is truly taken at random, which is pretty hard to fake in a blood test. Another important aspect is that the liquid or data is truly mixed across the total available population; again, pretty hard to avoid with a blood test. Reliability of sampling comes from ensuring that no bias enters the sampling process and that the samples could, and do, come from any part of the full population of actual possibilities. For example, adding ingredients to a soup and then sampling without thoroughly stirring to combine the ingredients will give a biased sample. If you sample (using a spoon) from where the salt was just dumped in, expect to taste more salt per spoonful than is representative. Mix thoroughly, with heat added to help the salt dissolve, and we should expect a consistent salt content wherever we sample.

How might sampling be biased in the world of forecasting other work? One easy and often-seen biased sample is using completed work item cycle times to forecast how long the remaining work might take. For example, using samples of completed work from the first week of a project both cherry-picks and censors the samples. Incomplete items that take longer than a week haven’t been sampled yet, and it’s unlikely the work in the first week is representative of how the remaining work items will play out, given that everything is new and ramping up. It’s almost certain some work will take longer than one week, so we need to wait longer to see those items occur, and to learn how often they occur. Sampling will work, but we haven’t yet reached a point where an expert judgment of a cycle time range is beaten by the historical samples. Learning how to work out when the balance tips in favor of using data versus an estimate is a key forecasting skill.
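
To make that censoring effect concrete, here is a minimal Python sketch under entirely made-up assumptions (the lognormal cycle times and the one-week cutoff are illustrative, not data from any real project):

```python
import random

random.seed(7)

# Hypothetical cycle times (in days) for 100 work items. The distribution and
# its parameters are assumptions chosen purely for illustration.
cycle_times = [random.lognormvariate(1.5, 0.6) for _ in range(100)]

# Sampling only items that completed within the first week silently censors
# every item that takes longer than 7 days.
first_week_samples = [t for t in cycle_times if t <= 7]

print(f"items visible in week one: {len(first_week_samples)} of {len(cycle_times)}")
print(f"longest visible cycle time: {max(first_week_samples):.1f} days")
print(f"longest actual cycle time:  {max(cycle_times):.1f} days")
```

The first-week sample looks reassuringly short precisely because the long items haven’t finished yet; the data isn’t wrong, it’s censored.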

No matter how well we sample, biases and censoring are always present in some way. The key is admitting they exist, recognizing them, and letting them average out. Even imperfect sampling offers insights that help form a deeper understanding of the processes at work. To illustrate how even imperfect sampling has been fruitful in answering real-world questions, let’s look at some biased historical samples that prove my point! Here are three positive examples, and one negative example with a happy ending.

How many taxis or tanks?

During World War II, statisticians were given the job of estimating how many tanks were being produced. Tanks were entering the battlefield theater, and knowing how many you might face in opposition was important when planning troop and equipment movement. In June 1940, the intelligence estimate (known after official release post-war) showed German forces were thought to have 1,000 tanks at their disposal. The question was: how could that initial estimate of 1,000 be double-checked?

A group of statisticians working for the US Department of Defense took an alternative sampling approach to the estimate, based on part serial numbers observed from captured tanks. The technique is known as the Taxi Formula, a common formula used to teach sampling by computing the number of taxis in London based on their displayed numeric identifiers. A request was made to report back the serial numbers found in key places on abandoned or captured tanks. The story goes that there were only two captured tank samples, but some parts used in higher quantity on the same tank carried unique serial numbers, which boosted the number of samples captured (see the story below). This gave a sample set of serial numbers that could be used to estimate the likely highest serial number produced, and from that, a guess at how many tanks had been built in total.

Extract from: Great Moments in Statistics by Julian Champkin

Tanks, like taxis, carry numbers — dozens of them. They have chassis numbers, engine numbers, gun-barrel numbers. And some of those numbers run in sequence, from 1 to however many tanks have been made.

Tanks, though, are harder to catch than taxis. The allies had captured just two Mark V’s: one in Sicily, one in Russia. Would two be enough to do the taxi trick and work out how many tanks there were in all? The problem was handed to American statisticians.

The chassis numbers did not help. They knew, from other tanks, that chassis were made by five different manufacturers with big breaks in the number sequences. Gearboxes, on the other hand, were numbered in an unbroken sequence from 1 on up. Even so, two is a pretty small sample. Still, the formula gave an answer.

Fortunately, tanks have bogie wheels that support the tracks; the bogie wheels have rubber tyres[1]; the tyres are made on a mould; and each mould bears a number that also gets moulded onto the tyre. Nor do moulds last forever; the analysts asked British tyre manufacturers how many tyres they would expect a mould to make before it is replaced with a new-numbered one. More fortunately still, each Panther tank had eight axles, and each axle had six bogie wheels, making 48 wheels per tank.

They applied the taxi formula, suitably adjusted, to their 96 differently numbered bogie tyres. They came up with an answer; and it agreed very well with their gearbox answer.

(Champkin, 2013)

The taxi formula, shown in Equation 2, is a simple way to estimate the highest number in a serial sequence starting at 1, given a set of observed samples: take the highest sample seen, add that highest value divided by the number of samples, and subtract one.

Equation 2 — The Taxi Formula

How did this sampling formula perform? Pretty well, as Table 2 shows. Consider that they had just two tank samples. Even with that small number of samples, they outperformed the expert estimates by an amount large enough to be significant in planning.

Table 2 — Intelligence estimate versus sampling estimate compared with actual post war records [2] Source: (Wikipedia)

Score one for the statisticians. I’m sure the intelligence estimates were padded by multiple well-meaning people to avoid estimating too low, but it’s a big miss that, used alone, might have changed strategy and put equipment and people in the wrong places. Overestimating isn’t always safe.

How does Equation 2 work? It takes the highest sample you have seen so far and adds a bit more to it. How much more? The average interval size, based on the number of samples you have and the range seen so far. The -1 on the end of Equation 2 correctly computes the number of intervals between the samples, and for larger numbers it may well be more correct, but I find that on low numbers it often causes a low estimate. I leave it off most of the time, or wish I had in hindsight. Throw a six-sided die three times and see how it performs. For example, I rolled a 1, 3 and 4. The original formula with the -1 gives 4.33; without it, 5.33. Decide based on whether being one high or one low increases your risk more. For work-time-based estimates, it’s safer to be a little high, so I leave it off, especially when the sample count is less than 30.

Another story goes that a similar analysis was performed by taking the average of the sequence of sampled numbers and doubling it, or taking the median of the samples and doubling that. For our simple six-sided die rolls of 1, 3 and 4, doubling the average gives 5.3, and doubling the median of 3 gives an exact and correct value of 6. Pretty good for government work, given the simplicity of the technique.
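
To make the arithmetic concrete, here is a minimal Python sketch of the taxi formula and the doubling variants, checked against the dice rolls above (the function names are my own, not standard terminology):

```python
from statistics import mean, median

def taxi_estimate(samples, minus_one=True):
    """Taxi formula: highest sample + highest/sample-count, optionally minus 1."""
    highest = max(samples)
    estimate = highest + highest / len(samples)
    return estimate - 1 if minus_one else estimate

def double_average(samples):
    """Alternative: double the average of the sampled numbers."""
    return 2 * mean(samples)

def double_median(samples):
    """Alternative: double the median of the sampled numbers."""
    return 2 * median(samples)

rolls = [1, 3, 4]
print(taxi_estimate(rolls))                    # 4.33 with the -1 term
print(taxi_estimate(rolls, minus_one=False))   # 5.33 without it
print(double_average(rolls))                   # 5.33
print(double_median(rolls))                    # 6 (the true maximum of a die)
```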

What happens if the range doesn’t start from 1? The original equation can be adjusted by adding the range seen in the samples rather than just the maximum, as shown in Equation 3.

Equation 3 — Estimating the highest likely value when a sequence doesn’t start from one
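
Because I describe the Equation 3 adjustment only in words above, here is one plausible reading of it as a Python sketch; treat it as my interpretation rather than a definitive statement of the formula:

```python
def taxi_estimate_with_offset(samples, minus_one=True):
    """One reading of Equation 3: grow the estimate by the observed range
    rather than the observed maximum, for sequences starting at an unknown offset."""
    lo, hi, k = min(samples), max(samples), len(samples)
    spread = hi - lo
    estimate = hi + spread / k
    return estimate - 1 if minus_one else estimate

# Made-up serials 1042, 1047, 1060: the plain taxi formula would massively
# overestimate, while the range-based version stays near the observed values.
print(taxi_estimate_with_offset([1042, 1047, 1060], minus_one=False))  # 1066.0
```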

Tip: Average multiple methods

Calculate using multiple estimation methods where possible. Take the average of those answers to account for any one method being more wrong than the others.

For the tank estimation problem, averaging the analyst estimate of 1,000 with the gearbox estimate of 169 and the bogie wheel estimate of 175 gives an answer of about 450. This might be a safe blend of human intuition and statistical rigor!
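
As a trivial check of that blend (numbers taken from the paragraph above):

```python
# Averaging the intelligence estimate with the two sampling-based estimates.
estimates = [1000, 169, 175]
print(sum(estimates) / len(estimates))  # 448.0, i.e. roughly 450
```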

How many fish are in the pond?

It’s common to want to know how many fish are in a pond. Sure, you say, but how often does that come up in real life? That depends on how much marine biology features in your career path, but the same logic and techniques can be applied to a variety of problems that involve estimating total numbers.

Draining the pond is the first and easiest solution that comes to mind. But just like blood testing, we want the patient to be alive after we do our work, so killing all the fish is a non-starter. The actual technique used is a little more difficult, but more fun, as we will see. It’s called capture–recapture.

We go to the pond and catch, say, 10 fish with soft nets or old-fashioned fishing tackle using no-harm removable hooks; we need the fish alive for this process to work. These fish are lovingly tagged for identification in some way and then carefully returned to the pond. Now, as any marine biologist knows, you need to go to the local bar/pub and have a drink or two, depending on mood, climate and nationality. We need the returned fish to swim and mix evenly back into the pond with all the other tagged and untagged fish. After sobering up, we return to the pond and catch 10 more fish at random. Let’s say 3 of the captured fish are tagged.

The proportion of tagged fish in the re-captured sample should match the proportion of tagged fish in the whole pond; from that ratio we can estimate the total number of fish in the pond.

Equation 4 — Calculating capture, re-capture ratios

For our example, 3 of the 10 re-captured fish are tagged, and we know there are 10 tagged fish in the pond, so 3/10 = 10/total.

Applying a bit of algebra simplification (ask your kids, or read chapter 6), total = (10 × 10) / 3.

The total fish in pond estimate is 33.3 fish; let’s just say 34 so this doesn’t get creepy.

To understand why this technique works, consider the extreme case: we captured 10 fish, tagged and returned them, and when we re-captured we caught the SAME 10 tagged fish. It would be almost certain that there aren’t any other fish in the pond. The chance of catching the same 10 fish and NO other untagged fish purely by chance is very, very small. It’s much more likely that there are only 10 fish: the 10 you tagged. Now consider the inverse: you captured 10 new fish (all untagged). It’s more likely in this case that there are many, many more fish in the pond. The truth is somewhere in between these extremes.
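
For completeness, the whole estimate fits in a few lines of Python. This is a minimal sketch of the ratio (often called the Lincoln–Petersen estimator) using the fish numbers above:

```python
def capture_recapture_estimate(tagged_total, recapture_size, recaptured_tagged):
    """Estimate total population size from a tag-and-recapture experiment."""
    return tagged_total * recapture_size / recaptured_tagged

# 10 fish tagged and released, 10 re-caught later, 3 of those carried tags.
print(capture_recapture_estimate(10, 10, 3))  # 33.33... fish
```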

This technique is also used in search procedures to estimate coverage rigor. Noticeable items are placed at random within a search area prior to detailed searching; think of them as physical equivalents of tagged fish. During the search, the coverage percentage can be determined by how many of the pre-planted items are located. If 10 were placed, the search can be said to have covered 80% of the area when 8 pre-planted items are found. This is often how the rigor of ground- and ocean-based search and rescue operations is assessed.

We can use similar techniques to estimate how many defects might remain undetected in a piece of software. If we know certain bugs already exist, we can see how many duplicates are reported during testing. Another, more forceful technique seeds documents and code with known defects and confirms how many are found. It can be performed rigorously to estimate just how well a feature or application has been tested and how many defects might remain. For example, 10 defects or coding standard violations can be inserted into a codebase. If 8 of those are found, it’s reasonable to conclude that roughly 80% of all coding standard violations have been found (8 out of 10 planted violations were detected, so we expect about 8 out of 10 of the real violations to have been detected as well). This might not be a test run continuously, but performing this experiment from time to time (a sample in its own right) should indicate how thoroughly code reviews are being carried out. We cover these in Chapter 5.
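
Here is a sketch of the defect-seeding arithmetic; the count of real violations found is an illustrative number of my own, not from any actual review:

```python
def estimated_total_defects(seeded, seeded_found, real_found):
    """Scale up the real defects found by the seeded-defect detection rate."""
    detection_rate = seeded_found / seeded   # e.g. 8 of 10 planted found = 0.8
    return real_found / detection_rate

# If 8 of 10 planted violations were found while 40 real violations were
# reported, roughly 50 real violations likely exist, so about 10 remain.
print(estimated_total_defects(10, 8, 40))  # 50.0
```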

No Brown M&M’s — The Van Halen Ultimatum

I’ve spoken at a few conferences. I’m totally cool if I get free access to the rest of the conference, but I appreciate the little things. A bottle of water and a power outlet to plug in my laptop at the podium is all I ask for. Rock concert artists, though, need much more. Lady Gaga, for instance, asks for a “mannequin with puffy pink hair”[3]. Van Halen had an extensive list of contract clauses about what was to be in their band room. One that stands out among the fifty-one pages, and carries the stiffest of penalties if ignored, is:

“M&Ms (WARNING: ABSOLUTELY NO BROWN ONES)”[4]

This clause sat within the technical detail section of the contract, very close to the sections about technical and safety details for the venue and stage setup. On the face of it, it just seems like any other indulgent, pink-hair-brained requirement. It turns out, if David Lee Roth’s autobiography (Roth, 1998) is to be believed, to be a clever use of the sample–resample process with the purpose of increasing band, crew and fan safety.

Van Halen were growing in popularity and playing venues that weren’t used to such lavish stage shows. It turns out that making the guitar solos hit the required intensely painful tone needed heavier sound equipment than older venues could safely hold. Major overhead lighting was necessary to improve the looks of some band members, and to increase the intensity of the show. Music wasn’t just music anymore; it was an immersive experience. A lot could go wrong with these setups, and Van Halen wanted to ensure that the safety aspects of the contract were read and implemented to the letter, so everyone could go home alive after the show.

When the band entered the band room, if brown M&M’s were present, the chances were high that the promoter hadn’t paid attention to the detail in the contract. It put into question whether they had read any of the safety clauses. Sorting out the brown M&M’s, whilst time consuming, is not costly; ignoring this clause was pure carelessness about detail. Detail that could avoid serious accidents. Brown M&M’s triggered a full safety audit of the venue. Clever.

Ding. Excuse me, there is a hole in my plane

To demonstrate that not all sampling techniques are well applied, here is a story of poor logic. This story also plays out during World War II, and shows how you can be a hero one day and a zero the next.

Not all airplanes are fast and maneuverable. Bombers have a specific job: carry heavy bombs into enemy territory and drop as many of them as possible on a pre-determined spot. They are the trucks of the sky, squarely in the cross hairs of faster aircraft that just have to score a few well-placed bullet hits to make a big bang. I’m not an insurance actuary, but if I were, increasing the life insurance premiums for bomber aircrews would seem prudent.

Once again during World War II, researchers (this time from the Center for Naval Analyses) were given the job of minimizing bomber losses due to enemy fire. They analyzed returning bombers, noted where the bullet holes clustered, and recommended reinforcing those regions. I’m not sure which is stranger: that the initial report was signed off, or that people actually welded more metal onto the specified locations, making the bombers heavier, slower and therefore more susceptible to being struck in flight by projectiles.

It took another researcher, Abraham Wald (Wikipedia, 2015), to suggest that the bombers returning with damage had survived it, while those hit in other locations didn’t return (Wald, 1943). Areas seen without damage are likely a better investment for additional armor if survivability is the goal. An astute and potentially life-saving observation.

A Method of Estimating Plane Vulnerability Based on Damage of Survivors.

Researchers from the Center for Naval Analyses had conducted a study of the damage done to aircraft that had returned from missions, and had recommended that armor be added to the areas that showed the most damage. Wald noted that the study only considered the aircraft that had survived their missions — the bombers that had been shot down were not present for the damage assessment. The holes in the returning aircraft, then, represented areas where a bomber could take damage and still return home safely. Wald proposed that the Navy instead reinforce the areas where the returning aircraft were unscathed, since those were the areas that, if hit, would cause the plane to be lost.

(Wald, 1943)

Abraham Wald noted that the original samples were biased: they were only samples of returning bombers. Those bombers that didn’t make it back didn’t get analyzed. This is a classic example of selection bias. The only (easily) available samples were analyzed as if they were the whole population, and then they were analyzed as if a bullet hole was a bad outcome, rather than recognizing that a bullet hole in that location was survivable. It is a warning that even groups of researchers can fall prey to poor sampling techniques and make poor recommendations based on poor logic. Putting a positive spin on it, though: because the technique used by the original Center for Naval Analyses team was published, Abraham Wald was able to evaluate it. It’s important, whenever we make a decision based on sampling, that we also document how we arrived at the answer, so those smarter than us can critique our work, potentially saving lives by shooting holes in our own flawed process or techniques (too soon for that pun?).

Another important insight this story highlights is that there is information in a lack of data. The lack of bullet holes is an important piece of data in its own right. Selection bias, once identified, can help show what isn’t occurring in the samples as expected. Absence of data, as in this case, is valuable and a decisive piece of the puzzle in its own right. A red flag should go up in your head when you don’t see defects being reported for one feature or from one team. Maybe it is perfect code, or a perfect team like mine; more likely, though, the lack of data is evidence that bad news is to come. Investigate why.

Tip: Look for the missing, not just the available

Absence of expected data is data in its own right.

Summary

I hope that you are now more comfortable getting a blood sample taken, and will avoid a full blood transfusion next time you need your cholesterol checked.

This chapter introduced the basic concepts of statistical sampling through real-world successes and failures. Sampling is a reliable technique when used with care. Later in this book we will show how to carefully apply sampling to our domain of expertise, and delve into the details of estimating how sure you can be with a certain number of samples.

Key points and tips discussed in this chapter

  • Sampling, even with just a few samples, can outperform expert estimates.
  • Averaging multiple methods balances different strengths and weaknesses.
  • Absence of expected data is data; look for the samples you can’t see.
  • Sharing your process helps others improve your decision logic by spotting mistakes in process and thinking.

[1] I know the spellings “tyres” and “mould” might offend some readers. Since this is an extract from a British journal, I respect the original author’s spelling diversity.

[2] In researching this, I found a variety of reported actuals and estimates. I settled on this data based on the consensus and discussion in the Wikipedia article on the German Tank Problem, and because it was the worst result documented. It differs only slightly in dates and numbers from the Royal Statistical Society article, which shows better estimates than the results shown here.

[3] See Lady Gaga’s requests here http://www.mirror.co.uk/3am/weird-celeb-news/lady-gagas-tour-demands-14-page-1588382

[4] See the full Van Halen contract here http://www.thesmokinggun.com/documents/crime/van-halens-legendary-mms-rider
