Bootstrap resampling: how to do, and understand, statistics without knowing any statistics

Statistical methods have a reputation for being difficult: elaborate mathematical methods whose rationale is obscure, yielding answers whose meaning is often equally obscure. This reputation is entirely deserved.

But it doesn’t need to be this way. There are simpler, more transparent approaches which avoid many of the problems. One such family of approaches uses computer simulation to do experiments to see how likely various possibilities are. One member of this family is called the bootstrap, which is the subject of this article. I’ll use a very simple example to explain how it works and how to do it, and then look at its advantages over conventional approaches.

Imagine that these numbers are the scores of 9 people on something important to some research:

74, 65, 57, 78, 54, 47, 38, 34, 93

They might be measures of how satisfied people are with a product, a job, or life in general, or some measure of how well a medical treatment works.

The average (mean) score is 60, but the individual scores range from 34 to 93. We’ll focus on the average but the same method can be used for lots of other things (about which more below).

The question which bootstrapping, and probably most of statistical theory, answers is how certain we can be in extrapolating any conclusions we draw from this data beyond these 9 people. What do we mean by “beyond”? Again there are various possibilities, but for now I’ll assume that the 9 people are part of a much larger group which is the real focus of the research. We aren’t particularly interested in these 9 people; we want to know the overall pattern. The conventional term for the small group we are studying is a sample; the larger group we would like to extrapolate our results to is the population. Can we assume the average is 60 for the whole of the population? Obviously not, because the sample is small and the numbers range from 34 to 93, so chance will mean that the estimated average of 60 might be too high or too low. But how much too high or low?

The idea behind bootstrapping is to do some experiments to see how much the averages of random samples of 9 from this sort of population are likely to vary. The difficulty, of course, is that we’ve only got a sample of 9. We don’t know about the whole population. The trick we use is to assume that this data does give a reasonable impression of what the whole population would look like. To be precise, we assume that one ninth of them (11.1%) are 74s, a ninth are 65s, and so on. This is obviously just a guess, so I’ll call it a guessed population.

Now we just take random samples of 9 from this guessed population. You could do this by writing each of the nine scores on a separate card, shuffling the pack and dealing one, then replacing it so that the pack contains the same nine cards, shuffling and dealing another, and so on nine times. This means that each card dealt is equally likely to be any of the nine scores in the sample of data — which is what we would get if we had a large population with the same pattern as the sample. The first three resamples I generated in this way were:

Resample 1: 93, 34, 54, 93, 65, 47, 78, 54, 93, Average=67.9

Resample 2: 74, 78, 34, 65, 54, 93, 78, 38, 78, Average=65.8

Resample 3: 47, 74, 38, 57, 78, 47, 74, 47, 34, Average=55.1

Notice here that, for example, 93 appears three times in the first resample, 54 appears twice and several numbers don’t appear at all, so the resample is not the same as the sample. These three resamples do vary quite a lot: from an average of just over 55 to almost 68. Clearly one sample of 9 is not a reliable guide.
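
If you would rather let a computer do the dealing, here is a minimal sketch of the same procedure in Python (the code and variable names are just my illustration, not part of any particular tool): each draw picks one of the nine scores at random, with replacement.

```python
import random

sample = [74, 65, 57, 78, 54, 47, 38, 34, 93]

# Deal nine "cards" with replacement: each draw is equally likely
# to be any of the nine scores in the original sample.
resample = random.choices(sample, k=len(sample))
print(resample, "average =", round(sum(resample) / len(resample), 1))
```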

But this is only three resamples. When I took 1000 resamples like this the pattern looked like the histogram at the top of this article. There were, for example, 128 resamples which fell into the bar labelled 60 — this is all the averages from 59 to 61 which would be 60 to the nearest even number. The first resample above with an average of 67.9 would be one of the 66 resamples in the bar labelled 68. The lowest resample average was 42 (just one in this bar 18 units below the overall average of 60) and the highest was 78 (three resamples in this bar 18 units above the overall average).
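
Repeating this 1000 times is a short loop. Here is a sketch which also bins each resample average to the nearest even number, mirroring the bars of the histogram; because the resamples are random, the counts will vary slightly from run to run and won’t exactly match the ones quoted here.

```python
import random
import statistics
from collections import Counter

sample = [74, 65, 57, 78, 54, 47, 38, 34, 93]

# 1000 resamples of 9, drawn with replacement, keeping the average of each
resample_means = [
    statistics.mean(random.choices(sample, k=len(sample)))
    for _ in range(1000)
]

# Bin each average to the nearest even number, like the bars of the histogram
bars = Counter(2 * round(m / 2) for m in resample_means)
for value in sorted(bars):
    print(value, bars[value])
```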

This suggests that the chance of a sample of 9 getting the population average correct to the nearest even number is about 13% (128/1000), and the chance of the sample being correct to within 18 units is almost 100%.
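
Both of these figures come straight from the list of 1000 resample averages; continuing the sketch above:

```python
# Proportion of resample averages within 1 unit of 60 (the bar labelled 60),
# and proportion within 18 units of 60
within_1 = sum(59 <= m <= 61 for m in resample_means) / len(resample_means)
within_18 = sum(42 <= m <= 78 for m in resample_means) / len(resample_means)
print(f"to nearest even number: {within_1:.0%}, within 18 units: {within_18:.0%}")
```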

Statisticians like 95% (not sure why). The commonest inference from a diagram like this is a 95% confidence interval. In this case we can be 95% confident that the overall average is between about 48 and 72 because 2.5% of the resamples are below 48 and 2.5% are above 72, leaving 95% in the middle. This can easily be worked out from the list of all 1000 resample averages, and should be roughly obvious from the diagram. This is a conventional way of assessing the accuracy of samples. Confidence intervals are often wider than you might expect, indicating that estimates from samples may be less reliable than you might assume.
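
In code, this is just a matter of sorting the 1000 resample averages and reading off the values 2.5% of the way in from each end; a sketch, continuing from the loop above:

```python
resample_means.sort()
lower = resample_means[25]    # roughly the 2.5th percentile of 1000 values
upper = resample_means[974]   # roughly the 97.5th percentile
print(f"95% confidence interval: {lower:.1f} to {upper:.1f}")
```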

Another conclusion we might draw is that the probability of the overall mean being more than 50 is about 95% because 952 of the 1000 resamples had an average of more than 50. (Strangely, this is not a question to which conventional statistics provides an answer — see https://arxiv.org/abs/1702.03129v2.)
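
Again, this comes straight from the same list of resample averages:

```python
# Proportion of resample averages above 50
prob_above_50 = sum(m > 50 for m in resample_means) / len(resample_means)
print(f"estimated probability the population average is above 50: {prob_above_50:.1%}")
```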

This method is called bootstrapping because we’ve got an answer about how we can generalise a sample from the sample alone — which feels like it shouldn’t be possible, like pulling yourself up by your own bootstraps.

Obviously I didn’t physically deal cards to get the 1000 averages. The method is only practical with a computer: I used the spreadsheet at http://woodm.myweb.port.ac.uk/SL/resample9bootstrap.xlsx which is intended to be self-explanatory (if in doubt click the Introduction tab at the bottom of the screen).

Another question you might wonder about is what happens if we make the sample larger — say four times as large (36 instead of 9). You’d expect a more accurate answer, but how much more accurate? This experiment is easy to carry out with the spreadsheet: go to the Single resample sheet (tab at the bottom), click on the brown cell on the right, change the menu setting to “Resample size in green cell below”, and set the green cell to 36. Then click on the Lots of resamples tab and you should see that the 95% interval has changed to about 54 to 66. Its width is now half of what it was before.
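
You can try the same experiment in code by changing only the resample size; a sketch reusing the sample and variables from the loop above (the interval should come out at roughly half the width, though the exact endpoints will vary from run to run):

```python
# Same bootstrap, but each resample now contains 36 values instead of 9
resample_means_36 = sorted(
    statistics.mean(random.choices(sample, k=36))
    for _ in range(1000)
)
print(f"95% interval for samples of 36: "
      f"{resample_means_36[25]:.1f} to {resample_means_36[974]:.1f}")
```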

It should now be clear how the bootstrap method works and what the rationale behind it is. Some of the snags should also be fairly obvious. Our guessed population only has 9 distinct numbers (74, 65, etc.) whereas real populations are likely to include most of the numbers in between as well. If the pattern of the resamples is not roughly symmetrical, as it is in the diagram above, the implicit assumption (e.g. in the phrase “within 18 units”) that the probability of overestimating the average is roughly the same as the probability of underestimating it may not be reasonable. And the method simulates samples chosen at random, so if the real sample was not chosen randomly it may not give a reasonable idea of the real population, and the simulation may give misleading results.

There is a formula in statistics textbooks for working out 95% confidence intervals for population averages based on probability theory. The answer this gives for our example is an interval extending from 47.3 to 72.7, which is very close to the bootstrap answer above — despite the crudeness of the method, and the snags pointed out in the previous paragraph. In some ways, you may feel the textbook formula is easier: plug in the numbers and out pops the answer. There’s even an Excel formula for it.
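
For comparison, here is one way to compute a textbook-style interval in code rather than from tables or the Excel formula. This sketch uses SciPy’s Student-t interval; the exact endpoints depend on whether the t distribution or the normal approximation is used, so they may differ slightly from one textbook to another.

```python
import statistics
from scipy import stats

sample = [74, 65, 57, 78, 54, 47, 38, 34, 93]
n = len(sample)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / n ** 0.5   # estimated standard error of the mean

# 95% interval from the t distribution with n - 1 degrees of freedom
low, high = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)
print(f"textbook 95% interval: {low:.1f} to {high:.1f}")
```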

But … understanding the rationale behind the formula, what it’s doing and why it works, is complicated. You need to know which formula to use, how it works and what the answer means (this is tricky even with the Excel formula, which is not at all user-friendly). It’s not just the formula itself but the fact that it involves the standard deviation and the t distribution, conventionally obtained from tables. The standard deviation is a minor technicality, but the t distribution is based on some very advanced mathematics with which very few users of statistics will be familiar. It’s a black box which has to be taken on trust.

And it gets worse for the conventional approach to statistics. The spreadsheet above will analyse a median, a proportion, the difference of two sample means or proportions (for comparing two samples), a correlation coefficient — all in essentially the same way. The bootstrap method can be used for all of these, and lots of other statistics. Conventional theory will do the same, but you need a different formula for each statistic, and often a different set of tables.
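
To see how little changes, here is the median version of the earlier sketch: the resampling loop is identical, and only the statistic calculated on each resample is different.

```python
# Same bootstrap loop as before, but keeping the median of each resample
resample_medians = sorted(
    statistics.median(random.choices(sample, k=len(sample)))
    for _ in range(1000)
)
print(f"95% interval for the median: "
      f"{resample_medians[25]} to {resample_medians[974]}")
```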

The bootstrap approach can be applied to a lot of different statistics. It’s a crude general method for crunching out answers in a very wide range of contexts. You don’t need to know anything about standard deviations, t distributions and all the other formulae and concepts that appear in statistics texts. You can even analyse problems for which no formula has been invented.

Bootstrapping also gives more of a feel for the fact that statistics is about describing randomness. With the spreadsheet you can see each random resample, and so see where the answers come from. You will get a slightly different answer each time you press the recalculate button, which may be frustrating if you want the “right” answer, but is better regarded as a reminder that certainty is not possible in statistics.