Bootstrap Estimates and Experimentation
Monte Carlo Estimation Part I
Many digital experiments carried out these days can be happily analyzed using frequentist statistical methods, since we expect the data to be normally distributed. However, certain metrics are not so easy. Take, for example, an experiment designed to increase tax collections, or to encourage payment of invoices. If every invoice were either fully paid or not paid at all, this could be analyzed with a binomial proportion test or a chi-squared test. However, a certain fraction of the invoices will be partially paid, making these tests unsuitable. Even if we tried to compare the actual bill distributions, these are likely to be some form of exponential distribution with many small bills and fewer large bills, as shown below.
So what to do? One way to handle this is to use random sampling statistical tools, often referred to as Monte Carlo Simulations. Two that are particularly relevant in assessing the results of an experiment are bootstrapping (which is useful for determining the distribution of a statistical parameter like the mean) and permutation testing (used to determine if there is a difference in a statistic between two distributions, and how likely that difference is to be real as opposed to by chance). In this post we will deal with bootstrapping.
Bootstrapping uses the concept of sampling-with-replacement to generate the distribution of a parameter. To illustrate this, let’s say we want to estimate the average height of twenty flowers. Without access to a computer, we would go about this as follows:
- Measure each flower and record the flower ID and height on a piece of paper
- Calculate the average height of the flowers in your sample and write this down.
And now for the fun and (slightly) magical part…
- Put the pieces of paper in a hat and choose one at random.
- Write down the height of the flower you chose, and put the paper back in the hat.
- Choose another at random (you might choose the same one again!). This is called sampling-with-replacement.
- Repeat until you have 20 picks, matching the original sample size. The bootstrap sample size should always equal the original sample size.
- Calculate the average AGAIN and write it down: that is your second estimate of the average.
If you repeat the above 1000 times, you will end up with 1000 slightly different estimates of the average, and those estimates will:
- Be approximately normally distributed, and
- Give you the standard error of your estimate of the average flower height.
In (very unoptimized but perhaps more readable?!) Python we can do the above as follows:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# generate a few normally distributed heights for demonstration
N_flowers = 20
flower_heights = np.random.normal(5, size=N_flowers)
data_points = flower_heights

number_of_bootstraps = 1000
average_estimate = [np.mean(data_points)]
for j in range(0, number_of_bootstraps):
    # draw a sample of the same size as the original, with replacement
    random_sample = []
    for i in range(0, len(data_points)):
        random_sample.append(np.random.choice(data_points))
    average_estimate.append(np.mean(random_sample))

ax = sns.distplot(average_estimate, bins=100).set_title('Average Flower Heights')
plt.xlabel("Height [cm]")
plt.ylabel("Count")
As a sanity check, it's worth comparing the bootstrapped standard error with the standard error we would expect from frequentist methods. For a normal distribution with standard deviation σ and sample size N, the standard error of the mean (SE) is:
SE = σ / √N
We can compare the above to our bootstrapped estimate by using:
# normal distribution (analytic) estimate
std_error_mean = np.std(flower_heights) / np.sqrt(N_flowers)  # ~0.1983

# bootstrap estimate
bootstrap_estimate = np.std(average_estimate)  # ~0.1977
I hear some of you muttering in the background: big deal! We could have done this without all this random sampling nonsense…
BUT… what if (instead of flowers) we had a sample that wasn't normally distributed? What about bills, which typically follow some sort of exponential distribution?
What is the mean of the bills shown below? More importantly for experimentation (where we are trying to determine whether two means, or some other statistical parameter, differ), what is the standard error of that mean estimate?
bill_size = np.random.pareto(1, 200)
sns.distplot(bill_size).set_title("Bill Size")
plt.xlabel("Bill ($)")
plt.ylabel("Count")
That's a bit trickier, isn't it…
Let's try estimating this using a bootstrap technique, but this time let’s tidy things up a bit and use a function…
import pandas as pd

def get_bootstrap_mean(data_values, n_bootstraps):
    '''
    data_values: pandas DataFrame of observations
    n_bootstraps: number of bootstrap estimates to calculate
    Returns:
        pandas DataFrame of the mean estimates
    '''
    bootstrap_means = []
    for j in range(0, n_bootstraps):
        # resample the full dataset with replacement (same size as the original)
        sample = data_values.sample(frac=1, replace=True).copy()
        bootstrap_means.append(sample.mean())
    return pd.DataFrame(bootstrap_means)

bill_size_experiment_a = pd.DataFrame(np.random.exponential(3, 10000))
average_estimate_a = get_bootstrap_mean(bill_size_experiment_a, 100)

ax = sns.distplot(average_estimate_a, bins=100).set_title('Average Bill Size Experiment Arm A')
plt.xlabel("Bill Amount ($)")
plt.ylabel("Count")
Great! But what does this mean for experimentation? Well, for experiments we would like to convince ourselves that the intervention has actually had an effect. For example, imagine we ran an experiment designed to increase the average repayment. How would we know it had worked? We could compare the bootstrapped mean distributions, as below:
bill_size_experiment_b = 1.1 + pd.DataFrame(np.random.exponential(3, 10000))
average_estimate_b = get_bootstrap_mean(bill_size_experiment_b, 100)
And then compare the two histograms as done below (where A and B are experiment arms A and B respectively):
fig, ax = plt.subplots()
for a in [average_estimate_a, average_estimate_b]:
sns.distplot(a, bins = 20, ax=ax, kde=False)
plt.legend(["A","B"])
plt.title("Comparison of the Bill Bootstrapped Means")
plt.xlabel("Bill Amount ($)")
plt.ylabel("Count")
A great thing is that we can now use a t-test to compare the two distributions of bootstrapped means:
from scipy import stats

t2, p2 = stats.ttest_ind(average_estimate_a, average_estimate_b)
For the above we get p ≈ 0, i.e. if the two samples really did share the same mean, the probability of seeing a difference this large by chance would be essentially zero. If these were representative of two experimental conditions, we could safely conclude that there is a difference between them.
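If you would rather not lean on the t-test at all, the same bootstrap distributions give you a direct read on the difference. A minimal sketch, assuming the two arms were bootstrapped with the same number of replicates; the percentile interval here is my addition, not from the post:

# sketch: percentile interval for the difference in bootstrapped means
diff = average_estimate_b.iloc[:, 0].values - average_estimate_a.iloc[:, 0].values
diff_low, diff_high = np.percentile(diff, [2.5, 97.5])
# if this interval excludes zero, arm B's mean is credibly different from arm A's
print(diff_low, diff_high)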
This is a VERY powerful technique: it can be used to estimate the standard error of any statistical parameter, no matter what the underlying data distribution looks like. Also, the resulting estimates are approximately normally distributed, making them easy to compare using standard frequentist techniques.
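To make the "any statistical parameter" point concrete, here is a sketch of the same resampling applied to the median instead of the mean. The function name is mine, mirroring get_bootstrap_mean above, not something defined in the post.

def get_bootstrap_median(data_values, n_bootstraps):
    '''Same resampling as get_bootstrap_mean, but for the median.'''
    bootstrap_medians = []
    for j in range(0, n_bootstraps):
        sample = data_values.sample(frac=1, replace=True)
        bootstrap_medians.append(sample.median())
    return pd.DataFrame(bootstrap_medians)

median_estimate_a = get_bootstrap_median(bill_size_experiment_a, 100)
print(median_estimate_a.std())  # bootstrap standard error of the median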
In the next post I will describe permutation tests. These are used specifically to determine the likelihood that a parameter differs between two distributions.