Permutation Testing and Experimentation

Charles Copley
The Patient Experience Studio at Cedar
4 min readSep 21, 2020

Monte Carlo Estimation Part II

Photo by Jean-Louis Paulin on Unsplash

At the heart of any Monte Carlo technique, is randomization. We have already learned about the bootstrap technique for estimating a statistical parameter. This can be used to compare two distributions (using a t-test for example), but there is another Monte-Carlo technique used to determine if two populations are different, know as permutation testing.

To illustrate this, imagine we’re in a casino and we’ve been on a long losing streak. We want to determine if the dealer is shuffling the cards randomly, or is actually (gasp!) cheating!

Photo by Jack Hamilton on Unsplash

As in Part I, it is always useful to simulate data so that you understand what you are testing. Below we generate a deck of virtual cards to play around with:

def generate_pack():
suits = ['hearts','spades','clubs','diamonds']
values_str = list(map(str, range(2, 11)))
cards = ['ace'] + values_str + ['jack','queen','king']
pack = []
card_id = 0
for suit in suits:
card_val = 1
for card in cards:
card = {"card_id":card_id, "card_val":card_val, "card_suit" : suit, "card_name" : card}
card_id +=1
card_val += 1
pack = pd.DataFrame(generate_pack())

If the cards were shuffled correctly, we should be able to take the deck, split it in two, and have similar average card values in the two piles. On the other hand, if the dealer were cheating, we could expect that high value cards were nearer the top of the pack to make them accessible. Let’s simulate the case where the dealer is cheating by keeping all the face cards (the Jack, Queen, and King) at the top of his pack.

pack_1_choices = ['jack','queen','king']pack_1 = pack[pack['card_name'].isin(pack_1_choices)]
pack_2 = pack[~pack['card_id'].isin(pack_1_choices)]

And now let’s see what the mean face value difference is between the two piles (Ace has a value of 1, Jack 11, Queen 12 and King 13. All the rest have the same face value as their number):

ref_value = pack_1['card_val'].mean() - pack_2['card_val'].mean() --- 5.0

If the two piles were shuffled correctly we would expect to have a mean difference of around 0 since both the face values would be distributed evenly.

We’re getting 5.0 which is much higher! In order to determine how different, we now generate the same statistic under the null hypothesis (i.e. there is no discernible difference between the packs). We do this by shuffling the cards and splitting the pack in two again. This two piles should have have a fairly even distribution of face values. We then calculate the statistic and repeat multiple times.

combined = pack_1.append(pack_2)
n_shuffles = 1000
diff = []
for i in range(0,n_shuffles):
combined = combined.sample(frac = 1)
group_a = combined[ / 2)]
group_b = combined[( / 2 )) + 1]
diff.append(group_b['card_val'].mean() - group_a['card_val'].mean())
ax = sns.distplot(diff, bins = 20).set_title('Null Distribution')
plt.xlabel("Mean Difference")

Looking at the above, it seems very unlikely that our cards were shuffled correctly in the first place! We can quantify this by using number of times our reference mean difference value was less than the mean difference values obtained during the shuffling operations- this is effectively the p-value.

p_val = len(np.where(diff >= reference_difference)[0]) / 1000 -- 0.0 

We get p= 0.0 and so we can be pretty sure (as we’d expect) that the shuffling was not done correctly.

This is a very powerful technique. We can use it to determine the significance of the difference of any value, not just the mean! And we can use it for non-normal distributions! Lets do that now and compare two typical invoice collection distributions. In this case we have slightly higher collections in collections_experiment_b than in collections_experiment_a.

collections_experiment_a = pd.DataFrame(np.random.exponential(3,10000) )collections_experiment_b = 0.5 +pd.DataFrame(np.random.exponential(3,10000) )ref_value = bill_size_experiment_b.mean() - bill_size_experiment_a.mean()
ref_value -- 0.479

Now we want to know how confident we are that Experiment B is outperforming Experiment A. We generate the null hypothesis distribution of the mean difference (i.e. the case where there is no difference in means) by shuffling our values and recalculating the mean after each shuffle:

combined = bill_size_experiment_a.append(bill_size_experiment_b)
diff = []
for i in range(0,1000):
combined = combined.sample(frac = 1)
group_a = combined[ / 2)]
group_b = combined[( / 2 )) + 1]
diff.append((group_b.mean() - group_a.mean())[0])

And now we estimate the likelihood that the two distributions are the same.

p_val = diff_pd[diff_pd > ref_value[0]].count().values[0] / len(diff_pd) -- 0

Used together, permutation tests and bootstrap estimates are very powerful. They provide an elegant way of handling real-world situations where the underlying distribution is not normal, and are also surprisingly intuitive.