From Groups to Individuals: Permutation Testing

Julien Hernandez Lallement
Published in The Startup
Aug 9, 2020

Approaching group data (between-group)

A typical data-driven approach to an experimental question involves several groups that differ in (hopefully) one, sometimes more, variables.

Say you collect data on people who either ate (Group 1) or did not eat chocolate (Group 2). Because you know the literature very well, and you are an expert in your field, you believe that people who ate chocolate are more likely to ride camels than people who did not.

You now want to prove that empirically.

I will be generating simulation data using Python, to demonstrate how permutation testing can be a great tool to detect within-group variations that could reveal peculiar patterns in some individuals. If your two groups are statistically different, then you might explore what underlying parameters could account for this difference. If your two groups are not different, you might want to explore whether some data points still behave “weirdly”, to decide whether to keep collecting data or to drop the topic.

# Load standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Now one typical approach in this (a bit crazy) experimental situation would be to look at the difference in camel riding propensity in each group. You could compute the proportions of camel riding actions, or the time spent on a camel, or any other dependent variable that might capture the effect you believe to be true.

Generating data

Let’s generate the distribution of the chocolate group:

# Set seed for replicability
np.random.seed(42)
# Set Mean, SD and sample size
mean = 10; sd=1; sample_size=1000
# Generate distribution according to parameters
chocolate_distibution = np.random.normal(loc=mean, scale=sd, size=sample_size)
# Show data
plt.hist(chocolate_distibution)
plt.xlabel("Time spent on a camel (min)")
plt.title("Chocolate Group")
Figure 1 | Histogram depicting the number of people that rode the camel in the chocolate group, per minute bin.

As you can see, I created a distribution centered around 10 min. Now let’s create the second distribution, which could be the control, centered at 9 min.

mean = 9; sd=1; sample_size=1000
non_chocolate_distibution = np.random.normal(loc=mean, scale=sd, size=sample_size)
fig = plt.figure()
plt.hist(non_chocolate_distibution)
plt.xlabel("Time spent on a camel (min)")
plt.title("Non Chocolate Group")
Figure 2 | Histogram depicting the number of people that rode the camel in the Non Chocolate group, split per minute bin.

OK! So now we have our two simulated distributions, and we made sure that they differ in their means. With the sample size we used, we can be quite confident the two populations would differ significantly, but let’s make sure of that. First, a quick visualization:
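The code producing that figure is not included in the post; one minimal way to overlay the two histograms (my reconstruction, using the same parameters as above) is:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import numpy as np
import matplotlib.pyplot as plt

# Regenerate the two distributions so the snippet runs standalone
np.random.seed(42)
chocolate_distibution = np.random.normal(loc=10, scale=1, size=1000)
non_chocolate_distibution = np.random.normal(loc=9, scale=1, size=1000)

# Overlay both histograms with transparency so each stays visible
plt.hist(chocolate_distibution, alpha=0.5, label="Chocolate")
plt.hist(non_chocolate_distibution, alpha=0.5, label="Non Chocolate")
plt.xlabel("Time spent on a camel (min)")
plt.ylabel("Number of participants")
plt.legend()
```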

Figure 3 | Both chocolate and non chocolate distributions seen together.

We can use an independent-sample t-test to get an idea of how different these distributions might be. Note that since the distributions are normally distributed (you can test that with a Shapiro-Wilk or Kolmogorov-Smirnov test) and the sample size is very large, parametric testing (under which the t-test falls) is permitted. We should run Levene’s test as well to check the homogeneity of variances, but for the sake of argument, let’s move on.
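For reference, here is how those checks could be run with SciPy (a quick sketch; it regenerates the same distributions so it runs standalone):

```python
import numpy as np
from scipy import stats

np.random.seed(42)
chocolate_distibution = np.random.normal(loc=10, scale=1, size=1000)
non_chocolate_distibution = np.random.normal(loc=9, scale=1, size=1000)

# Shapiro-Wilk normality test on each group (p > .05: no evidence against normality)
_, p_choc = stats.shapiro(chocolate_distibution)
_, p_non = stats.shapiro(non_chocolate_distibution)

# Levene's test for homogeneity of variances between the two groups
_, p_levene = stats.levene(chocolate_distibution, non_chocolate_distibution)
print(p_choc, p_non, p_levene)
```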

from scipy import stats
t, p = stats.ttest_ind(a=chocolate_distibution, b=non_chocolate_distibution, axis=0, equal_var=True)
print('t-value = ' + str(t))
print('p-value = ' + str(p))
Output of an independent-sample t-test between the two distributions.

Good, that worked as expected. Note that given the sample size, you are able to detect even very small effects, such as this one (the distributions’ means differ by only 1 min).

If these were real distributions, one would have some evidence that chocolate affects the time spent riding a camel (and should of course dig a bit deeper to see what could explain that strange effect…).

I should note though that, at some point, this kind of statistics becomes dangerous because of the large sample size, which yields extremely small p-values even for weak effects. I discuss this issue a bit in this post. Anyway, this post is about approaching individual data, so let’s move on.

Approaching individual data (within-group)

Now let’s assume that, for each of these participants, you recorded multiple choices (Yes or No) to ride a camel (you probably want to do this a few times per participant to get reliable data). Thus, you have repeated measures, at different time points. You know that your groups are significantly different, but what about `within group` variance? And what about an alternative scenario where your groups don’t differ, but you know some individuals showed very particular behavior? The permutation method can be used in both cases, but let’s use the scenario generated above, where groups are significantly different.

What you might observe is that, while at the group level you do see an increased tendency to ride camels after your manipulation (e.g., giving sweet sweet chocolate to your subjects), within the chocolate group some people have a very high tendency while others are actually no different from the No Chocolate group. Vice versa, maybe within the non chocolate group, while the majority did not show an increase in the variable, some did (but that effect is diluted by the group’s tendency).

One way to test that would be to use a permutation test, testing each participant against their own choice pattern.

Data background

Since we are talking about choices, we are looking at a binomial distribution, where say 1 = Decision to ride a camel and 0 = Decision not to ride a camel.
Let’s generate such a distribution for a given participant that would make 100 decisions:

Below is an example where I generate the data for one person and bias it so that I get a higher number of ones than zeros (that would be the kind of behavior expected from a participant in the chocolate group).

distr = np.random.binomial(1, 0.7, size=100)
print(distr)
# Plot the counts of each value
pd.Series(distr).plot(kind='hist')
Figure 4 | Count of each binary value generated from the binomial distribution. Since we biased the draw, we obtain a higher number of ones than zeros.

We can clearly see that we have more ones than zeros, as intended.

Let’s generate such choice patterns for different participants in each group.

Generating choice data for all participants

Let’s say that each group will be composed of 20 participants that made 100 choices.

In an experimental setting, we should probably have measured each participant’s initial preference for camel riding (maybe some people, for some reason, like it more than others, and that should be controlled for). That measure can be used as a baseline, to account for initial differences in camel riding between participants (which, if not measured, could explain differences later on).

We will thus generate a baseline phase (before giving them chocolate) and an experimental phase (after giving chocolate to the chocolate group, and a neutral control substance to the non chocolate group).

A few points:
1) I will generate biased distributions that follow the pattern found before, i.e., that people that ate chocolate are more likely to ride a camel.
2) I will produce baseline choice levels similar between the two groups, to make the between group comparison statistically valid. That is important and should be checked before you run more tests, since your groups should be as comparable as possible.
3) I will include in each of these groups a few participants that behave according to the pattern in the other group, so that we can use a permutation method to detect these guys.

Below is the function I wrote to generate this simulation data.

def generate_simulation_data(nParticipants, nChoicesBase, nChoicesExp, binomial_ratio):
    """
    Generates a simulated choice distribution based on parameters
    Function uses a binomial distribution as basis
    params: (int) nParticipants, number of participants for which we need data
    params: (int) nChoicesBase, number of choices made in the baseline period
    params: (int) nChoicesExp, number of choices made in the experimental period
    params: (list) binomial_ratio, ratio of 1s & 0s in the resulting binomial distribution. Best is to propose a list of several values to obtain variability.
    """
    # Pre-allocate
    group = pd.DataFrame()
    # Loop over participants. For each, draw a binomial choice distribution
    for i in range(0, nParticipants):
        # Choices to ride a camel before drinking, same for both groups (p=0.4)
        choices_before = np.random.binomial(1, 0.4, size=nChoicesBase)
        # Choices to ride a camel after drinking, different per group (p drawn from binomial_ratio)
        choices_after = np.random.binomial(1, np.random.choice(binomial_ratio, replace=True), size=nChoicesExp)
        # Concatenate baseline and experimental choices
        choices = np.concatenate([choices_before, choices_after])
        # Store in dataframe (one column per participant, transposed on return)
        group[i] = choices
    return group.T

Let’s generate choice data for the chocolate group, with the parameters we defined earlier. I use binomial ratios starting at 0.5 to create a few indifferent individuals within this group. I also use ratios > 0.5 since this group should still contain individuals with high preference levels.

chocolate_grp = generate_simulation_data(nParticipants=20, nChoicesBase=20, nChoicesExp=100, binomial_ratio=[0.5,0.6,0.7,0.8,0.9])
Screenshot of the generated data for the chocolate group

As we can see, we generated binary choice data for 20 participants, each making 120 choices (20 at baseline + 100 in the experimental phase). The screenshot shows part of these choices for some participants (row index).

We can now quickly plot the summed choices for riding a camel for each of these participants to verify that indeed, we have a few indifferent ones (data points around 50), but most of them have a preference, more or less pronounced, to ride a camel.

def plot_group_hist(data, title):
    data.sum(axis=1).plot(kind='hist')
    plt.ylabel("Number of participants")
    plt.xlabel("Repeated choices to ride a camel")
    plt.title(title)

plot_group_hist(chocolate_grp, title='Chocolate group')
Figure 5 | Histogram showing the number of participants falling in each bin. Bins represent the number of decisions made to ride a camel.

Instead of simply summing up the choices to ride a camel, let’s compute a single value per participant that reflects their preference for, or aversion to, camel riding.

I will be using the following equation, which computes a score between [-1; +1]:

score = (mean_experimental - mean_baseline) / (mean_experimental + mean_baseline)

where mean_baseline and mean_experimental are a participant’s mean choice rates in the baseline and experimental phases. A score of +1 reflects a complete switch to camel riding preference after drinking, and vice versa. This is equivalent to other normalizations (or standardizations) that you can find in SciKit Learn, for instance.

Now, let’s use that equation to compute, for each participant, a score that would inform on the propensity to ride a camel. I use the function depicted below.

def indiv_score(data):
    """
    Calculate a normalized score for each participant
    Baseline phase is taken as the first 20 decisions
    Trials 21 to 60 are used as actual experimental choices
    """
    # Baseline is the first 20 choices, experimental is from choice 21 onwards
    score = ((data.loc[20:60].mean() - data.loc[0:19].mean())
             / (data.loc[20:60].mean() + data.loc[0:19].mean())
             )
    return score

def compute_indiv_score(data):
    """
    Compute score for all individuals in the dataset
    """
    # Pre-allocate
    score = pd.DataFrame(columns=['score'])
    # Loop over individuals to calculate a score for each one
    for i in range(0, len(data)):
        # Calculate score
        curr_score = indiv_score(data.loc[i, :])
        # Store score
        score.loc[i, 'score'] = curr_score
    return score

score_chocolate = compute_indiv_score(chocolate_grp)
score_chocolate.plot(kind='hist')
Figure 6 | Number of participants that fall in each score bin in the chocolate group.

We can interpret these scores as suggesting that some individuals showed a >50% higher preference to ride a camel after drinking chocolate, while the majority showed an increase in preference of approximately 20 to 40%. Note how a few individuals, although belonging to this group, show an almost opposite pattern.
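As a quick, made-up sanity check of that score (the proportions below are my own illustration, not data from the post):

```python
import numpy as np
import pandas as pd

# Hypothetical participant: ~40% camel rides at baseline (20 choices),
# ~80% in the experimental phase (40 choices)
np.random.seed(1)
data = pd.Series(np.concatenate([np.random.binomial(1, 0.4, 20),
                                 np.random.binomial(1, 0.8, 40)]))

# Same computation as indiv_score above
score = ((data.loc[20:60].mean() - data.loc[0:19].mean())
         / (data.loc[20:60].mean() + data.loc[0:19].mean()))
# The expected value is (0.8 - 0.4) / (0.8 + 0.4) ≈ 0.33;
# the sampled value fluctuates around it
print(round(score, 2))
```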

Now let’s generate and look at data for the control, non chocolate group
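The call that creates non_chocolate_grp is not shown in the post; assuming the same generator with lower preference ratios, it would look like this (the binomial_ratio values are my guess, not the author’s; the function is repeated here only so the snippet runs standalone):

```python
import numpy as np
import pandas as pd

def generate_simulation_data(nParticipants, nChoicesBase, nChoicesExp, binomial_ratio):
    # Same generator as defined above, repeated so the snippet is standalone
    group = pd.DataFrame()
    for i in range(0, nParticipants):
        choices_before = np.random.binomial(1, 0.4, size=nChoicesBase)
        choices_after = np.random.binomial(
            1, np.random.choice(binomial_ratio, replace=True), size=nChoicesExp)
        group[i] = np.concatenate([choices_before, choices_after])
    return group.T

# Assumed ratios: mostly at or below indifference, with a couple of higher
# values so a few control participants behave like the chocolate group
non_chocolate_grp = generate_simulation_data(
    nParticipants=20, nChoicesBase=20, nChoicesExp=100,
    binomial_ratio=[0.1, 0.2, 0.3, 0.4, 0.5])
```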

plot_group_hist(non_chocolate_grp, title='Non chocolate group')
Figure 7 | Number of participants falling in each bin of summed choices to ride a camel, in the non chocolate group.

We can already see that the numbers of choices to ride a camel are quite low compared to the chocolate group.

OK! Now we have our participants. Let’s run a permutation test to detect which participants significantly preferred riding a camel in each group. Based on the between-group statistics, we expect that number to be higher in the chocolate group than in the non chocolate group.

Permutation test

A permutation test consists of shuffling the data, within each participant, to create new distributions that are virtual, but possible given the data. That operation is performed many times to generate a null distribution, against which the actual data is then compared.

In our case, we will shuffle the data of each participant between the initial measurement (baseline likelihood to ride a camel) and the post measurement phase (same measure after drinking, in each group).
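In a nutshell, for a single participant the procedure looks like this (a minimal sketch with made-up numbers; the full function below additionally loops over participants and builds a results table):

```python
import numpy as np

rng = np.random.default_rng(42)

# One hypothetical participant: 20 baseline choices (p=0.4),
# 40 experimental choices (p=0.8); values are made up for illustration
choices = np.concatenate([rng.binomial(1, 0.4, 20),
                          rng.binomial(1, 0.8, 40)])

def score(dat):
    # Normalized preference score, as defined earlier
    base, exp = dat[:20].mean(), dat[20:].mean()
    return (exp - base) / (exp + base)

real_score = score(choices)

# Null distribution: shuffle away the baseline/experimental split many times
null_scores = [score(rng.permutation(choices)) for _ in range(1000)]

# Two-sided 95% CI from the null distribution
lower, upper = np.percentile(null_scores, [2.5, 97.5])
significant = (real_score < lower) or (real_score > upper)
```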

The function below runs a permutation test for all participants in a given group.
For each participant, it shuffles the choice data nReps times, calculates a confidence interval (you can define whether you want it one- or two-sided), and checks the location of the real choice data relative to this CI. When the real score falls outside of it, the participant is said to have a significant preference for camel riding.

I provide the function to run the permutation below. It is a bit long, but it does the job ;)

def run_permutation(data, direct='two-sided', nReps=1000, print_output=False):
    """
    Run a permutation test.
    For each permutation, a score is calculated and stored in an array.
    Once all permutations are performed for a given participant, the function computes the real score.
    It then compares the real score with the confidence interval.

    The output is a dataframe containing all important statistical information.
    params: (df) data, dataframe with choice data
    params: (str) direct, default 'two-sided'. Set to 'one-sided' to compute a one-sided confidence interval
    params: (int) nReps, number of iterations
    params: (boolean) print_output, default False. Set to True if feedback to the user is needed
    """
    # Pre-allocate output
    output = pd.DataFrame(columns=['Participant', 'Real_Score', 'Lower_CI', 'Upper_CI', 'Significance'])
    for iParticipant in range(0, data.shape[0]):
        # Pre-allocate the scores of the shuffled data
        scores = pd.Series(dtype='float')
        if print_output:
            print('Participant #' + str(iParticipant))
        output.loc[iParticipant, 'Participant'] = iParticipant
        # Store the initial choice distribution to compute the real score later
        initial_dat = data.loc[iParticipant, :]
        # Start repetition loop
        for iRep in range(0, nReps):
            # Shuffle a copy of the data
            curr_dat = np.random.permutation(initial_dat.values)
            # Calculate score with shuffled data
            scores[iRep] = indiv_score(pd.Series(curr_dat))
        # Sort scores to compute the confidence interval
        scores = scores.sort_values().reset_index(drop=True)
        # Calculate confidence interval bounds, based on the directed hypothesis
        if direct == 'two-sided':
            # Two-sided 95% CI: 2.5th and 97.5th percentiles
            upper = scores.iloc[np.ceil(scores.shape[0]*0.975).astype(int)]
            lower = scores.iloc[np.ceil(scores.shape[0]*0.025).astype(int)]
        elif direct == 'one-sided':
            # One-sided 95% bounds: 5th and 95th percentiles
            upper = scores.iloc[np.ceil(scores.shape[0]*0.95).astype(int)]
            lower = scores.iloc[np.ceil(scores.shape[0]*0.05).astype(int)]
        output.loc[iParticipant, 'Lower_CI'] = lower
        output.loc[iParticipant, 'Upper_CI'] = upper
        if print_output:
            print('CI = [' + str(np.round(lower, decimals=2)) + ' ; ' + str(np.round(upper, decimals=2)) + ']')
        # Calculate real score
        real_score = indiv_score(initial_dat)
        output.loc[iParticipant, 'Real_Score'] = real_score
        if print_output:
            print('Real score = ' + str(np.round(real_score, decimals=2)))
        # Check whether the real score falls outside the CI bounds
        if (real_score < upper) & (real_score > lower):
            output.loc[iParticipant, 'Significance'] = 0
            if print_output:
                print('Not significant')
        elif real_score >= upper:
            output.loc[iParticipant, 'Significance'] = 1
            if print_output:
                print('Significantly above')
        else:
            output.loc[iParticipant, 'Significance'] = -1
            if print_output:
                print('Significantly below')
        if print_output:
            print('')
    return output

Now let’s run the permutation test and look at individual score values:

output_chocolate = run_permutation(chocolate_grp, direct='two-sided', nReps=100, print_output=False)
output_chocolate
output_non_chocolate = run_permutation(non_chocolate_grp, direct='two-sided', nReps=100, print_output=False)
output_non_chocolate

We can see that, as expected from the way we computed the distributions, we have many more participants who significantly increased their camel riding preference after the baseline measurement in the chocolate group.

That is much less frequent in the non chocolate group, where we even have one significant decrease in preference (participant #11).

We can also see something I find quite important: some participants have a high score but no significant preference, while others have a lower score and a significant preference (see participants 0 & 1 in the chocolate group). That is due to the confidence interval, which is calculated from each participant’s own behavior. Therefore, based on the choice patterns, a given score might fall inside the CI and not be significant, while another, maybe lower, score may fall outside that other individual-based CI.

Final words

That was it. Once this analysis is done, you could look at which other, **unrelated** variables might differ between the two groups and potentially explain part of the variance in the statistics. This is an approach I used in this publication, and it turned out to be quite useful :)

I hope that you found this tutorial helpful.
Don’t hesitate to contact me if you have any questions or comments!

Data and notebooks are in this repo: https://github.com/juls-dotcom/permutation
