Bayesian Estimation with pymc3
Let’s say you want to compare some statistic across two populations. The most common way to do this is run a t-test and calculate a p-value. However, p-values are notoriously unintuitive, leading to much confusion on what they are actually telling you. This guide will show you how compare this statistic using Bayesian estimation instead, giving you nice and interpretable results. In order to do this, you will need the pymc3 package.
Let’s say we want to compare two populations from which we took the data ‘vals1’ and ‘vals2’. We will need to assume a prior distribution that we believe models the distribution of the mean of those values. In this example, we will use normal distributions for all relevant stats. The following code sets the prior distributions of the mean equal for our two groups.
import pymc3 as pm
with pm.Model() as model:
group1_mean = pm.Normal('group1_mean', mean_prior_mean, sd=mean_prior_std)
group2_mean = pm.Normal('group2_mean', mean_prior_mean, sd=mean_prior_std)
Also set a prior distribution for our believe about the standard deviation. I will use a normal distribution again with mean and std. dev. of ‘std_prior_mean’ and ‘std_prior_std’.
with pm.Model() as model:
group1_std = pm.Normal('group1_std', std_prior_mean, sd=std_prior_std)
group2_std = pm.Normal('group2_std', std_prior_mean, sd=std_prior_std)
Now that we have defined our prior distributions for the mean and standard deviation, we can combine these into a single distribution for each group. Here is where we differentiate the two groups for the first time. ‘vals1’ and ‘vals2’ are lists with the relevant data for our two populations.
group1 = pm.Normal('group1', mu=group1_mean, sd=group1_std, observed=vals1)
group2 = pm.Normal('group2', mu=group2_mean, sd=group2_std, observed=vals2)
We can also define the differences of means and standard deviations which will give us single numbers and confidence intervals later for easy comparison.
diff_of_means = pm.Deterministic('difference of means', group1_mean - group2_mean)
diff_of_stds = pm.Deterministic('difference of stds', group1_std - group2_std)
Now we just have to use the sample method to update our prior distributions with the relevant data. The pymc3 package will use the No U-turn Sampler, a Markov-chain Monte Carlo method, to estimate our posterior distributions. The more samples you take, the closer the generated distributions will be to the actual posterior distributions.
trace = pm.sample(25000, njobs=1)
We can use the traceplot function to generate plots of all our posterior distributions, ‘group1_mean’, ‘group2_mean’, ‘group1_std’, ‘group2_std’, ‘diff_of_means’, and ‘diff_of_stds’.
Personally, I like using the forestplot function. This will plot the credible intervals of our means and standard deviations on the same plot, making it very easy to see the difference as well as the overlap between the two. This will also display the r-hat statistic which tells you how closely your values have converged. The closer r-hat is to 1, the better convergence you have to the true posterior distributions. Note that you may wish to discard a chunk of data from the beginning from when the NUTS sampler was still converging. In this example, I discarded the first 3000 values.
pm.forestplot(trace[3000:], varnames=[v.name for v in model.vars])