Bayesian analysis based on the MCMC Gibbs algorithm. A comparison to the classical (frequentist) analysis of both the Fixed effect model and the Random effect model — A simulation study in R + WinBUGS
[Under Review]
Introduction
Let’s generate the data set we are going to use
We consider seven (7) single treatment arms for the hypertensive health condition, monitored through systolic blood pressure. Blood pressure is a regularly preferred biomarker (e.g. body temperature is a well-known biomarker for fever) to study the efficacy of potential anti-hypertensive agents aimed at reducing the risk of complications such as stroke: it is inexpensive and feasible to measure. Blood pressure readings are always given in pairs, reported as a fraction with the systolic blood pressure on the numerator and the diastolic blood pressure on the denominator (e.g. 132/88 mmHg). The two blood pressures are correlated, and both influence a patient's risk profile. However, systolic blood pressure is considered more important than diastolic blood pressure in practice.
In this simulation study, we consider data on systolic blood pressure. Patients are measured at entry of study (at baseline) and at the end of study. The end-of-study value is subtracted from the baseline value for each patient, and a mean value is then computed over all patients in each treatment arm. We assume that the mean change from baseline (we simply call it the mean effect throughout) in each treatment arm is either fixed, or randomly drawn from a common distribution that we choose to be a normal distribution (it could be another distribution). As we rely on a potential linear relationship between treatment and patient response, the latter is generated in the following two distinct ways:
1. [Fixed effect ANOVA] The mean effects mu = c(1.4, -7.5, 1.3, -5, 2.5, 4, -.6) are fixed but vary across the 7 treatment arms c("A", "B", "C", "D", "E", "F", "G"). The individual response "ildFE" is generated as shown below.
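The code block itself is not shown here; the following is a minimal sketch consistent with the comments below (the number of patients per arm, n = 50, is an assumption):

library(MASS)       # stated as required below
library(tidyverse)  # stated as required below

set.seed(112020)                                  # seed used throughout
na <- 7                                           # number of treatment arms
n  <- 50                                          # ASSUMED patients per arm
trt <- rep(c("A", "B", "C", "D", "E", "F", "G"), each = n)
mu  <- c(1.4, -7.5, 1.3, -5, 2.5, 4, -.6)         # fixed mean effects
sigma <- 2                                        # residual sd
eps <- rnorm(na * n, 0, sigma)                    # stochastic component
y <- function(mu) rep(mu, each = n) + eps         # expected response + residual
ildFE <- y(mu)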
Variations in the observed responses are described by a systematic component and a stochastic component. The systematic component is the expected response; the stochastic component "eps" (epsilon) is the so-called residual. It measures the difference between the observed response and the expected response of each individual. Response to treatment is measured in our case study as a continuous variable, so we can naturally assume epsilon to be normally distributed with mean zero for simplicity, and some standard deviation sigma = 2.
Comments on the code above: you need the two packages "MASS" and "tidyverse". Set the seed to 112020 so that you and I generate the same eps values. The seed remains the same throughout.
2. [Random effect ANOVA] We consider the mean effects of the treatments to be randomly generated from a normal distribution with a common grand population mean Mu and a common grand population variance Var. For simplicity, we choose Mu to be the mean of the vector set in the fixed effect case, i.e. Mu = mean(mu), and Var to be its variance, i.e. Var = var(mu). We could have chosen arbitrary values for Mu and Var. The individual response "ildRE" is generated as follows: everything above remains unchanged except the defined function y.
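Again, only a sketch under the same assumed layout; the function y is the only piece that changes:

set.seed(112020)
Mu  <- mean(mu)                       # grand population mean
Var <- var(mu)                        # grand population variance
y <- function(Mu, Var) {
  mu.re <- rnorm(na, Mu, sqrt(Var))   # random mean effects, one per arm
  eps   <- rnorm(na * n, 0, sigma)    # residuals, as before
  rep(mu.re, each = n) + eps
}
ildRE <- y(Mu, Var)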
With the same seed set, we will reproduce the same set of values from rnorm(1, Mu, sqrt(Var)) and eps. Since mu ~ Normal(Mu, Var) in this setting, "Mu" and "Var" are called hyperparameters, because they sit one level higher than the parameters mu that they govern.
Unlike in the random effect setting, where the mu's are related, in the fixed effect case the mu_i's are unrelated and independent.
The responses can be summarized graphically using simple plots.
Histograms
par(mfrow = c(1, 2))
hist(ildFE, col = "#228B22") # forest green
hist(ildRE, col = "red")
Boxplots
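The boxplot code is not shown in the original; a sketch of what likely produced the figure:

par(mfrow = c(1, 2))
boxplot(ildFE ~ trt, main = "ildFE")
abline(h = mean(ildFE), col = "purple")   # mean response
boxplot(ildRE ~ trt, main = "ildRE")
abline(h = mean(ildRE), col = "purple")   # mean response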
The purple horizontal line is the mean response.
Let's verify, through the diagnostic plots, whether the linear regression assumptions are satisfied.
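A sketch of the fit and the diagnostic plots (the object name lm.FE reappears later in AIC(lm.FE)):

lm.FE <- lm(ildFE ~ factor(trt))   # one-way fixed effect ANOVA fit
par(mfrow = c(2, 2))
plot(lm.FE)                        # the four standard diagnostic plots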
Fitted or predicted values are estimates of the "expected response" component. The horizontal trend of the first plot is a good indication of a linear relationship between "ildFE" and "trt"; the red line highlights the fairly horizontal trend. The next plot on the right indicates that the residuals are indeed very close to normally distributed. It compares the quantiles of epsilon/sigma (the standardized residuals) with those of the standard normal distribution N(0,1), because (1/sigma)*epsilon ~ N(0,1). The roughly horizontal red trend line in the plot below indicates that the homoscedasticity (constant variance) assumption on epsilon is satisfied. The last plot indicates that there are no extreme values that may unduly influence the results.
Classical analysis
It employs the maximum likelihood (ML) method to estimate parameters. For each parameter in a model, ML provides the estimate under which the observed data are most likely (the value that maximizes the likelihood). The idea of "long run frequency" comes from the fact that, with "large" samples, classical inference tends to be unbiased, providing estimates that are very close to the true values. In contrast, Bayesian inference is exact for any sample size. Restricted maximum likelihood [REML] is another likelihood-based frequentist method. The parameters of interest here are the 7 mean effects for the treatment groups and the residual standard error sigma.
Maximum Likelihood Analysis
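The output discussed next presumably comes from the standard model summary of the fit above:

summary(lm.FE)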
The estimated values of the mean effects relative to the control group A, for treatment arms B to G, are given in the "Estimate" column, whereas the first estimated value of 1.95 in that column is the mean effect estimate for treatment group A itself. To recover the absolute mean effect for B, for instance, one takes the sum -9.6241 + 1.949 = -7.6751. Same for C: -0.5461 + 1.949 = 1.4029, and so on. In fact:
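Here is a small sketch that performs these sums for all arms at once (using the lm.FE object assumed above):

coefs <- coef(lm.FE)                  # intercept (A) + contrasts w.r.t. A
c(coefs[1], coefs[1] + coefs[-1])     # absolute mean effects of A to G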
Compare those with the true values mu = c(1.4, -7.5, 1.3, -5, 2.5, 4, -.6). This shows reasonably consistent results, right?
An estimated value for sigma, the residual standard error, is 1.821. Remember, the true value is 2. Are the means of the 7 treatment groups statistically significantly different? The p-value of the one-way ANOVA F-test tells us yes, at a significance level of .05.
We can also do this kind of analysis with the random effect data:
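Presumably via the same calls as before (the object name lm.RE is an assumption):

lm.RE <- lm(ildRE ~ factor(trt))
summary(lm.RE)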
and this:
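(presumably the one-way ANOVA table:)

anova(lm.RE)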
We do not know the true values of these. But the true value of sigma remains the same, 2, and it is more accurately estimated here as 1.99. The means are still indicated to be statistically significantly different from each other in this setting.
An alternative, probably more appropriate, way to fit the data generated under the random effect setting is through the lmer function in R. "lmer" fits a linear mixed-effects model (LMM) to data, via REML or ML. We fit the simplest mixed effect model with REML:
lmer.RE <- lmer(ildRE ~ 1 + (1 | trt), REML = TRUE)
where the random term (1 | trt) indicates that each level of the factor variable "trt" has its own intercept estimate. You need to load the package "lme4" and specify trt as a factor variable.
The estimate of the grand population mean effect is -6.527 (you can extract this value with mean(coef(lmer.RE)$trt[1][,'(Intercept)'])). The true value is mean(mu) = -0.56. The corresponding standard deviation is 16.197 (extract it with sd(coef(lmer.RE)$trt[1][,'(Intercept)'])). The true value is sqrt(var(mu)) = 4.19. The residual standard error is the same as found with the lm() function, 1.99. The intercept estimates for the seven factor levels are:
And so the absolute mean effects of these factor levels are:
The relative effects can then be derived in the following way:
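A sketch of that derivation from the fitted lmer object:

abs.eff <- coef(lmer.RE)$trt[, "(Intercept)"]   # absolute effects
rel.eff <- abs.eff[-1] - abs.eff[1]             # effects relative to A
rel.eff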
The mean effect in group A is -7.28. The mean effect in group B relative to A is 16.018, which is its absolute effect 8.7375 minus -7.28. The interpretation is similar for the rest.
Bayesian analysis
An overview of how Gibbs sampling works.
For the fixed effect setting, our theta is a vector of 8 parameters: the 7 parameters accounting for the mean effect in each treatment arm, and the parameter for the residual standard error. You can consider the data x to be "ildFE". The probability of the parameters theta given the data x, p(theta|x), is given by Bayes' rule as:
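p(theta | x) = p(x | theta) * p(theta) / p(x)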
It combines information about theta from the likelihood p(x|theta) of the observed data x, and from the prior knowledge p(theta). The quantity p(x), which is independent of theta, acts as a normalizing constant; it often constitutes a stumbling block in deriving p(theta|x) analytically. This is where sampling techniques come in. Markov chain Monte Carlo (MCMC) is a set of algorithms to draw a stream of samples from the posterior distribution, for each parameter. The Metropolis–Hastings algorithm (Hastings, 1970) and the Gibbs sampling algorithm (Geman and Geman, 1984) are widely known. The idea with Gibbs sampling is to sample from conditional posterior distributions, rather than from the full joint posterior distribution. It breaks a big problem into smaller, solvable sub-problems.
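To make the idea concrete, here is a tiny illustrative Gibbs sampler for the simplest case x ~ Normal(mu, sigma2), alternately drawing each parameter from its conditional posterior under vague priors. This is only a sketch of the principle, not what WinBUGS does internally:

gibbs <- function(x, n.iter = 2000) {
  n <- length(x)
  sigma2 <- 1                                    # arbitrary starting value
  out <- matrix(NA, n.iter, 2, dimnames = list(NULL, c("mu", "sigma2")))
  for (t in 1:n.iter) {
    # draw mu | sigma2, x (flat prior on mu)
    mu <- rnorm(1, mean(x), sqrt(sigma2 / n))
    # draw sigma2 | mu, x (Jeffreys prior on sigma2)
    sigma2 <- 1 / rgamma(1, shape = n / 2, rate = sum((x - mu)^2) / 2)
    out[t, ] <- c(mu, sigma2)
  }
  out                                            # e.g. gibbs(rnorm(100, 5, 2))
}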
WinBUGS (Bayesian inference Using Gibbs Sampling, for Windows) is a piece of software built to generate samples based on the Gibbs algorithm. It is an MCMC black box. The software offers a "point and click" interface, which may be tedious for some users. Fortunately, there is a way to use it through add-on packages in R such as R2WinBUGS (Sturtz et al., 2005). Thus, to drive WinBUGS from R, you need to load the package and then specify the following items:
(1) your model: this will be saved, for later use, in your working directory as a txt file with the name “anovaFEM.txt”.
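The fixed effect model file itself does not appear in the original post; the following sketch mirrors the random effect model shown further below, with node names taken from the parameter list in item (4) (the vague priors are assumptions):

sink("anovaFEM.txt")
cat("
model {

# Priors and some derived quantities
for (i in 1:na){
  mu[i] ~ dnorm(0, 0.001)      # independent vague priors for the 7 means
  est.effect[i] <- mu[i]       # absolute effects
}
sigma ~ dunif(0, 10)           # prior for the residual sd
tau <- 1 / (sigma * sigma)     # WinBUGS uses precision = 1/variance

# Likelihood
for (i in 1:N) {
  resp[i] ~ dnorm(er[i], tau)
  er[i] <- mu[trt[i]]
  residuals[i] <- resp[i] - er[i]
}

# Relative effects w.r.t. arm A
est.releff1 <- est.effect[2] - est.effect[1]
est.releff2 <- est.effect[3] - est.effect[1]
est.releff3 <- est.effect[4] - est.effect[1]
est.releff4 <- est.effect[5] - est.effect[1]
est.releff5 <- est.effect[6] - est.effect[1]
est.releff6 <- est.effect[7] - est.effect[1]
}
", fill = TRUE)
sink()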
(2) your data: data4bugs <- list("resp" = ildFE, "trt" = as.numeric(factor(trt)), "N" = length(ildFE), "na" = 7). The variable "trt" is used as a numeric variable in WinBUGS, not a factor. (The constants N and na are used inside the model file, so they must be passed as data as well.)
(3) your arbitrary initial values for the chains: inits <- function(){ list(mu = rnorm(7, mean = 10), sigma = rlnorm(1)) }. This function will be run once per chain, e.g. 3 times for 3 chains. Initial values are specified only for the stochastic parameters explicitly given priors in the model; here these are "mu" and "sigma" only.
(4) the parameters for which WinBUGS should save the posterior draws: params <- c("est.effect", "sigma", "est.releff1", "est.releff2", "est.releff3", "est.releff4", "est.releff5", "est.releff6", "residuals", "er"). The first two entries cover our 8 main parameters of interest: "est.effect" contains the 7 mean effects mentioned above, and "sigma" is the 8th. But we are also interested in the estimates of the relative effects, the residuals and the predicted posterior values.
(5) the MCMC settings: the number of chains nc <- 3; the number of iterations ni <- 1500; the number of burn-in iterations nb <- 300; and the thinning rate nt <- 2. These settings are explained further below.
Then invoke WinBUGS through the function bugs() to perform the analysis, and save the results in the object named "bugs.output".
bugs.output <- bugs(data = data4bugs, inits = inits, parameters = params,
                    model = "anovaFEM.txt", n.thin = nt, n.chains = nc,
                    n.burnin = nb, n.iter = ni, debug = F,
                    bugs.directory = "C:/Users/…/Desktop/WinBUGS14", WINE = NULL)
If you want to see what goes on in WinBUGS while the code is running, and if you want to manually continue some additional analysis/plots within WinBUGS right after the code has run, set debug = T (you will then have to press the "Esc" key to exit the WinBUGS platform when you are done). It may also help you spot red flags in the output more easily. "bugs.directory" indicates where you saved the WinBUGS software; in my case, I saved it on my Desktop.
You can get an overview of all the information you need to make inference by executing names(bugs.output):
This is a list of twenty-four items, which you can access individually with the dollar sign $. Use str() to learn more about how to access them:
To access the draws of a parameter of interest, say "est.effect", you apply the dollar sign $ twice, like this:
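(the original line was not shown; presumably something like:)

bugs.output$sims.list$est.effect   # an 1800 x 7 matrix of posterior draws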
Each of the 3 chains is set to 1500 iterations initially, of which the first 300 are discarded as burn-in. So 1200 * 3 = 3600 total iterations would be available for inference. However, with a thinning rate of 2, only every second iteration is kept, so we divide 3600 by 2 to get the 1800 total iterations that inference is finally based on. So, for instance, each of the 7 parameters in "est.effect" above has 1800 draws. If you increase the thinning rate to 3 or 5, the final number of iterations decreases accordingly.
The object "bugs.output" itself holds summary information for all of its 24 items. This summary information, for each parameter, consists of the commonly known point and interval estimates computed from those long strings of numbers from the joint posterior distribution: the posterior mean, the posterior sd, and the credible interval bounded by the 2.5th and the 97.5th percentiles. The 50th percentile is the median, and it coincides with the mean most of the time, probably because the number of draws is large and the posteriors are roughly symmetric.
While classical analysis estimates a single point through the maximum likelihood function, Bayesian analysis seeks to estimate a whole distribution. Let's see what some of the posterior distributions look like.
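A sketch of the code behind these plots (the choice of parameters shown is an assumption):

par(mfrow = c(1, 2))
hist(bugs.output$sims.list$est.effect[, 2], main = "est.effect[2]", xlab = "")
hist(bugs.output$sims.list$sigma, main = "sigma", xlab = "")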
They look normally distributed. This can explain why their measures of central tendency all coincide: the mean, the median and the mode.
Are these samples representative of the posterior distribution? The initial starting values are probably not from the target posterior distribution, and since draws are successive and dependent (each draw depends on the previous one), a number of preliminary samples are discarded as "burn-in" to prevent the chain of samples from being contaminated by these "unnecessary" draws. They are unrepresentative of the equilibrium distribution of the Markov chain.
So one thing to worry about is: have the chains converged for each of the parameters? We assess this graphically and with the numerical summary of the Brooks–Gelman–Rubin statistic (Gelman et al., 2004), called Rhat (about 1 at convergence, with 1.1 often taken as an acceptable threshold).
The Rhat summary values are stored in the 8th column of bugs.output$summary. In fact, this piece of code, which(bugs.output$summary[,8] > 1.1), will tell you that there are no worries about convergence here.
Next, let’s inspect plots of the chains, for only two parameters: the mean effect in group B and the residual standard error.
The code used to generate them is:
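(a reconstructed sketch; the original block was not shown)

par(mfrow = c(2, 1))
matplot(bugs.output$sims.array[, , "est.effect[2]"], type = "l", lty = 1,
        xlab = "Iteration", ylab = "est.effect[2]")
matplot(bugs.output$sims.array[, , "sigma"], type = "l", lty = 1,
        xlab = "Iteration", ylab = "sigma")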
The chains look admirably like oscillograms, oscillating around a horizontal level which should be near the mode (the sample value that occurs most often in the 3 chains combined). This suggests that the samples are from the right joint posterior distribution. There is no standard built-in function to calculate the mode in R. However, this function can help:
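One common density-based version (assumed to be close to what was used here):

sample.mode <- function(x) {
  d <- density(x)        # kernel density estimate of the draws
  d$x[which.max(d$y)]    # location of its peak
}
sample.mode(bugs.output$sims.list$sigma)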
We have similar observations for the sigma parameter.
The effective sample size, in column "n.eff", corrects for the degree of auto-correlation within the chains: a smaller effective sample size indicates that the corresponding samples are more correlated. Our samples are pretty much independent. You can check this for sigma, for instance, with acf(bugs.output$sims.list$sigma):
The acf() plot starts at a correlation of one at lag zero (each sample with itself); at lag one the correlation is already close to zero, and it stays near zero at larger lags, because successive samples are essentially uncorrelated.
The deviance information criterion (DIC) in Bayesian analysis is an analog (Spiegelhalter et al., 2002) to Akaike's information criterion (AIC) in classical analysis. It expresses the trade-off between the fit of a model and the variance of (i.e., the uncertainty around) its estimates. A complex model may fit a data set much better than a simple model, but produce less precise estimates. Here, DIC = 1736.7 and AIC(lm.FE) = 1736.565.
The estimated number of parameters (pD = 8.0) is about what we would expect: 7 mean parameters + 1 parameter for sigma, as explained above. It is not always an integer though.
We can also generate diagnostic plots from these posterior values, with the following code:
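(a sketch built from the saved "er" and "residuals" nodes; the exact plots are assumptions)

fit <- bugs.output$mean$er           # posterior mean fitted values
res <- bugs.output$mean$residuals    # posterior mean residuals
par(mfrow = c(1, 2))
plot(fit, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")
qqnorm(res); qqline(res)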
Now that we are convinced, to some extent, that the model is adequate for the ildFE data, we inspect the estimates and compare them with what we put into the data set, as well as with what the frequentist analysis in R tells us.
Summary statistics from the classical and Bayesian analyses are combined in the ggplot below.
You can find the code here. The green spots are the true values. The Bayesian output shows accurate and more precise estimates, except for A. Only one of the (classical) estimates is not significant, as its CI crosses the zero horizontal line.
The Bayesian analog of a 95% confidence interval (CI) is called a 95% credible interval (CRI). It is any region of the posterior containing 95% of the area under the curve. There is more than one such region; one particular CRI is the highest-posterior-density interval (HPDI). In this case study, we consider 95% CRIs bounded below by the 2.5th percentile and above by the 97.5th percentile of the posterior sample of a parameter. The HPDI, in contrast, denotes the limits of the narrowest segment of the posterior distribution containing 95% of its total mass. Generate it in R with the function HPDinterval() from the package "coda". For large sample sizes and vague priors, CIs and CRIs are typically indistinguishable. Another remark about the classical CI: we cannot make any direct probability statement about it. In contrast, the posterior probability in a Bayesian analysis measures our degree of belief about the likely magnitude of a parameter.
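For instance, for sigma (a sketch):

library(coda)
HPDinterval(as.mcmc(bugs.output$sims.list$sigma), prob = 0.95)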
We now perform a similar Bayesian analysis with the random effect data "ildRE".
sink("anovaREM.txt")
cat("
model {

# Priors and some derived quantities
for (i in 1:na){
  mu[i] ~ dnorm(mu.pop, tau.pop)  # priors for the 7 mean effects;
                                  # mu.pop and tau.pop account for the random effects
  est.effect[i] <- mu[i]          # absolute effects
}
mu.pop ~ dnorm(0, 0.001)          # hyperprior for the grand pop mean effect
sigma.pop ~ dunif(0, 10)          # hyperprior for the grand pop sd
tau.pop <- 1 / (sigma.pop * sigma.pop)  # in WinBUGS, spread is specified as
                                        # precision = 1/variance

# Likelihood
for (i in 1:N) {
  resp[i] ~ dnorm(er[i], tau.res)
  er[i] <- 0 + mu[trt[i]]
}
sigma.res ~ dunif(0, 10)          # prior for the residual sd
tau.res <- 1 / (sigma.res * sigma.res)

# Relative effects
for (i in 2:na){
  est.releff[i-1] <- est.effect[i] - est.effect[1]  # our estimates
}
}
", fill = TRUE)
sink()
Straight to the output, we get
and combining comparable results in a ggplot like the one above gives
The code is here. The estimates pretty much coincide, except for the grand mean and the grand sd. Observe that the Bayesian output has point estimates and interval estimates for all the pre-specified parameters, while the classical output does not (e.g. no confidence interval is generated for sigma, sigma.pop, mu.pop).
More inference can be done from all these outputs, but we stop here for now.
Conclusion
From the comparisons, the Bayesian and classical results are virtually numerically identical. This is because a Bayesian analysis based on vague priors is driven more by the likelihood than by the priors.
When deciding between a fixed-effects and a random-effects assumption for a factor, one should weigh the pros and cons. In the fixed effect setting, the effect estimate of each level of the factor is your priority, not the variance among levels. Your model is limited to the included factor levels, so you cannot generalize to other, external factor levels. With the random effect setting, it makes sense to generalize to a larger population and to assess the variation among the factor levels in that population. Further, in RE, the effects are exchangeable (similar, but not identical), because they are generated from a common stochastic process. Random effects estimators are often considered "shrinkage estimators": random factor levels are not estimated independently. Shrinkage seemingly results in better (e.g., more precise) estimates than the ones from a fixed-effects analysis, probably because a random-effects analysis "borrows strength" across levels. Overall, RE improves the accounting for model uncertainty and the efficiency of the estimation process.
Checking convergence in an MCMC analysis is the counterpart of making sure, in a classical analysis, that the global maximum of the likelihood function has been found. The Brooks–Gelman–Rubin (BGR) diagnostic statistic is an ANOVA-type diagnostic that compares within-chain and among-chain variance.