# Bayesian statistics — probability distribution over p-value

“Hey, I am not normal. Please transform me!”, my data said to me, while I was busy hacking the p-value without realizing that the data did not meet the required assumptions. Traditional data analysis requires many assumptions to be met before you can operate on the data. Bayesian modeling comes to the rescue when those assumptions cannot be met. Bayesian computations are not challenged by unbalanced designs, non-normality, sample size or violations of homogeneity of variance in the data. It is arguably the most rational approach for inferring which parameter values are most credible, and cognitive scientists should use it to analyze their empirical data. It is focused on the world: it makes analysis about understanding, not significance testing. It also allows the probability of replication to be computed from the posterior distribution (the probability distribution used for inference), which is difficult with null hypothesis significance testing.

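Let us consider the following data for breast cancer (C) and mammogram (M) results to understand Bayes’ theorem, which is the root of Bayesian analysis.

| | M-True (positive) | M-False (negative) | Total |
|---|---|---|---|
| C-True (cancer) | 11 | 3 | 14 |
| C-False (no cancer) | 99 | 887 | 986 |
| Total | 110 | 890 | 1000 |

Table 1: Breast cancer (C) and mammogram (M) counts per 1000 people.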

From this table, the joint probabilities are easy to read. What is the chance that a person has breast cancer (C-True) and received a negative mammogram (M-False)? 3 in 1000, i.e. 0.3%. What is the chance that a person does not have cancer and received a positive mammogram? 99 in 1000, or 9.9%. What is the chance that a person has breast cancer given that they received a positive mammogram, i.e. P(C-True|M-True)? From Bayes’ theorem, we have

P(C-True|M-True) = P(M-True|C-True) × P(C-True) / P(M-True)

where P(M-True|C-True) = P(M-True & C-True) / P(C-True) = 11/14, P(C-True) = 14/1000 and P(M-True) = 110/1000.

Therefore, P(C-True|M-True) = (11/14) × (14/1000) / (110/1000) = 11/110 = 10%
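The same arithmetic can be checked in R. This is only a minimal sketch: the variable names are illustrative, and the counts are the ones quoted above (per 1000 people).

```r
# Counts per 1000 people, as quoted in the text
n_total    <- 1000
n_cancer   <- 14    # C-True
n_positive <- 110   # M-True
n_both     <- 11    # C-True and M-True

p_c         <- n_cancer / n_total     # P(C-True)
p_m         <- n_positive / n_total   # P(M-True)
p_m_given_c <- n_both / n_cancer      # P(M-True | C-True)

# Bayes' theorem: P(C-True | M-True)
p_c_given_m <- p_m_given_c * p_c / p_m
print(p_c_given_m)
```

Note that the 14s cancel, so the result is simply 11/110: the share of positive mammograms that belong to people who actually have cancer.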

Table 1 shows that the probability that a patient has cancer, even with a positive mammogram, is still rather low (10% in this case, as derived above). This quantity is the positive predictive value: the number of true positives divided by the number of predicted positives. The surprising result is driven by the low overall rate of cancer (14 in 1000, or 1.4%): because so few people have cancer, the 99 false positives among healthy people far outnumber the 11 true positives. Put differently, a positive mammogram does not appear to have a good success rate at predicting cancer (for this data), and the overall rate of cancer is quite low.

Now let us model a simple coin-toss experiment to answer this question: given an outcome (D), what is the probability that the coin is fair (θ = 0.5)? Using Bayes’ theorem, we have: P(θ|D) = P(D|θ) × P(θ) / P(D).

Here, P(θ) is the prior, i.e. the strength of our belief in the fairness of the coin before the toss. It is perfectly okay to believe that the coin can have any degree of fairness between 0 and 1 (an unbiased coin has θ = 0.5).

P(D|θ) is the likelihood of observing our result given a particular value of θ. If we knew that the coin was fair, this would give the probability of observing a given number of heads in a particular number of flips.

P(D) is the evidence. This is the probability of the data as determined by summing (or integrating) across all possible values of θ, weighted by how strongly we believe in those particular values of θ. If we held multiple views of what the fairness of the coin is (but didn’t know for sure), this tells us the probability of seeing a certain sequence of flips across all of those views.

P(θ|D) is the posterior, i.e. our belief about the parameter after observing the data (the number of heads). To model the posterior distribution P(θ|D), we need the likelihood function P(D|θ) and the distribution of prior beliefs P(θ).

In a nutshell: what we think about the world after seeing the data = what we thought about the world before seeing the data × the chance we’d see our data under different assumptions about the world, rescaled by the evidence so that the result is a proper probability distribution. In other words, Posterior = Prior × Likelihood / Evidence.
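To make the four terms concrete, here is a small grid approximation in R. It is only a sketch under assumed inputs: the uniform prior over 101 candidate values of θ and the data (7 heads in 10 flips) are made up for illustration and do not come from the experiment discussed in this post.

```r
# Hypothetical data: 7 heads in 10 flips (illustrative only)
z <- 7
N <- 10

theta <- seq(0, 1, by = 0.01)                    # candidate values of theta
prior <- rep(1 / length(theta), length(theta))  # assumed uniform prior, sums to 1

likelihood <- dbinom(z, size = N, prob = theta) # P(D | theta) for each theta
evidence   <- sum(likelihood * prior)           # P(D): sum over all theta
posterior  <- likelihood * prior / evidence     # P(theta | D)

print(sum(posterior))                  # the posterior sums to 1 over the grid
print(theta[which.max(posterior)])     # most credible theta on the grid
```

Dividing by the evidence is what turns the product of prior and likelihood into a proper probability distribution over θ.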

Let us dive into R now, and plot the probability densities for the prior and posterior beliefs. Suppose you think that a coin is biased: it has a mean (μ) bias of around 0.6 with a standard deviation of 0.1. Some mathematical calculations (not covered here) then give α = 13.8 and β = 9.2, where α and β are the shape parameters of the Beta density function used for the prior. Here α is analogous to the number of heads in the trials and β to the number of tails. Since α > β, the distribution leans toward the right, i.e. toward heads. Suppose you then observe 80 heads (z = 80) in 100 flips (N = 100). This leads to the following models for the prior and posterior (mathematical derivation not included here):

Prior = P(θ|α,β)=P(θ|13.8,9.2)

Posterior = P(θ|z+α,N-z+β)=P(θ|93.8,29.2)
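For readers who want to check these numbers, the shape parameters can be recovered from the stated mean and standard deviation with the standard moment-matching formulas for a Beta distribution. The sketch below assumes the prior is a Beta density; the helper name `beta_shape_from_moments` is made up for this example.

```r
# Convert a mean and standard deviation into Beta shape parameters.
# Only valid when sd^2 < mu * (1 - mu).
beta_shape_from_moments <- function(mu, sd) {
  kappa <- mu * (1 - mu) / sd^2 - 1          # kappa = alpha + beta
  c(alpha = mu * kappa, beta = (1 - mu) * kappa)
}

prior_shape <- beta_shape_from_moments(mu = 0.6, sd = 0.1)
print(prior_shape)   # approximately alpha = 13.8, beta = 9.2

# Conjugate update: add heads to alpha and tails to beta
z <- 80    # observed heads
N <- 100   # total flips
post_shape <- prior_shape + c(z, N - z)
print(post_shape)    # approximately alpha = 93.8, beta = 29.2
```

The update step is just addition, which is why α and β can be read as counts of heads and tails.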

Let’s see how our prior and posterior beliefs are going to look, using R:
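A minimal sketch of such a plot, using base R's `dbeta` and `plot`, could look like this. It is not necessarily the code from the repository linked at the end of this post; the variable names and styling are illustrative.

```r
theta <- seq(0, 1, length.out = 501)

prior     <- dbeta(theta, shape1 = 13.8, shape2 = 9.2)    # belief before the flips
posterior <- dbeta(theta, shape1 = 93.8, shape2 = 29.2)   # after 80 heads in 100 flips

# Draw the taller posterior first so both curves fit inside the y-axis range
plot(theta, posterior, type = "l", col = "red",
     xlab = expression(theta), ylab = "Density",
     main = "Prior and posterior beliefs about the coin")
lines(theta, prior, col = "blue", lty = 2)
legend("topleft", legend = c("Prior", "Posterior"),
       col = c("blue", "red"), lty = c(2, 1))
```

The posterior peak should fall between the prior mean (0.6) and the observed proportion of heads (0.8), and the posterior curve should be narrower than the prior.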

As more and more flips are made and new data is observed, our beliefs get updated. This is the real power of Bayesian inference. Two more cases, with 160 heads in 200 flips and 320 heads in 400 flips, generate the following density curves. As indicated in Figure 3 below, the posterior curve gets sharper and sharper as the number of flips increases.
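The sharpening can also be checked numerically. The sketch below starts from the prior used earlier (α = 13.8, β = 9.2) and computes the posterior mean and standard deviation for the three cases, using the closed-form standard deviation of a Beta distribution. The helper `beta_sd` is defined here for illustration only.

```r
# Standard deviation of a Beta(a, b) distribution
beta_sd <- function(a, b) sqrt(a * b / ((a + b)^2 * (a + b + 1)))

alpha0 <- 13.8
beta0  <- 9.2

flips <- c(100, 200, 400)
heads <- c(80, 160, 320)

post_alpha <- alpha0 + heads
post_beta  <- beta0 + flips - heads

result <- data.frame(
  flips     = flips,
  heads     = heads,
  post_mean = post_alpha / (post_alpha + post_beta),
  post_sd   = beta_sd(post_alpha, post_beta)
)
print(result)
```

The standard deviation shrinks with every additional batch of flips, while the posterior mean moves from the prior's 0.6 toward the observed proportion of 0.8.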

However, Bayesian analysis does have some disadvantages. Firstly, to express your prior beliefs using probability distributions, you need to know the functional characteristics of many probability distributions (Bernoulli, Gamma, Poisson, Binomial, Dirichlet etc.), which is quite taxing if you are not well versed in them. Secondly, drawing from an entire probability distribution is a much more ambitious task than finding a single optimal point, and it takes a lot longer.

Here is all of the code used in this blog post: https://github.com/vipin8169/HSE598/blob/master/Bayesian.R

References:

Kruschke, J. K. (2010). “What to believe: Bayesian methods for data analysis”. Trends in Cognitive Sciences, 14, 293–300.

“An Introduction to Bayesian Inference using R Interfaces to Stan, Part I”. Retrieved from http://mc-stan.org/workshops/useR2016/half1.html