Statistical Inference in the Bayesian and Frequentist Approaches

William Hu · Published in Analytics Vidhya · Jul 15, 2021

I was reading an article about the importance of label consistency in a machine learning system, and a natural question came to mind: how can we test the consistency of the data? In this article, I designed a simplified scenario to discuss the solution using statistical inference in both the Bayesian and the Frequentist approach.

Problem Description
I have a dataset containing 1,000,000 images in 100 categories, but they do not have labels. I am the only one working on this project, so I decide to hire a data labeling company to do the job for me. A reasonable price for labeling that many images in the United States would be $25,555 (based on Google Data Labeling Service), and I am promised that the correctness of the labeling work will be 99%. That is a lot of money, so after I receive the labelled data, I want to check whether it meets the 99% accuracy. How can I do this?

Data
To test the accuracy of the labeling work, I randomly sampled 500 images from the dataset and manually checked the label of each of them. I found that 12 images were classified incorrectly. What can I infer from this experimental data?

Before we move forward, it is worth mentioning an assumption we made about the experimental data. A sample size of 500 is very small compared with the population size of 1,000,000, so we will treat each sample as an independent and identically distributed (i.i.d.) random variable.

A Frequentist Approach
In the Frequentist framework, we can do hypothesis testing using the data and calculate the p-value. The p-value is the probability of observing a value at least as extreme as the observed one, in the direction of the alternative hypothesis, given that the null hypothesis is true. We will set our hypotheses as follows.

H0 (null hypothesis): the accuracy of the labelling is 99%.
H1 (alternative hypothesis): the accuracy of the labelling is below 99%.

We will set the significance level of the test, the alpha value, to 0.05.

Now, assuming the accuracy of the labelling work is 99%, we need to calculate the p-value. But how? What is the probability of observing 12 or more incorrectly classified images in a random sample of 500, given that the accuracy of the work is 99%?

Well, we can think of sampling a single image from the population as a Bernoulli process. The sample space can be defined as {correct, incorrect}, and the probability of observing an incorrectly classified image is p = 0.01. Then, we can treat the number of incorrectly classified images in a random sample of 500, X, as a random variable, and it follows a binomial distribution. Hence, we have the following equation.
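$$P(X \geq 12) = \sum_{k=12}^{500} \binom{500}{k} (0.01)^k \, (0.99)^{500-k}$$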

This equation can be calculated in R; a minimal sketch using the built-in binomial tail function pbinom is shown below.
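```r
# P(X >= 12) for X ~ Binomial(n = 500, p = 0.01).
# pbinom(11, ..., lower.tail = FALSE) returns P(X > 11), i.e. P(X >= 12).
p_value <- pbinom(11, size = 500, prob = 0.01, lower.tail = FALSE)
p_value  # approximately 0.005208
```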

Our p-value is 0.005208, which means that if our null hypothesis is true, that is, the labelling work has an accuracy of 99%, the probability of observing 12 or more incorrectly classified images in a random sample of size 500 is 0.005208. This p-value is clearly less than our alpha value of 0.05. Thus, we reject the null hypothesis, because the sample test provides enough evidence against it.

The conclusion from the Frequentist approach is that the accuracy of the labelling work is below 99%.

A Bayesian Approach
In the Bayesian paradigm, we also need to set hypotheses, but the setup is slightly different from the Frequentist approach. Bayesian hypotheses are the possible models that could have generated the experimental data, and the probabilities we assign to them before seeing the data are generally referred to as priors. In our case, the priors are the probabilities of all possible accuracy values before we start the experiment. Then, for each accuracy value, we calculate the likelihood of observing the experimental data under that accuracy; in our case, it is the likelihood of observing 12 incorrectly classified images in a random sample of 500. Lastly, using Bayes' theorem, we can calculate the posterior distribution over the accuracy values, that is, the probability of each accuracy given that we observed 12 incorrectly classified images. The accuracy value with the highest posterior probability will be our best guess for the real accuracy given our data.

In Bayesian inference, the prior distribution is one's belief about some quantity before the evidence is taken into account. In our case, we are interested in the real accuracy of the labelling work. Before our sample test, what do we know about the accuracy of the work? Well, we are promised by the company that the accuracy is 99%, but we do not fully trust it. We also know that, given the reputation of the company, it is very unlikely for the labelling work to have an accuracy below 90%. Given this knowledge, I will set our prior distribution as follows. I give the 99% accuracy a prior of 0.5, because that is the guarantee from the company, and I distribute the other 0.5 equally among all the other possible accuracy values. The sum of all the priors should equal 1.
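In R, this prior could be set up as in the sketch below; the grid of candidate accuracies (0.90 to 0.99 in steps of 0.01) is my assumption, since any discrete grid over the plausible range would work.

```r
# Candidate accuracy values: 0.90, 0.91, ..., 0.99 (an assumed grid).
acc <- seq(0.90, 0.99, by = 0.01)

# Prior: 0.5 on the promised 99% accuracy, the remaining 0.5 split
# equally among the other nine candidates, so the priors sum to 1.
prior <- rep(0.5 / (length(acc) - 1), length(acc))
prior[length(acc)] <- 0.5  # acc = 0.99 is the last grid point
```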

Next, we will calculate the likelihood of observing such data under each accuracy value. Again, this process can be viewed as a Bernoulli process. For example, if the accuracy is 0.90, then the number of incorrectly classified images in a random sample of 500 follows a binomial distribution with n = 500 and p = 0.10. An important difference from the Frequentist approach is that the idea of observing data at least as extreme as the observed value is gone in the Bayesian approach. The likelihood is the probability of observing the exact experimental data under each model (accuracy value).
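Concretely, for a candidate accuracy value a, the likelihood of our data is the binomial probability of exactly 12 errors in 500 draws:

$$P(X = 12 \mid a) = \binom{500}{12} \, (1 - a)^{12} \, a^{488}$$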

Below is the likelihood for each accuracy, computed in R by continuing the sketch above.
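```r
# Likelihood of exactly 12 incorrect labels out of 500 under each
# candidate accuracy; the per-image error probability is 1 - accuracy.
likelihood <- dbinom(12, size = 500, prob = 1 - acc)
```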

Now we can calculate the evidence, P(data) = sum(prior * likelihood). After that, we can calculate the posterior of each accuracy using Bayes' formula, as in the sketch below.
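```r
# Evidence: total probability of the data, averaged over the prior.
evidence <- sum(prior * likelihood)

# Posterior for each accuracy, by Bayes' formula.
posterior <- prior * likelihood / evidence
round(data.frame(acc, prior, likelihood, posterior), 4)
```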

From the result, the posterior for acc = 0.99 is only 0.144, while the total posterior for acc < 0.99 is 0.856. Hence, we can conclude that it is very likely that the accuracy of the labelling work is below 99%. This conclusion aligns with our conclusion from the Frequentist approach.

Below is the complete R code for the Bayesian inference, put together in one place. It is a minimal reconstruction that assumes the 0.90–0.99 accuracy grid described above; the exact posterior values depend on that choice of grid.
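```r
# Bayesian inference for the labelling accuracy: a self-contained sketch
# assuming the candidate accuracies 0.90, 0.91, ..., 0.99.
acc <- seq(0.90, 0.99, by = 0.01)

# Prior: 0.5 on the promised 99% accuracy, the remaining 0.5 split
# equally among the other nine candidates.
prior <- rep(0.5 / (length(acc) - 1), length(acc))
prior[length(acc)] <- 0.5

# Likelihood: probability of exactly 12 incorrect labels in 500 samples,
# where the per-image error probability is 1 - accuracy.
likelihood <- dbinom(12, size = 500, prob = 1 - acc)

# Evidence and posterior via Bayes' theorem.
evidence  <- sum(prior * likelihood)
posterior <- prior * likelihood / evidence

# Posterior table: under this grid, acc = 0.98 gets the highest posterior,
# and the posterior for acc = 0.99 lands roughly near the value quoted above.
round(data.frame(acc, prior, likelihood, posterior), 4)
```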

Conclusion
From both the Frequentist approach and the Bayesian approach, we concluded that the real accuracy of the labelling work is below 99%, which is not surprising at all. Intuitively, if the accuracy is 99%, the expected number of incorrectly classified images is 500 * 0.01 = 5, but we observed 12. So we know the accuracy is very likely to be below 99%. In fact, if we take a close look at the posterior table, we can find that the 97% and 98% accuracies have the highest posterior probabilities. The expected numbers of incorrectly classified images under those accuracies are 15 and 10, respectively. We observed 12, which is in between but closer to 10. That is why 98% accuracy has the highest posterior probability.

In this article, we did a simple comparison between Frequentist inference and Bayesian inference. However, there are several questions we are not able to answer at this point. For example, how confident are we about our conclusion? What happens if we just sampled a bad portion of the images, and in fact the general accuracy of the work is 99%? I will address those problems in my next article. Thanks for reading!
