The most famous tea party in the history of statistics: The comparison of Frequentism and Bayesianism.

Published in Human Systems Data, Apr 11, 2017

Ronald Fisher, one of the leading pioneers of modern frequentist statistics, once described a tea party. At this party, he met a lady who claimed that she could tell whether the milk was poured first or last when a cup of tea with milk was prepared. Fisher wanted to test whether she was telling the truth and asked her to taste 8 cups of tea, which consisted of 4 cups in which the milk was poured first and 4 cups in which the tea was poured first. Although Fisher used this example to describe the frequentist approach, it can be analyzed with the Bayesian approach as well. Therefore, I will use this example to compare the two approaches. By the way, aren’t you curious to know how the lady did in the test? Let’s hold that question for now.

Let’s first review the frequentist approach and the Bayesian approach. Both are methods of making an inference about a population based on samples drawn from it. Nevertheless, they go about the inference differently. The frequentist approach treats an experiment as one of many possible repetitions of the same experiment. It therefore emphasizes how often the observed data would arise by chance alone, rather than from the treatment implemented by the researcher (i.e., how often would the observed result be the product of chance?). In contrast, the Bayesian approach applies a human-like learning process to express the true state of the world in terms of probabilities (i.e., how likely is it that a hypothesis holds true, or that a result is replicated?).

Returning to the tea party, let’s say that the lady was tested five times (with five cups of tea). The null hypothesis is that she cannot discriminate the taste of the tea, so each answer is a guess that is correct with probability 0.5. Under the null hypothesis, the probability that she passes all five tests is 3.1% (0.5⁵). This probability is also the p-value: the probability of obtaining a result at least as extreme as the observed one if the null hypothesis is true. Note that this does not mean there is a 96.9% chance that she can really discriminate the tea’s taste; the p-value only says that a pure guesser would pass all five tests 3.1% of the time. Because 3.1% is below the conventional threshold of 5% (p < .05), the frequentist approach rejects the null hypothesis and concludes that she can discriminate the taste.
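The arithmetic behind the five-cup test can be checked with a few lines of Python. This is a minimal sketch, assuming each answer is an independent guess with a 0.5 chance of being correct under the null hypothesis; the helper name `p_value_at_least` is made up for illustration.

```python
from math import comb

def p_value_at_least(k: int, n: int, p_guess: float = 0.5) -> float:
    """Probability of k or more correct answers out of n under pure guessing."""
    return sum(
        comb(n, i) * p_guess**i * (1 - p_guess) ** (n - i)
        for i in range(k, n + 1)
    )

# Five cups, all five identified correctly.
p_all_five = p_value_at_least(5, 5)
print(f"p-value for 5 of 5 correct: {p_all_five:.5f}")
print(f"Below the .05 convention? {p_all_five < 0.05}")
```

With the same helper, four correct answers out of five would give a p-value of 0.1875, which is not below the 5% convention.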

However, as Kruschke (2010) pointed out, frequentist statistics have some limitations. First of all, the p-value depends on the intention of the researcher, which makes it unreliable. When do I stop testing, and how many tests are required to establish that the lady can really discriminate the taste of the tea? I came up with five tests, but there is no agreed stopping rule. Some researchers use power analysis to provide an objective stopping criterion, but such power analysis relies on a point estimate (a single assumed value of the effect size), which tells very little about the replication probability. Moreover, Fisher himself ironically showed the public that the frequentist approach is vulnerable to the researcher’s intention. In the late 1950s, he devoted himself to disputing the causal relationship between smoking and lung cancer, a position that was refuted by a Bayesian statistician named Jerome Cornfield. The funny thing is that Fisher was a heavy smoker and was funded by tobacco firms at that time. I am sure that Fisher’s intention is not the same kind of “intention” that Kruschke was talking about. However, this example suggests that conclusions drawn with the frequentist approach can be swayed by a researcher’s intention. Another problem with the frequentist approach is that it does not provide much information about the replication probability of the result. For example, the frequentist approach yields a confidence interval, which is a range of plausible values for a population parameter. However, a confidence interval only gives the range of the parameter, not its most probable value (i.e., which specific value within the range is the most likely parameter value?).

Figure 1. This is Fisher, the heavy smoker…
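The stopping-rule concern raised above can be made concrete with a small sketch. The record used here (five correct answers followed by one miss) is hypothetical and is not part of Fisher's story or Kruschke's paper; the function names are also made up for illustration. The point is that the same record yields different p-values depending on what the researcher intended to do.

```python
from math import comb

P_GUESS = 0.5  # chance of a correct answer under the null hypothesis

def p_fixed_n(correct: int, n: int) -> float:
    """p-value if the plan was 'serve exactly n cups':
    probability of at least `correct` right answers out of n."""
    return sum(
        comb(n, i) * P_GUESS**i * (1 - P_GUESS) ** (n - i)
        for i in range(correct, n + 1)
    )

def p_stop_at_first_miss(correct: int) -> float:
    """p-value if the plan was 'serve cups until the first miss':
    probability of at least `correct` right answers before that miss."""
    return P_GUESS**correct

# Hypothetical record: five correct answers, then one miss (6 cups in total).
fixed = p_fixed_n(correct=5, n=6)
sequential = p_stop_at_first_miss(correct=5)
print(f"Planned 6 cups:          p = {fixed:.4f}")
print(f"Planned to stop at miss: p = {sequential:.4f}")
```

Under these assumptions the fixed-plan p-value is above .05 while the stop-at-first-miss p-value is below it, even though the lady's answers are identical in both cases.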

In contrast, the Bayesian approach can provide the probability of replication, and it does not rely on a p-value, which can be biased by the researcher’s intention. Let’s take a look at Bayes’ rule (see figure 2), which estimates the posterior.
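For reference, Bayes' rule in its standard form, written with the symbols defined in the following paragraph (H for a hypothesis, E for the evidence), reads:

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
```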

The posterior, P(H|E), is the probability that a hypothesis holds after the evidence is taken into account, which can also be read as a probability of replication. The prior, P(H), is the probability assigned to the hypothesis based on the incomplete information a researcher currently has. The likelihood, P(E|H), is the probability of observing the new evidence if the hypothesis is true; it serves the role of correcting the initial belief. Lastly, the probability of the evidence, P(E), is the overall probability of observing the evidence, whichever hypothesis is true. Now, let’s apply Bayes’ rule to the hypothetical situation in which the lady passed all five tests. First, we have to identify the hypotheses. One possible hypothesis is that the lady cannot tell the difference (H1, 50% chance of a correct answer). The other is that the lady has 90% accuracy (H2, 90% chance of a correct answer). At this moment, we do not know how likely each of these two hypotheses is. Therefore, let’s just say that each has a 50% probability of being true (P(H), the current incomplete information). Since the lady passed five tests, the likelihoods under the two hypotheses are 0.03125 and 0.59049 respectively (0.03125 = 0.5⁵, 0.59049 = 0.9⁵). Lastly, the probability of the evidence is the sum, over the hypotheses, of the prior multiplied by the likelihood.

P(E) = P(H1) * P(E|H1) + P(H2) * P(E|H2) = 0.5 * 0.03125 + 0.5 * 0.59049 = 0.31087

The complete formula used in this process is shown in figure 3.

Figure 3. Calculation of the posterior for the hypotheses in the “Tea party” example.
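Plugging the numbers from the text into Bayes' rule gives the posterior of each hypothesis. The following is a worked sketch using the values stated above (equal priors of 0.5, likelihoods of 0.5⁵ and 0.9⁵, and P(E) = 0.31087); the results are rounded to three decimals.

```latex
\begin{aligned}
P(H_1 \mid E) &= \frac{P(H_1)\, P(E \mid H_1)}{P(E)}
               = \frac{0.5 \times 0.03125}{0.31087} \approx 0.050 \\[4pt]
P(H_2 \mid E) &= \frac{P(H_2)\, P(E \mid H_2)}{P(E)}
               = \frac{0.5 \times 0.59049}{0.31087} \approx 0.950
\end{aligned}
```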

As you can see from the process of estimating the posterior, the Bayesian approach can be viewed as updating the initial, incomplete belief by taking new evidence into consideration. For instance, we initially assumed that the probability that the lady could tell the difference and the probability that she couldn’t were each 50%. However, after watching her pass five tests in a row, we revised the probabilities of the two hypotheses. They changed to about 5% (H1) and 95% (H2), because it now seems much less likely that her performance was due to chance.
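The idea of updating can also be followed one cup at a time. The sketch below is an illustration rather than the calculation in the figure: it assumes the two hypotheses from the example (H1 with 50% accuracy, H2 with 90% accuracy), equal starting probabilities, and five correct answers in a row, and it applies Bayes' rule after each cup. The function name `update` is made up for this example.

```python
def update(priors: dict[str, float], likelihoods: dict[str, float]) -> dict[str, float]:
    """One application of Bayes' rule over a discrete set of hypotheses."""
    evidence = sum(priors[h] * likelihoods[h] for h in priors)  # P(E)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}

# Probability of one correct answer under each hypothesis.
accuracy = {"H1": 0.5, "H2": 0.9}
# Starting belief: both hypotheses equally probable.
belief = {"H1": 0.5, "H2": 0.5}

history = []
for cup in range(1, 6):  # five correct answers in a row
    belief = update(belief, accuracy)  # yesterday's posterior is today's prior
    history.append(dict(belief))
    print(f"after cup {cup}: P(H1) = {belief['H1']:.3f}, P(H2) = {belief['H2']:.3f}")
```

Updating cup by cup ends at the same probabilities as applying Bayes' rule once to all five cups, because each posterior simply becomes the prior for the next cup.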

In summary, the Bayesian approach is more robust to the researcher’s intention than the frequentist approach. Moreover, it provides more information about parameters, which allows us to estimate how likely a given result is to be replicated. However, a major problem is that the Bayesian approach is difficult to understand and to use. This may be because many people view it as a specific statistical technique that involves some kind of formula, which is true. However, the Bayesian approach can also be viewed as a general principle for making inferences. That is, the Bayesian approach is about updating your beliefs as you run into new evidence, and reducing the uncertainty about parameters in the process.

Finally, I have to admit that I have been deceiving you about the result of the tea party. Fisher did not actually disclose the result of the test, and it is not even certain that the tea party really took place. However, rumor has it that she was tested with eight cups of tea and identified all of them correctly.

References

Kruschke, J. K. (2010). What to believe: Bayesian methods for data analysis. Trends in Cognitive Sciences, 14(7), 293–300.

https://www.analyticsvidhya.com/blog/2016/06/Bayesian-statistics-beginners-simple-english/

Written by Hansol Rheem