# Back to Hypothesis Testing

The Made in Italy Fund started in May. It is up 7%, with the Italian market down 1%. It is a good time to go back to Hypothesis Testing.

We ask ourselves questions and give ourselves answers in response to *thaumazein*: wonder at what there is. Our questions spring from our curiosity. Our answers are grounded on evidence.

As David Hume and then Immanuel Kant made clear, *all* our answers are based on evidence. Everything we know cannot but be *phenomena* that are experienced by us as evidence. Even Kant’s synthetic a priori propositions — like those of geometry and mathematics — are ultimately based on axioms that *we* regard as self-evidently true.

The interpretation of evidence arranged into explanations is what we call Science — knowledge that results from separating true from false. Science is based on observation — evidence that we preserve and comply with. We know that the earth rotates around the sun because we observe it, in the same way that Aristotle and Ptolemy knew that the sun rotated around the earth. We are right and they were wrong but, like us, they were observing and interpreting evidence. So were the ancient Greeks, Egyptians, Indians and Chinese when they concluded that matter consisted of four or five elements. And when the Aztecs killed children to provide the rain god Tlaloc with their tears, their horrid lunacy — widespread in ancient times — was not a fickle mania, but the result of an age-old accumulation of evidence indicating that the practice ‘worked’, and it was therefore worth preserving. So was divination — the interpretation of multifarious *signs* sent by gods to humans.

While we now cringe at human sacrifice and laugh at divination, it is wrong to simply dismiss them as superseded primitiveness. Since our first Why?, humankind’s only way to answer questions is by making sense of evidence. Everything we say is some interpretation of evidence. Everything we say is science.

Contrary to Popper’s view, there is no such thing as non-science. The only possible opposition is between good science and bad science. Bad science derives from a wrongful interpretation of evidence, leading to a wrongful separation of true and false. This in turn comes from neglecting or underestimating the extent to which evidence can be *deceitful*. Phenomena do not come to us in full light. What there is — what we call reality — is not always as it appears. Good science derives from paying due attention to the numerous perils of misperception. Hence the need to look at evidence from all sides and to collect plenty of it, analyse it, reproduce it, probe it, disseminate it and — crucially — challenge it, i.e. look for new evidence that may conflict with and possibly refute the prevailing interpretation. This is the essence of what we call the Scientific Revolution.

Viewed in this light, the obvious misperceptions behind the belief in the effectiveness of sacrifice and divination bear an awkward resemblance to the weird beliefs examined in many of my posts. Why did Steve Jobs try to cure his cancer with herbs and psychics? Why do people buy homeopathic medicines (and Boiron is worth 1.6 billion euro)? Why do people believe useless experts? Why did Othello kill Desdemona? Why did Arthur Conan Doyle believe in ghosts? Why did 9/11 truthers believe it was a conspiracy? Why do Islamists promote suicide bombing? It is tempting to call it lunacy. But it isn’t. It is misinterpretation of evidence.

The most pervasive pitfall in examining available evidence is the Confirmation Bias: focusing on evidence that supports the hypothesis under investigation, while neglecting, dismissing or obfuscating evidence that runs contrary to it. A proper experiment, correctly gathering and analysing the relevant evidence, can easily show the ineffectiveness of homeopathic medicine — in the same way as it would show the ineffectiveness of divination and sacrifice (however tricky it would be to test the Tlaloc hypothesis).

In our framework, PO=LR∙BO, where LR=TPR/FPR is the Likelihood Ratio, the ratio between the True Positive Rate — the probability of observing the evidence if the hypothesis is true — and the False Positive Rate — the probability of observing the same evidence if the hypothesis is false. The Confirmation Bias consists in paying attention to TPR, especially when it is high, while disregarding FPR. As we know, it is a big mistake: what matters is not just how high TPR is, but how high it is *relative to* FPR. We say evidence is confirmative if LR>1, i.e. TPR>FPR, and disconfirmative if LR<1. LR>1 increases the probability that the hypothesis is true; LR<1 decreases it. We cannot look at TPR without at the same time looking at FPR.
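As a minimal sketch of the relation PO=LR∙BO, the snippet below converts a base rate into Base Odds, multiplies by the Likelihood Ratio and converts the result back into a probability. The function names and the TPR, FPR and base-rate figures are illustrative assumptions, not numbers taken from the post.

```python
def likelihood_ratio(tpr: float, fpr: float) -> float:
    """LR = TPR / FPR."""
    return tpr / fpr


def posterior_probability(tpr: float, fpr: float, base_rate: float) -> float:
    """Apply PO = LR * BO and convert the posterior odds to a probability."""
    base_odds = base_rate / (1 - base_rate)                 # BO
    posterior_odds = likelihood_ratio(tpr, fpr) * base_odds  # PO
    return posterior_odds / (1 + posterior_odds)


# Hypothetical figures: the same high TPR, paired with three different FPRs.
neutral = posterior_probability(tpr=0.9, fpr=0.9, base_rate=0.5)       # LR = 1
confirmative = posterior_probability(tpr=0.9, fpr=0.3, base_rate=0.5)  # LR = 3
disconfirmative = posterior_probability(tpr=0.3, fpr=0.9, base_rate=0.5)  # LR = 1/3

print(f"LR = 1:   {neutral:.2f}")
print(f"LR = 3:   {confirmative:.2f}")
print(f"LR = 1/3: {disconfirmative:.2f}")
```

A TPR of 90% looks impressive on its own, but when FPR is also 90% the evidence leaves the probability of the hypothesis exactly where it started.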

How high does the probability of a hypothesis have to be for us to accept that it is true? Equivalently, how low does it have to be for us to reject the hypothesis, or to accept that it is false?

As we have seen, there is no single answer: it depends on the specific *standard of proof* attached to the hypothesis and on the utility function of the decision maker. For instance, if the hypothesis is that a defendant is guilty of a serious crime, a jury needs a very high probability of guilt — say 95% — before convicting him. On the other hand, if the hypothesis is that an airplane passenger is carrying a gun, a small probability — say 5% — is all a security guard needs in order to give the passenger a good check. Notice that in neither case is the decision maker saying that the hypothesis is true. What he is saying is that the probability is *high enough* for him to act as if the hypothesis were true. Such a threshold is known as the *significance level*, and the accumulated evidence that allows the decision maker to surpass it is itself called *significant*. We say that there is significant evidence to convict the defendant if, in the light of such evidence, the probability of guilt exceeds 95%. In the same way, we say that there is significant evidence to frisk the passenger if, in the light of the available evidence, the probability that he carries a gun exceeds 5%. In practice, we call the defendant ‘guilty’ but, strictly speaking, that is not what we are saying — in the same way that we are not saying that he is ‘innocent’ or ‘not guilty’ if the probability of guilt is below 95%. Even less are we saying that the passenger is a terrorist. What matters is the *decision* — convict or acquit, frisk or let go.

With this proviso, let’s examine the standard case in which we want to decide whether a certain claim is true or false. For instance, a lady we are having tea with tells us that tea tastes different depending on whether milk is poured in the cup before or after the tea. She says she can easily spot the difference. How can we decide if she is telling the truth? Simple: we prepare a few cups of tea, half one way and half the other, and ask her to say which is which. Let’s say we make 8 cups, and tell her that 4 are made one way and 4 the other way. She tastes them one after the other and, wow, she gets them all right. Surely she’s got a point?

Not so fast. Let’s define:

H: The lady can taste the difference between the two kinds of tea.

E: The lady gets all 8 cups right.

Clearly, TPR — the probability of E given H — is high. If she’s got the skill, she probably gets all her cups right. Let’s even say TPR=100%. But we are not Confirmation-biased: we know we also need to look at FPR. So we must ask: what is the probability of E given not-H, i.e. the lady was just lucky? This is easy to calculate: there are 8!/[4!(8–4)!]=70 ways to choose 4 cups out of 8, and there is only one way to get them all right. Therefore, FPR=1/70. This gives us LR=70. Hence PO — the odds of H in the light of E — is 70 times the Base Odds. What is BO? Let’s say for the moment we are prior indifferent: the lady may be skilled, she may be deluded — we don’t know. Let’s give her a 50/50 chance: a Base Rate BR=50%, hence BO=1 and PO=70. Result: PP — the probability that the lady is skilled, in the light of her fully successful choices — is PO/(1+PO)=70/71, i.e. 99%. That’s high enough, surely.
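The arithmetic above can be checked with a few lines of Python. This is only a sketch under the assumptions stated in the text (TPR=100% and a 50/50 prior); the variable names are mine.

```python
from math import comb

# Number of ways to pick which 4 of the 8 cups were made one way.
total_ways = comb(8, 4)

tpr = 1.0              # assumed in the text: a skilled lady gets every cup right
fpr = 1 / total_ways   # only one of the equally likely guesses is fully right
lr = tpr / fpr         # Likelihood Ratio

base_rate = 0.5                         # prior indifference
base_odds = base_rate / (1 - base_rate)
posterior_odds = lr * base_odds         # PO = LR * BO
pp = posterior_odds / (1 + posterior_odds)

print(f"ways = {total_ways}, LR = {lr:.0f}, PP = {pp:.1%}")
```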

But what if she made one mistake? Notice that, while there is only one way to be fully right, there are 4 ways to make 3 right choices out of 4, and 4 ways to make 1 wrong choice out of 4. Hence, there are 4×4=16 ways to choose 3 right cups and 1 wrong one. Therefore, FPR=16/70 and, still taking TPR=100% for simplicity, LR=4.4. Again assuming BO=1, this means PP=81%. Adding the 1/70 chance of a perfect choice, the probability of one or no mistake out of mere chance is 17/70 and LR=4.1, hence PP=80%. Is that high enough?
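To see where 16/70 and 17/70 come from, here is a small sketch that counts, for each possible number of correct picks, how many of the 70 selections produce it. The helper names are hypothetical, and TPR is kept at 100% with BO=1, which is an assumption carried over from the text rather than something the counting itself implies.

```python
from math import comb


def ways_with_k_right(k: int, per_kind: int = 4) -> int:
    """Selections with k right cups and (per_kind - k) wrong ones."""
    return comb(per_kind, k) * comb(per_kind, per_kind - k)


def posterior_from_fpr(fpr: float, tpr: float = 1.0, base_odds: float = 1.0) -> float:
    """PP from PO = (TPR / FPR) * BO."""
    posterior_odds = (tpr / fpr) * base_odds
    return posterior_odds / (1 + posterior_odds)


total = comb(8, 4)
counts = {k: ways_with_k_right(k) for k in range(5)}

fpr_exactly_one_mistake = counts[3] / total
fpr_at_most_one_mistake = (counts[3] + counts[4]) / total

pp_exactly = posterior_from_fpr(fpr_exactly_one_mistake)
pp_at_most = posterior_from_fpr(fpr_at_most_one_mistake)

print(counts)
print(f"exactly one mistake: FPR = {counts[3]}/{total}, PP = {pp_exactly:.0%}")
print(f"at most one mistake: FPR = {counts[3] + counts[4]}/{total}, PP = {pp_at_most:.0%}")
```

The counts over all five outcomes add up to 70, which is a quick sanity check that no selection has been missed or counted twice.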

I would say yes. But Ronald A. Fisher — the dean of experimental design and one of the most prominent statisticians of the 20th century — would have none of it.

More in the next post.

*Originally published at **Bayes**.*