3.) The Trouble with the p Value

By MARTIN REZNY

Missed the first part of this series about scientism? Go back.

I’m glad that someone has brought up the p value, because that’s indeed a very problematic issue, and the way it’s being used to infer the truthiness of experimental results is quite arbitrary, or, to use a better term, customary.

As the American Statistical Association concluded:

The widespread use of “statistical significance” (generally interpreted as “p ≤ 0.05”) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.

To put it as simply as possible, but not more, the p value has to do with the margin of error in a statistical experiment. Specifically, it’s the probability of getting a result at least as extreme as yours if the tested effect weren’t real, which is to say, it tells you how easily the result could be just a fluke, or something called an “artifact” (think of it as a statistical mirage). In any reasonably designed statistical test, the null hypothesis (the situation where whatever you’re testing has no effect) must be expressible as a prediction for the likely distribution or change of the tested variables. In other words, you need to be able to compare what the data should look like when your hypothesis is right versus when it isn’t.
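To make that less abstract, here’s a tiny Python sketch (the numbers are invented for illustration, not taken from any real experiment): suppose someone guessed 60 coin tosses out of 100 correctly, and we ask how often pure chance does at least that well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Null hypothesis: guessing is pure chance, so each guess is right 50% of the time.
n_tosses = 100
observed_correct = 60  # hypothetical observed result

# Simulate the null many times: how often does blind luck do at least this well?
simulated = rng.binomial(n=n_tosses, p=0.5, size=100_000)
p_value = np.mean(simulated >= observed_correct)

print(f"one-sided p value ~ {p_value:.3f}")  # around 0.03: unlikely under the null, but far from impossible
```

Note that a p value around 0.03 doesn’t mean the guesser has a 97% chance of being psychic; it only means that blind luck produces a score this good roughly 3 times in 100 tries.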

If, for instance, you’re testing whether a new drug works, you take two groups of patients diagnosed with the same illness; one gets the drug and the other gets a placebo (a pill, injection, or other procedure guaranteed to have no chemical effect on the illness). The placebo group is the null hypothesis situation, and if the drug group does any better in any measurable way than the placebo group, it may mean that the drug actually does something (which is the experimental hypothesis you’re trying to prove).
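A toy version of that comparison could look something like this in Python; the recovery scores are made up, and the Welch two-sample t-test is just one standard way of asking whether the difference between the groups could plausibly be a fluke.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Invented recovery scores: higher is better. The placebo group is the null-hypothesis situation.
placebo = rng.normal(loc=50, scale=10, size=100)
drug = rng.normal(loc=54, scale=10, size=100)  # the drug group does a bit better on average

# Welch's t-test: how surprising would a difference this big be if the drug did nothing?
result = stats.ttest_ind(drug, placebo, equal_var=False)
print(f"mean difference: {drug.mean() - placebo.mean():.1f}")
print(f"p value: {result.pvalue:.4f}")  # if it falls below the customary 0.05, the result gets called "significant"
```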

But that’s not really enough by itself as definitive proof, because every experiment has a margin of error that can be quite tricky to pin down. Maybe some patients were misdiagnosed, maybe some treatment was misapplied, maybe some of the patients have something going on that interacts with the drug that you don’t know about (don’t “control” for), maybe you’ve been plain unlucky, and so on. The more cases you involve in your study, the more patients (and whole experiments) in this example, the lower the relative percentage of cases in which something has gone wrong or weird, confusing the result of the test and maybe producing a false positive.

How many cases you include in the test is your “sample” size. As the sample increases, the margin of error shrinks, and with it the p value you’d get for any real effect (the chance that the result is just a misleading fluke). If you’re interested in the math of sample sizes and the related errors, check out this calculator; for the rest of you, it will suffice to say that if you tested your drug on only a dozen people, those could by chance be the 12 people, out of all the billions of humans on Earth, for whom it would have worked (or seemed to).
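For a rough feel of how the margin of error shrinks as the sample grows, here’s the standard back-of-the-envelope formula for a proportion, the same kind of arithmetic the calculator mentioned above does; the sample sizes below are just examples.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated from n cases."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (12, 100, 400, 1000, 10000):
    print(f"n = {n:>5}: about ±{margin_of_error(n):.1%}")
# a dozen cases leave a margin of roughly ±28%, a few hundred get you to about ±5%
```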

Fortunately, one doesn’t have to test anything on everyone, the whole test “population”, because statistically, a few hundred cases typically have a small enough margin of error, meaning a small enough chance of a misleading fluke result. If you, for example, claim that you can guess coin tosses better than chance, a single person can by pure luck guess correctly any number of times in a row. If you run this experiment with about 400 people who claim it, it only becomes interesting if more than a certain number of them do better or worse than chance (chance here being guessing half correct and half wrong).
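Here’s a small simulation of that setup, assuming purely for illustration that each of the 400 people guesses 100 tosses and that nobody actually has any ability at all.

```python
import numpy as np

rng = np.random.default_rng(2)
people, tosses = 400, 100

# Null hypothesis: nobody beats chance, every guess is a 50/50 shot.
correct = rng.binomial(n=tosses, p=0.5, size=people)

print(f"best guesser by pure luck: {correct.max()} / {tosses}")
print(f"guessers clearing a nominal 59/100 cutoff by luck alone: {(correct >= 59).sum()}")
# Roughly 4-5% of the 400 (about 18 people) clear that cutoff even though nobody has any ability,
# which is why the group as a whole has to beat chance by a clear margin before it gets interesting.
```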

Typically, you’d see the famous bell curve, with most cases being close to the null hypothesis and a few “outliers” lying at the extreme ends. So far so good, but that’s only where the real problems begin. Even after you’ve made sure to control for all possible sources of error or interference, and after you’ve made sure your sample of cases is big enough and selected representatively (not cherry-picked or otherwise distorted), measuring how often, how many times, or how much something happens, as opposed to how often or how much it should have happened, may still not actually be proof of anything.

It only means that the result is “statistically significant”, and even that only in a particular sense within statistics, in the school known as “frequentist inference”. It shows at best a correlation, not necessarily causation, meaning that yes, you did something and something unlikely happened afterwards, but that’s not solid proof of a link between the two. It also says nothing about the so-called “effect size”, meaning it doesn’t tell you how much of that something happened.
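To see the difference between “significant” and “large”, here’s a sketch in which a genuinely tiny effect produces an impressively small p value simply because the sample is enormous; Cohen’s d is one common measure of effect size, and the numbers are again invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# A tiny true effect (0.5 points on a scale that varies by about 10), but a huge sample.
a = rng.normal(loc=50.0, scale=10, size=200_000)
b = rng.normal(loc=50.5, scale=10, size=200_000)

result = stats.ttest_ind(b, a, equal_var=False)
cohens_d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)

print(f"p value: {result.pvalue:.1e}")  # astronomically small, hence "significant"
print(f"Cohen's d: {cohens_d:.2f}")     # around 0.05, i.e. a negligible effect
```

Whether that half-point difference matters to anyone is a practical question the p value simply doesn’t answer.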

What this means is that many studies in, say, economics or psychology that you may have read about could have been based on these p value proofs, sometimes with only borderline small enough p values. Maybe they were showing something that’s really the result of something else, showing a “proven” effect that’s negligibly small, or still showing only a statistical artifact, because there’s always randomness, coupled with bad test design choices. In one of the next articles in this series, I’ll also specifically address the problem of studies like these in the social sciences often not even being replicable. Skepticism is certainly warranted in this case, but rarely present.

Wanna read the next one about the neutrality of science? Click away.
