TheAnal-lyticPhilosopher
10 min read · May 18, 2024

Severely Lacking Power — the Trivialities and Absurdities of Severity Testing

Since severity testing is an oxymoronic hybrid of power determination and null hypothesis testing, yielding a ‘powerless’ and uninformative p-value (“severity”), one could expect that interpreting its results would lead, minimally, to things already known from the p-value of the original hypothesis test, or maximally, to absurdities. Severity fulfills both of these expectations, and it might even exceed them when it comes to rejecting the original statistical null in well-powered versus under-powered tests.

First, consider a statistical test in which the original statistical null is not rejected (Ms. Rosy, the authors’ stock example). One may still wonder whether a small discrepancy from the null exists; that is, one may still want to know, for substantive reasons, how well one can infer that the true parameter lies somewhere within this discrepancy (a bound of 12.1 or 12.2, say, against a null of 12).

So, assume one fails to reject the null with a p-value of .07, where the parameter estimate would be 12.3 (these are the values in the Ms. Rosy example).

Assume too that one has a high-powered test: it could reasonably detect a relatively small effect size, one larger than the discrepancy of interest from the null, though not quite so well one as small as the upper boundary of that discrepancy.[1]

Under the normal interpretation of p-values and power, one would be foolish to “affirm the null” or to say that the true parameter probably falls within a discrepancy of interest, because given the data there seems to be an effect a good bit larger than the interval, even though it’s not technically significant. One faces a decision about how to proceed next, to be sure, but regardless of the course one takes, the options — under standard interpretations of p-values and power — wouldn’t include affirming the null more confidently.

And severity testing would apparently agree. In this case, SIA (b) would apply, as it states that if there is a very low probability that x would have been larger than it is, then the assertion that the null is true or that the true value of the parameter falls within the interval of interest can only be made with low severity — to wit, it really shouldn’t be made at all (Ms. Rosy gets a severity value of .16).
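
For the concretely minded, here is a minimal sketch of that computation. The standard error is not stated in the figures above, so assume one of 0.2 (say, σ = 2 and n = 100); this is an assumption, but it reproduces the p ≈ .07 and severity ≈ .16 quoted:

```python
from scipy.stats import norm

mu0 = 12.0    # original null
x_bar = 12.3  # observed estimate (the Ms. Rosy case)
se = 0.2      # assumed standard error (e.g., sigma = 2, n = 100)

# p-value of the original one-sided test: P(X-bar >= 12.3; mu = 12)
p_value = 1 - norm.cdf((x_bar - mu0) / se)   # ~0.067

# severity for inferring mu <= 12.1 after the non-rejection:
# SEV(mu <= 12.1) = P(X-bar > 12.3; mu = 12.1)
mu1 = 12.1
severity = 1 - norm.cdf((x_bar - mu1) / se)  # ~0.159

print(f"p = {p_value:.3f}, SEV(mu <= 12.1) = {severity:.3f}")
```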

True enough, one might say — but the p-value already tells you that. Since the result isn’t significant, one shouldn’t reject the null, and since the p-value is low but not quite significant, one certainly shouldn’t affirm the null either.

But severity adds something more; it says one shouldn’t affirm that the parameter falls within the discrepancy of interest. That is, it says the parameter might even be larger than that, and it gives a value for the worth of this affirmation.

Per what was noted last time, however, severity doesn’t really reveal this; it doesn’t reveal anything beyond the original p-value: it’s just a second p-value pro-rated for the ‘discrepancy null’ in a second null hypothesis test. As such, it too will always be low when the declining p-value suggests that affirming the null, or small discrepancies from it, is an increasingly bad idea. In fact, the original p-value may be more valuable in this respect, since severity always sits above p (severity > p), and a lower p-value makes the point more forcefully than the relatively higher severity.
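
A quick numerical check of this point, under the same assumed setup (null at 12, discrepancy bound at 12.1, standard error 0.2): the “severity” is just the p-value recomputed against 12.1, so it declines in lockstep with p while always sitting above it.

```python
from scipy.stats import norm

mu0, mu1, se = 12.0, 12.1, 0.2   # assumed values, as above

for x_bar in (12.1, 12.2, 12.3, 12.4):
    p = 1 - norm.cdf((x_bar - mu0) / se)    # p-value against the original null
    sev = 1 - norm.cdf((x_bar - mu1) / se)  # "severity" = p-value against 12.1
    print(f"x_bar = {x_bar:.1f}   p = {p:.3f}   SEV(mu <= 12.1) = {sev:.3f}")
```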

So even where severity apparently applies, it is no more helpful than the original p-value, and perhaps even less helpful, considering that it carries less information about the original null than the original p-value does.

But consider the recommendation in SIA (a), which states: if there is a very high probability that x would have been larger than it is, then one can assert with high severity that the null is true and/or that the true parameter estimate lies within the interval from the null.

Except the problem is, this isn’t going to happen in a statistical hypothesis test: as x increases, the p-value decreases, and so does severity. Thus the condition either (1) can’t be met, because the probability that x might be larger is always going to be less than the probability of x, or (2) amounts to the nonsense inference that because x is small, severity and p are high, therefore severity for a slightly larger x will still be quite high, therefore affirm the null and/or the interval from it with “high” severity. This inference, however, is precisely the fallacy of naively ‘affirming the null’ that severity seeks to avoid. In fact, it commits this fallacy with greater conviction, as it were, because severity by stipulated computation is always going to be larger than the p-value; thus the evidence for the severe affirmation appears greater.
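
To see how idle that condition is, run the same assumed setup with estimates at or below the null: the probability that x would have been larger is “very high” only when the observed estimate is small, and in exactly those cases the ordinary p-value is already large, so the “severe” affirmation adds nothing.

```python
from scipy.stats import norm

mu0, mu1, se = 12.0, 12.1, 0.2   # assumed values, as above

for x_bar in (11.8, 11.9, 12.0, 12.1):
    p = 1 - norm.cdf((x_bar - mu0) / se)    # p-value against the original null
    sev = 1 - norm.cdf((x_bar - mu1) / se)  # SEV(mu <= 12.1)
    print(f"x_bar = {x_bar:.1f}   p = {p:.3f}   SEV(mu <= 12.1) = {sev:.3f}")
```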

But it’s not. Using severity to affirm a null in a test with a high p-value only commits the original fallacy of affirming the null with greater computational force than the original p-value — or alternatively, severity testing commits this fallacy for a second null tested, the one represented by the upper bound of the discrepancy interval of interest.

In either interpretation, affirming the null is the same fallacy. As a post-hoc, data-driven test, severity should minimally reveal something more than is already known and maximally avoid endorsing an inference as bad as the fallacy it seeks to correct.

Under SIA (a) and (b), it fails in both respects.

Second, consider the case in which the null is rejected. Use the same stipulations as above, except now p = .03, i.e., x is greater than 12.4. Here what one is asking is reversed. Instead of asking (1) with how much warrant one can say the true parameter falls within an interval from the null, one is asking (2) with how much warrant one can say the effect is greater than a small discrepancy from the null.

Never mind for now the obvious answer of measuring the effect size from the data and seeing that it already is larger than the “small” discrepancy of substantive interest (by the stipulations in the exemplary uses of severity, it always will be). Assume one lacks the wherewithal to do this, or one still wants to know if doing this is a good idea. What does severity testing recommend for interpreting the data post-test?

Under SIR (a), severity testing stipulates that if there is a very low probability of an effect so large as x, then the assertion that the true parameter falls outside of the interval of interest passes with high severity.[2] That much seems right, as entirely unhelpful as it is. For there will always be a relatively low probability of getting an effect as large as, or larger than, x; that is why a test is statistically significant. So, this stipulation will always hold, and severity will always start out high and only get higher the larger the effect gets (it starts at about .93 and .85 for 12.1 or 12.2, respectively).
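
Again for concreteness, a sketch under the same assumptions (standard error 0.2), taking the estimate to be 12.4. That value is itself an assumption, but it gives p ≈ .02, in the neighborhood of the .03 quoted, and reproduces the ~.93 and ~.85 severities:

```python
from scipy.stats import norm

mu0, se = 12.0, 0.2   # assumed values, as above
x_bar = 12.4          # assumed just-rejecting estimate

p_value = 1 - norm.cdf((x_bar - mu0) / se)   # ~0.023

# severity for inferring mu > mu1 after rejecting H0:
# SEV(mu > mu1) = P(X-bar <= 12.4; mu = mu1)
for mu1 in (12.1, 12.2):
    sev = norm.cdf((x_bar - mu1) / se)
    print(f"p = {p_value:.3f}   SEV(mu > {mu1}) = {sev:.3f}")   # ~0.933 and ~0.841
```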

In this respect, though, severity only reiterates what is already known from simply measuring the effect and looking at its p-value; in another respect, it is even less informative than that, since the p-value already indicates the reality of the effect with a probability of error smaller than the equivalent probability of erroneously applying severity (i.e., severity’s equivalent of a Type I or Type II error).

So, why use “severity” at all, except to be unnecessarily conservative over an interval that may or may not exist?

In fact, the only conceivable situation in which severity testing might determine something “new” would be if the interval of substantive interest is larger than both the effect size required for statistical significance and the effect size determined by the data. Never mind for the moment that this most likely (but not always) changes the exemplary uses for severity; assume it occurs in an extremely high-powered test that finds a tiny significant effect within the discrepancy of interest. In this limiting case, using severity would essentially amount to inferring whether, based on a small effect, a larger effect is warranted from the same data.

But that inference would be absurd; it would be like the drunk guy looking only under the streetlight for his lost keys because that is where the light is. Since calling that a fallacy would be charitable, under SIR (a) severity is either trivially applicable or completely reckless.

Under SIR (b), the situation apparently changes. Here it is stipulated that if there is a very high probability of obtaining so large an effect as x, then the assertion that the true parameter falls outside the interval of interest passes with low severity.

The stipulation is asinine. Obviously it will never apply because the p-value for the significant effect is already very low, otherwise it wouldn’t be statistically significant (and all else being equal, larger effects mean even lower p-values and therefore higher severity under SIR (a) — so it’s contradictory too). Equally obvious: the effect is ‘outside’ the interval of interest because the interval of interest is smaller than the effect found, and the original test already suggests this effect is real more clearly than severity suggests it may not be…and so forth.

One wonders that this condition could even be proposed.

So much for the four conditions of severity testing. Since it has already been shown conceptually, and to a lesser degree mathematically, how severity testing fails to live up to its aspirations as a power-determined, post-hoc data analytic strategy, going through each of these scenarios and determining their consequences may seem unnecessary.

But as suggested last time, the intuition severity seeks to develop is an important one, and lest one be tempted by it, it is important to see those consequences brought out in their own terms. For one could always be wrong about rejecting a null hypothesis and thus affirming the reality of an effect, just as one can always be wrong about not finding an effect and thus assuming, however tentatively, that one does not exist. One might want a reliable way of knowing how right or how wrong one is, a way more reliable than the original statistical test — that is, one may want a post-data analytic strategy based on a sound meta-statistical principle.

Severity, however, fails as this strategy, and fails completely, in that it offers no more information than the original p-value, and frankly even less, if one takes seriously its full implications or the absurd consequences in two of its four applications. It will be asserted here without argument that all post-data analytic strategies like severity — i.e., using the data from the test to assess the reliability of the test itself — are doomed to fail.

Now, it could be pointed out that so far nothing about severity testing exceeds the expectations one might have of a test that tries to wring more information out of data with a post-hoc hypothesis test that merely sets a new null at the upper boundary of a discrepancy of interest, one deviating slightly from the original null, then tests that null and calls that result “new information” about the original hypothesis. But one would be wrong. For severity offers an original interpretation of power, and in this interpretation the lackluster consequences of severity testing reach new heights.

For instance, severity testing, as described elsewhere, stipulates that “when a test has very high power for tiny discrepancies from the null, […] rejection of the null provides less (not more) evidence for the presence of a substantive discrepancy.” Conversely, a test with low power to detect discrepancies from the null finds “more” evidence of the discrepancy when it rejects the null (Spanos 2008). Or as stated in the original paper: “the higher the power of the test to detect discrepancy y, the lower the severity for inferring u>u1 on the basis of a rejection of H0” (Mayo and Spanos, 2006).
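
The inverse relation is easy to exhibit. Under the same assumed setup (null at 12, σ = 2, a one-sided test at α = .025; these values are assumptions, since the quoted passages don’t fix them), take a result that just reaches significance and let n grow: power against a discrepancy of 0.1 rises toward 1 while the severity for inferring μ > 12.1 from that bare rejection falls toward 0.

```python
from scipy.stats import norm

mu0, mu1, sigma = 12.0, 12.1, 2.0      # assumed values
z_alpha = norm.ppf(0.975)              # one-sided test at alpha = .025 (assumed)

for n in (100, 400, 1600, 6400):
    se = sigma / n ** 0.5
    cutoff = mu0 + z_alpha * se                  # a just-significant x_bar
    power = 1 - norm.cdf((cutoff - mu1) / se)    # power to detect mu = 12.1
    sev = norm.cdf((cutoff - mu1) / se)          # SEV(mu > 12.1) at that x_bar
    print(f"n = {n:5d}   power = {power:.3f}   SEV(mu > 12.1) = {sev:.3f}")
```

On this accounting, the bare rejection from the larger, more powerful sample licenses the weaker inference to a discrepancy beyond 12.1.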

Seriously? So according to severity logic, if an instrument very sensitive to effects in a wide range of sizes, including very small ones, finds an effect larger than the small discrepancy of interest, then that is less evidence that this effect is larger than the smaller discrepancy. And conversely, if an instrument less sensitive to effects in general — to wit, it has poor resolution for detecting things as small as the discrepancy — finds an effect larger than the discrepancy, that is more evidence that the true effect is larger than the discrepancy. The less reliable instrument becomes more reliable, and vice versa?! This turns the standard understanding of power — and common sense — on its head.

As a principle for guiding inferences, it amounts to looking in the light of the proverbial streetlight after taking one’s glasses off, or even, perhaps, putting on glasses with the wrong prescription. That effort should speak for itself, and that there is a large-n “paradox” that could be noted (as the authors do) is neither here nor there for this issue. That merely means one must always be mindful of distinguishing substance from significance. It certainly doesn’t mean that what one finds in a powerful test is less reliable than what is found in a test less sensitive to finding effects [3].

Whatever one might make of this analysis — and in a legitimate sense it seems downright abusive — a beating such as this isn’t solely the fault of the deliverer. Severity testing brings it on itself as a post-data analytic strategy because of its completely unhelpful or plainly absurd consequences — consequences that really should have been apparent to anyone not so wrapped up in its inane logic as to say this to someone else: “The only way to go is to start with sound principles of statistical reasoning and let them guide you! A poor way to proceed, I hope to convince you, is starting with an invented computation (which might bear some relation to a standard one) and getting all knotted up in paradox, concluding (erroneously) that the standard notion is paradoxical” (Mayo, Error Statistics blog, Nov 12, 2011).

Some errors, it seems, are ordained in the exhortations against making them.

[1] Never mind the objection that one shouldn’t be running a test that isn’t ‘powered’ to the discrepancy of interest, as opposed to some other effect size; the authors’ examples aren’t even this well powered. But let’s assume the stronger case that they are.

[2] The roles of “high” and “low” in interpreting significant results are asymmetrical with their use in interpreting non-significant results because the authors redefine severity as 1 − severity when interpreting significant results. Go figure.

[3] Or carry the logic to its full consequences: sampling most of the population yields the least reliable parameter estimates, while sampling 2 or 3 instances yields the most reliable. It’s just ridiculous.