More Outcomes, More Problems: A Practical Guide to Multiple Hypothesis Testing in Impact Evaluations (Part 3)

Daniel Stein
IDinsight Blog
Nov 12, 2019

This is part three of a three-part series on multiple hypothesis testing in impact evaluations. In the first post, I wrote about when multiple hypothesis testing is necessary. In the second post, I wrote about strategies to avoid multiple hypothesis testing. And in this third post, I’ll speak about common methods for multiple hypothesis testing.

A tablet with SurveyCTO from IDinsight’s GiveWell Beneficiary Preferences project. ©IDinsight/Emily Coppel

As outlined in parts 1 and 2 of this series, researchers frequently attempt to design their studies to avoid multiple hypothesis testing. But if you are in a situation where you definitely need to do multiple hypothesis testing, what next? (For instance, let’s say the program you are testing will be scaled up if it improves household consumption or children’s nutrition.) There are a lot of different options for procedures you could implement, and the norms in economics seem to be continually evolving. In this post I try to guide readers towards the current norms (as far as I can tell).

The first choice to make is whether you want to control the “family-wise error rate” (FWER) or the “false discovery rate” (FDR). Let’s say we accept a probability of type I error (alpha) of 5%. With an FWER correction, we adjust our rejection criteria such that there is less than a 5% chance that any of our true null hypotheses is incorrectly rejected. An FDR correction is less conservative: here we adjust our rejection criteria such that, in expectation, less than 5% of the hypotheses we reject are false discoveries.
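In symbols (standard notation, not from this post): if V is the number of true nulls we incorrectly reject and R is the total number of rejections, FWER control means Pr(V ≥ 1) ≤ 0.05, while FDR control means E[V / max(R, 1)] ≤ 0.05, i.e. in expectation at most 5% of the hypotheses we reject are false discoveries.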

So which one should you use? FWER control is in general more conservative (though see some discussion of this later), and can cause pretty drastic reductions in power as the number of hypotheses increases. Therefore, it is mostly used when the number of hypotheses is relatively small. FDR control lets power stay relatively constant as the number of hypotheses increases, and is popular in cases with many tests (for instance, in detecting matches in the human genome, where there may be 10,000 simultaneous tests). But where is the threshold? I don’t think there is a good answer to this, but I do think there has been a recent shift in the norms of economics towards using FDR procedures, even with only a handful of tests being done.

For instance, consider the unconditional cash transfer literature, which is a common case where multiple hypothesis testing is necessary. Early papers (such as Haushofer and Shapiro [2016]) corrected for FWER when comparing across their main outcomes. However, more recent papers in the literature [such as Baird et al (2019), Haushofer et al (pre-analysis plan from 2017, paper forthcoming), and McIntosh and Zeitlin (in the forthcoming version of this 2018 working paper)] correct instead for FDR. Given that FDR control generally gives more power and is easier to implement, it seems like it could become the default choice for researchers going forward.

Now, how does one do these corrections in practice?

FWER

Any textbook that discusses FWER control will probably cover the Bonferroni correction (along with its cousins like the Holm-Šidák correction). In practice, though, these corrections are usually too conservative, because they take no account of correlation between outcome variables. With even a small number of outcome variables, power decreases rapidly, so simple approaches like Bonferroni are often not practical.
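To make this concrete, here is a minimal sketch of the Bonferroni and Holm adjustments in Python (the p-values are made up for illustration):

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0_i if p_i <= alpha / m; controls FWER under any dependence."""
    pvals = np.asarray(pvals)
    return pvals <= alpha / len(pvals)

def holm(pvals, alpha=0.05):
    """Holm's step-down procedure: uniformly more powerful than Bonferroni."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    reject = np.zeros(m, dtype=bool)
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first hypothesis that fails its threshold
    return reject

# Five outcomes with made-up p-values
p = [0.001, 0.011, 0.02, 0.03, 0.8]
print(bonferroni(p))  # [ True False False False False]
print(holm(p))        # [ True  True False False False]
```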

If your outcome variables are positively correlated, you will gain much more power by using a procedure that takes this correlation into account. The go-to approach seems to be the resampling technique suggested by Westfall and Young (1993). Anderson (2008) gives a clear description of how to implement this procedure in practice, and there is also a user-written Stata command ‘wyoung’ (Reif [2018]) that provides an accessible implementation.
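To give a flavor of the resampling idea, here is a simplified single-step “minP” permutation adjustment in Python. This is a stripped-down cousin of the Westfall-Young step-down procedure, not a reimplementation of it or of wyoung; the function and variable names are mine, and a real application would use the full step-down version and a resampling scheme matched to the study design.

```python
import numpy as np
from scipy import stats

def minp_adjusted_pvalues(y, treat, n_perm=2000, seed=0):
    """Single-step minP permutation adjustment (illustrative sketch).
    y: (n, k) matrix of outcomes; treat: (n,) 0/1 treatment indicator.
    Permuting treatment preserves the correlation between outcomes,
    which is what buys back power relative to Bonferroni."""
    rng = np.random.default_rng(seed)
    n, k = y.shape

    def raw_pvals(t):
        # simple two-sample t-test p-value for each outcome
        return np.array([stats.ttest_ind(y[t == 1, j], y[t == 0, j]).pvalue
                         for j in range(k)])

    p_obs = raw_pvals(treat)
    min_p = np.empty(n_perm)
    for b in range(n_perm):
        min_p[b] = raw_pvals(rng.permutation(treat)).min()

    # adjusted p-value: share of permutations whose *smallest* p-value
    # is at least as extreme as the observed p-value for that outcome
    p_adj = np.array([(min_p <= p).mean() for p in p_obs])
    return p_adj, p_obs
```

Rejecting the hypotheses with adjusted p-values below 0.05 then controls the FWER at 5%, and when outcomes are highly correlated the adjustment costs much less power than Bonferroni.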

FDR

The go-to procedure for controlling FDR appears to be the straightforward step-up procedure developed by Benjamini and Hochberg (1995), or the “sharpened” version in Benjamini, Krieger, and Yekutieli (2006). The latter is a bit more powerful but requires stronger assumptions on the correlation structure of your outcomes. (The more powerful 2006 version seems to be generally accepted, so I’d go with it if I had no reason to believe its assumptions were violated.) Again, these procedures are clearly described in Anderson (2008). As for implementation, Michael Anderson has some helpful Stata code on his webpage, and there is also a user-written Stata command from Roger Newson called ‘multproc’ that can run these procedures. R users can check out the multtest package, which implements FDR corrections as well as the FWER corrections described earlier.
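For comparison, here is a minimal sketch of the original Benjamini-Hochberg (1995) step-up procedure on the same made-up p-values (the sharpened 2006 version adds a preliminary step that estimates the share of true nulls, which I have not shown):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: find the largest k with p_(k) <= (k/m)*q and reject
    the k hypotheses with the smallest p-values."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    passes = pvals[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.where(passes)[0].max()  # largest rank passing its threshold
        reject[order[:k + 1]] = True
    return reject

p = [0.001, 0.011, 0.02, 0.03, 0.8]
print(benjamini_hochberg(p))  # [ True  True  True  True False]
```

With these made-up p-values, Bonferroni rejects one hypothesis, Holm two, and BH four, which illustrates the power difference described above.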

Finally, a mystery. I mentioned above that FDR corrections will in general retain more power than FWER corrections. However, this claim runs into a problem when comparing the two “standard” approaches in impact evaluations described above. The Westfall-Young procedure takes the correlation of the outcome variables into account: in the edge case where the outcomes are perfectly correlated, inference should be identical with and without the multiple hypothesis correction. The FDR procedures described above, by contrast, do not take correlation of outcomes into account, so they will still reduce power even with perfectly correlated outcomes. So does controlling for FDR actually give more power than controlling for FWER? Probably not in all cases. This leads to two questions:

1. At what level of correlation between outcomes does a Westfall-Young FWER correction give more power than a BH FDR correction?

2. Is there a procedure for controlling FDR that takes the correlation structure of outcomes into account, so that we don’t face this trade-off?

I don’t know the answers to the above questions. Any commenters have ideas? Otherwise, I’ll have to run some simulations; a rough starting point is sketched below.
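For anyone who wants to beat me to it, here is the kind of simulation I have in mind, as a rough Python sketch. It reuses the illustrative minp_adjusted_pvalues and benjamini_hochberg functions from above (so it inherits their simplifications), puts the same true effect on every outcome, and compares the average share of effects detected at different levels of correlation:

```python
import numpy as np

def simulate_power(rho, n=400, k=5, effect=0.2, n_sims=100, seed=0):
    """Average share of (truly non-null) outcomes detected under a minP-style
    FWER adjustment versus BH, when outcomes have pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    cov = np.full((k, k), rho) + (1 - rho) * np.eye(k)
    power_fwer, power_fdr = 0.0, 0.0
    for _ in range(n_sims):
        treat = rng.integers(0, 2, size=n)
        y = rng.multivariate_normal(np.zeros(k), cov, size=n)
        y += effect * treat[:, None]  # same true effect on every outcome
        p_adj, p_raw = minp_adjusted_pvalues(
            y, treat, n_perm=200,  # small numbers to keep runtime manageable
            seed=int(rng.integers(1_000_000_000)))
        power_fwer += (p_adj <= 0.05).mean()
        power_fdr += benjamini_hochberg(p_raw, q=0.05).mean()
    return power_fwer / n_sims, power_fdr / n_sims

for rho in (0.0, 0.5, 0.9):
    print(rho, simulate_power(rho))
```

Varying the effect sizes across outcomes, and the number of outcomes, would also matter; this is only meant as a starting point.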

I hope this series of posts has been helpful. The main takeaway is that researchers should always consider multiple hypothesis testing during their research design and plan accordingly. This might mean taking steps during pre-specification to avoid the need for multiple testing, or alternatively specifying which corrections will be done at the analysis stage and ensuring that the design still has enough power.

References:

Anderson, M. L. (2008). “Multiple Inference and Gender Differences in the Effects of Early Intervention: A Reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects.” Journal of the American Statistical Association, 103(484), 1481–1495.

Baird, S., McIntosh, C., and Özler, B. (2019). “When the Money Runs Out: Do Cash Transfers Have Sustained Effects on Human Capital Accumulation?” Journal of Development Economics, 140, 169–185.

Benjamini, Y., and Hochberg, Y. (1995). “Controlling the False Discovery Rate.” Journal of the Royal Statistical Society, Series B, 57, 289–300.

Benjamini, Y., Krieger, A., and Yekutieli, D. (2006). “Adaptive Linear Step-Up Procedures That Control the False Discovery Rate.” Biometrika, 93, 491–507.

Haushofer, J., and Shapiro, J. (2016). “The Short-Term Impact of Unconditional Cash Transfers to the Poor: Experimental Evidence from Kenya.” Quarterly Journal of Economics, 131(4), 1973–2042.

Haushofer, J., Miguel, E., Niehaus, P., and Walker, M. (2017). “GE Effects of Cash Transfers: Pre-Analysis Plan for Household Welfare Analysis.”

McIntosh, C., and Zeitlin, A. (2018). “Benchmarking a Child Nutrition Program Against Cash: Experimental Evidence from Rwanda.” Working paper.

Reif, J. (2017). “WYOUNG: Stata Module to Perform Multiple Testing Corrections.” Statistical Software Components S458440, Boston College Department of Economics; revised 6 June 2018.

Westfall, P., and Young, S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. John Wiley & Sons.


Daniel Stein

Chief Economist at IDinsight. Passion for generating and using research to drive better policy.