More Outcomes, More Problems: A Practical Guide to Multiple Hypothesis Testing in Impact Evaluations (Part 2)
This is part two of a three-part series on multiple hypothesis testing in impact evaluations. In the first post, I wrote about when multiple hypothesis testing is necessary. In this second post, I write about strategies to avoid multiple hypothesis testing. And in the third post, I'll cover common methods for multiple hypothesis testing.
Correcting for multiple hypotheses can drastically decrease the power of an evaluation, and it can also be kind of a pain to carry out. Really, no one ever wants to do it. So, what steps can you take to avoid it? As outlined in Part 1 of this series, the most common situation in which the need for multiple hypothesis tests comes up is when an intervention has many outcomes of interest, and one wants to draw a conclusion (such as declaring the intervention a “success”) as long as there are significant treatment effects on any of the outcomes. The key is avoiding this situation. We discuss two common solutions: indexing and pre-specification.
Indexing
The first possibility is indexing, meaning you create a single measure that is a combination of all your outcomes. This normally involves two steps: first you standardize all your outcomes (usually to mean zero and standard deviation of one) so that they are comparable, and then you create a weighted average of all the outcomes. Voila, many outcomes have become one!
But there are a couple of complications here. First, how do you determine the weights? One way would be for the researcher to simply pre-specify the weights depending on some measure of “importance”, although this may or may not make sense depending on the context. Another approach would be to use an algorithm based on the variance-covariance matrix of the outcomes, such as inverse covariance weighting or taking the first principal component. These approaches try to create a single variable that captures as much of the variation in the set of outcomes as possible. (I defer to this great post by Cyrus Samii to explain the difference between the two approaches.)
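To make those two steps concrete, here is a minimal sketch in Python. The outcome names and data are simulated purely for illustration, and the code shows both a simple pre-specified (equal-weight) average and inverse covariance weighting; treat it as a sketch of the idea rather than a full recipe.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Hypothetical outcome data; in practice these columns come from your survey
df = pd.DataFrame({
    "transport_time": rng.normal(30, 10, n),
    "latrine_access": rng.binomial(1, 0.6, n).astype(float),
    "vegetable_income": rng.normal(100, 40, n),
})
outcomes = list(df.columns)

# Step 1: standardize each outcome to mean 0 and standard deviation 1
# (a common variant standardizes against the control-group mean and SD instead)
z = (df[outcomes] - df[outcomes].mean()) / df[outcomes].std()

# Step 2, option A: pre-specified weights, here simply an equal-weighted average
equal_weight_index = z.mean(axis=1)

# Step 2, option B: inverse covariance weighting, with weights proportional to the
# row sums of the inverted covariance matrix, which down-weights outcomes that
# are highly correlated with the rest of the set
sigma_inv = np.linalg.inv(z.cov().to_numpy())
w = sigma_inv.sum(axis=1)
icw_index = z.to_numpy() @ (w / w.sum())
```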
But even more important is whether an index is really useful or not. Especially when speaking with policy-makers, interpreting results from an index is extremely difficult. Imagine the following conversation: “Yes, Minister, your community-driven development program was successful in increasing an inverse-covariance-weighted index of transport time, latrine access, and vegetable income.” I imagine there will be a lot of follow-up questions. Practically, it should be acceptable for a researcher to pre-specify an index as the primary outcome variable, yet still report (unadjusted) estimates of treatment effects for each of the index components. However, these component regressions would have to be viewed as “exploratory” analysis whose purpose is to explain the drivers of the index. And it certainly wouldn’t be valid to trumpet significant results on certain components if there was no significant result on the index itself; that would put us right back in the p-hacking situation we were trying to avoid.
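Continuing the hypothetical sketch above, that reporting pattern might look something like this: one confirmatory regression on the pre-specified index, followed by unadjusted component regressions that are explicitly labeled as exploratory.

```python
# Continues the index-construction sketch above (df, outcomes, icw_index, rng, n)
import statsmodels.formula.api as smf

df["icw_index"] = icw_index
df["treatment"] = rng.binomial(1, 0.5, n)  # hypothetical random assignment

# Primary, confirmatory estimate: the treatment effect on the index
index_fit = smf.ols("icw_index ~ treatment", data=df).fit()
print(index_fit.summary().tables[1])

# Exploratory follow-up: unadjusted effects on each component, used to explain
# what drives the index rather than to declare success on their own
for y in outcomes:
    fit = smf.ols(f"{y} ~ treatment", data=df).fit()
    print(y, round(fit.params["treatment"], 3), round(fit.pvalues["treatment"], 3))
```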
Overall, the indexing solution is most attractive in a couple of cases. The first is when all outcomes are closely related, as this makes the index easier to interpret and explain. For instance, let’s say you have a survey module that looks at different dimensions of food security (such as going to bed hungry, skipping meals, consuming animal-source protein, etc.). Here it makes sense to have an index because these questions are all trying to get at the same fundamental outcome, which is easily explained as “food security”. Another place where indices are likely useful is for measuring programs with uncertain interventions, each of which will lead to different outcomes (such as community-driven development programs). In this case, an evaluation will likely not be powered for any conceivable individual outcome, because all communities are working on different projects with different goals. Therefore it is likely necessary to create an outcome variable that draws from many possible outcomes, which is exactly what an index gives you.
Pre-specifying
Most self-respecting impact evaluations these days have a pre-analysis plan, and careful pre-specification can decrease or even eliminate the need for multiple hypothesis tests. Simply having a pre-analysis plan should rule out the most egregious forms of p-hacking, where the researcher tests many outcomes and just reports the significant ones. Since the researcher is bound by the outcomes they listed in the pre-analysis plan, they must report all of them. (A good explanation and example of this approach is found in Casey et al. (2012), gated, ungated. Additionally, there is a post this week from David McKenzie at the Development Impact blog with some great dissection of pre-analysis plans.)
But if there are still multiple pre-specified outcomes of interest, then there may still be a need to correct for multiple hypotheses. One common method to get around this is to pre-specify a single outcome as the “primary” outcome, and other outcomes as “secondary” outcomes. In this case, the researcher is essentially putting all his/her chips behind this one outcome, and the intervention may only be described as a “success” if there are significant effects on this primary outcome. Secondary outcomes can provide additional context and color, but in general cannot be used independently as measures of success.
When is this appropriate? Well, it has to be a case where the researcher can make the argument that this single outcome is simply more important than all the others. In a decision-focused evaluation (research focused on a specific implementer’s needs in a specific context), having a single primary outcome makes sense if the policy-maker can be swayed by changes in just this one outcome. For other types of research, the focus on a single outcome may be justified by evidence from the theoretical or empirical literature. For instance, household consumption is generally seen as the most reliable survey-based measure of welfare, as it tends to be correlated with many outcomes policy-makers care about (Meyer and Sullivan 2003). Therefore, for some interventions (such as cash transfers) that are expected to affect consumption as well as other measures of well-being, it may be appropriate to look at consumption as the sole primary outcome.
What if you just can’t pre-specify one primary outcome, but can get it down to a very small number, like two or three? As discussed in Part 1, if these outcomes are influencing different decisions, then there is no need for multiple hypothesis corrections. If you as a researcher are convinced this is the case, I would recommend putting this justification into your pre-analysis plan, and hopefully this will convince an editor that you don’t need to do the corrections.
For research written for a general audience, the norms are somewhat unclear. It seems at least somewhat accepted that if researchers pre-specify a small number of outcomes, then they might not need to correct for multiple hypotheses: we see this in journal articles all the time. However, what counts as “small” is undefined.
For instance, consider this interesting passage from the recent cash benchmarking study in Rwanda by Craig McIntosh and Andrew Zeitlin: “Because we restrict the analysis in this paper to the pre-specified primary and secondary outcomes only, we do not correct the results for multiple inference.” (McIntosh and Zeitlin 2018)
But is this legitimate given that there are three primary and eight secondary outcomes? I’d argue that it is hard to tell without understanding who the intended audience is and how these results will be used. Certainly the assertion that one doesn’t need to correct for multiple inference because all outcomes were pre-specified doesn’t hold water: what if there were 1,000 primary outcomes? And it seems to go against the norms in the academic unconditional cash transfer literature, where multiple hypothesis correction is generally conducted (since in this literature there are many outcomes on the basis of which one could declare the intervention a success; see, for example, Haushofer and Shapiro [2016]).
I reached out to the authors for some more color on this decision, and was fortunate enough to receive this comprehensive reply from Craig:
“The Report to USAID for Gikuriro is a somewhat unusual document, in that the entire benchmarking process was a multi-party exercise involving very different implementers with quite distinct theories of change to agree a priori to what the objectives of the interventions were. We used the PAP as a commitment device to organize what we would prioritize and report on. There were at least three distinct groups with different theories of change and different outcomes that they cared about in that study, and so given that we had economic, health, and social learning camps at the table with distinct causal theories we collectively agreed not to adjust for multiple inference in the writing of the formal report to USAID.
We are now rewriting the report as an academic paper and are doing the Anderson FDR correction within families of outcomes to conform to what we agree to be the current academic consensus. While there is a legitimate argument not to make multiple inference corrections when there are multiple distinct causal theories on the table (as Cyrus Samii has argued) we agree with you that current economics convention is that it is more conservative and therefore correct to do this, so we will be doing so in the (soon to be released) academic paper.”
I felt pretty validated by this response, because it really drives home the point I’ve been trying to make: you have to think carefully about your audience. For the USAID report, the authors had specific decision-makers in mind, so it functions like a decision-focused evaluation. By considering the needs of those decision-makers, the authors were able to conclude that no multiple hypothesis correction was needed. But for the academic paper, they need to consider a more general audience, so they decide to make the multiple inference correction. There are many shades of grey here!
If you find yourself in the latter situation and need to make a multiple inference correction, the next post provides guidance on how to do it.
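In the meantime, for a flavor of what Craig describes: an FDR correction “within families of outcomes” essentially means adjusting the p-values family by family rather than all at once (Anderson’s sharpened q-values are a refinement of the basic Benjamini-Hochberg step-up procedure shown here). The p-values and family labels below are hypothetical.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values grouped into pre-specified families of outcomes
families = {
    "economic": [0.012, 0.240, 0.048],
    "health": [0.003, 0.510],
    "social": [0.090, 0.031, 0.700],
}

# Apply a Benjamini-Hochberg FDR correction within each family
for name, pvals in families.items():
    reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    print(name, [round(q, 3) for q in qvals], list(reject))
```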
References:
Casey, Katherine, Rachel Glennerster, and Edward Miguel. “Reshaping institutions: Evidence on aid impacts using a preanalysis plan.” The Quarterly Journal of Economics 127.4 (2012): 1755–1812.
Haushofer, Johannes, and Jeremy Shapiro. “The short-term impact of unconditional cash transfers to the poor: experimental evidence from Kenya.” The Quarterly Journal of Economics 131.4 (2016): 1973–2042.
McIntosh, Craig, and Andrew Zeitlin. “Benchmarking a child nutrition program against cash: experimental evidence from Rwanda.” Working paper, 2018.
Meyer, Bruce D., and James X. Sullivan. “Measuring the well-being of the poor using income and consumption.” NBER Working Paper No. 9760. National Bureau of Economic Research, 2003.