Two welcome innovations in Liu and Salganik (2019), “Successes and struggles with computational reproducibility: Lessons from the Fragile Families Challenge”
Liu and Salganik (2019) report on the authors’ use of “software containers… and cloud computing” to enable computational reproducibility for 12 papers in a “special issue of Socius about the Fragile Families Challenge” (p. 1).
Overall, they find that
“ensuring computational reproducibility was harder than we had anticipated. There was not one specific thing that caused huge difficulties, but many of the difficulties came from a combination of three inter-related factors: many moving parts, long feedback cycles, and communication difficulty” (p. 15).
The article provides further evidence that ready access to code and data (or a journal policy requiring it) is not sufficient for computational reproducibility.¹ After working closely with authors, Liu and Salganik “completely reproduced the results” of seven of 12 manuscripts (p. 15). This complements Hardwicke et al. (2018), who reproduced results from 22 of 35 articles at the journal Cognition, “but 11 of these required author assistance” (p. 2), and Stodden, Seiler, and Ma (2018), who estimate the reproducibility rate of papers published in Science to be 26%.²
Moreover, there are two things about the Fragile Families Challenge and its reproducibility components that I am especially happy to see.
The first is that the challenge employs “a research design from machine learning called the Common Task Method (Donoho 2017)” (p. 6). If you’ve participated in a Kaggle challenge, this will be familiar. The idea is to have “a common target for predictive modeling, a common error metric (in our case, mean squared error), and a common data set with both training data available to participants and held-out data used for evaluation” (p. 6). What this gets you is a pre-agreed standard for assessing the prima facie validity of a paper’s results.
This does not mean that the paper that minimizes mean squared error in the holdout data is the best paper in the set. But once this common framework is established, models can also be evaluated in light of other desirable characteristics (e.g. parsimony and interpretability). This shared standard for evaluation helps readers disentangle “who answers the question best?” from the meta-question of “is this even the right question to be asking?” which, in social science research, is no small feat.
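To make the Common Task Method concrete, here is a minimal sketch of scoring several submissions against the same held-out outcomes with the same metric. The function names, team names, and numbers are all hypothetical illustrations, not the Challenge's actual evaluation code or data.

```python
from typing import Dict, List, Tuple


def mean_squared_error(y_true: List[float], y_pred: List[float]) -> float:
    """Average squared difference between held-out outcomes and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("predictions and outcomes must have the same length")
    if not y_true:
        raise ValueError("need at least one observation")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rank_submissions(
    holdout: List[float], submissions: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Score every submission on the same held-out data; lowest error first."""
    scores = {
        name: mean_squared_error(holdout, preds)
        for name, preds in submissions.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical held-out outcomes and two hypothetical submissions.
holdout = [1.0, 2.0, 3.0]
submissions = {
    "team_a": [1.0, 2.0, 4.0],
    "team_b": [1.5, 2.5, 3.5],
}
leaderboard = rank_submissions(holdout, submissions)
```

Because every entry is scored on the same data with the same metric, the ranking settles only the narrow question of predictive error. Parsimony and interpretability still have to be weighed separately.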
Second, the authors consider a result to be reproducible if they can regenerate a set of predictions within “our error tolerance,” which they set to the “arbitrary standard” of differing from published results by “less than 10⁻¹⁰” (p. 13). This was effective “because most differences were either much larger or smaller…[W]e hope that community standards develop for when two numbers should be considered the same” (p. 13). So do I, and dearly. One challenge of my verification work for Code Ocean is that I have no idea how to figure out what’s “close enough” to consider a published result substantively reproduced; so I don’t even try. Instead, I verify that results are reproducible in a mechanical sense of ‘do the code, data, and environment reproduce something that looks like a result, consistently, each time a person presses Run?’ But whether results are substantively reproduced is much more important and policy-relevant.
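As a rough illustration of what such a tolerance check could look like, the sketch below compares regenerated predictions to published ones element by element. Only the 10⁻¹⁰ threshold comes from the paper. Comparing by largest absolute difference is my assumption, and the function names and numbers are hypothetical, so the authors' actual comparison may differ.

```python
from typing import List

TOLERANCE = 1e-10  # threshold quoted from the paper (p. 13)


def max_abs_difference(published: List[float], regenerated: List[float]) -> float:
    """Largest element-wise absolute gap between two sets of predictions."""
    if len(published) != len(regenerated):
        raise ValueError("prediction sets must have the same length")
    return max((abs(p - r) for p, r in zip(published, regenerated)), default=0.0)


def is_reproduced(
    published: List[float], regenerated: List[float], tolerance: float = TOLERANCE
) -> bool:
    """Assumed rule: reproduced if every prediction differs by less than tolerance."""
    return max_abs_difference(published, regenerated) < tolerance


# Hypothetical numbers: one run differs only by floating-point noise,
# the other by a substantively visible amount.
published = [0.5, 1.25, 2.0]
rerun_noise = [0.5, 1.25 + 1e-13, 2.0]
rerun_drift = [0.5, 1.25 + 1e-6, 2.0]

noise_ok = is_reproduced(published, rerun_noise)
drift_ok = is_reproduced(published, rerun_drift)
```

In this toy case the first rerun passes and the second fails. That mirrors the authors' observation that most differences fell well to one side of the threshold or the other.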
The ideal workflow I’d like to see is that authors upload, configure, and share the materials they consider reproducible on Code Ocean; we provide technical support as needed; and then subject matter experts weigh in, evaluating capsules based on widely agreed-upon error tolerances. So I was very happy to see this last step in the paper.
For helpful comments on an early draft, I thank Annette Brown, David Liu, Matt Salganik, and Ben Wood.
¹ As the authors note, “this is a relatively low standard; it does not guarantee correctness of the code or the scientific validity of the claims” (p. 7).
² For more estimates of discipline-specific reproducibility rates, see Colberg and Proebsting 2016, Eubank 2016, and Wood, Müller, and Brown 2018.