Blue Mountain supercomputer at Los Alamos National Laboratory, decommissioned 2004. (Public domain)

Why should I believe your supercomputing research?

Supercomputing propels science forward in fields like climate change, precision medicine, and astrophysics [1]. It is now considered a vital part of the scientific endeavor.

The strategy for scientific computing is, in broad strokes, the same as it was at its historical beginnings during WWII, with John von Neumann and his pals at Los Alamos. You start with a mathematical model (trusted to represent some physical phenomenon), transform it into a computable form, and then express the algorithm in computer code that is executed to produce a result. This result is inferred to give information about the original physical system.

In this way, computer simulations originate new claims to scientific knowledge. A question we shouldn’t skirt around was asked by Eric Winsberg in his book “Science in the Age of Computer Simulation” (2010): when do we have evidence that claims to knowledge originating from simulation are justified?

When I was invited to be part of a mini symposium on “The Reliability of Computational Research Findings,” at the 2014 SIAM Conference on Uncertainty Quantification, I drew inspiration from Winsberg to assert that this is the reason we care about reproducibility in computational science: computer simulations create scientific knowledge. [2]

I have been part of the conversation about reproducibility in computational science, and I attempt to conduct my research in a reproducible way. My group makes all research code open source and we publish data, analysis and plotting scripts, and the figures included in our research papers. We do all of this with the aim of facilitating reproducibility of our results. [3]

But what help is this when we’ve used specialized hardware that is not available to interested readers? What does reproducibility even mean, in the context of supercomputing, when producing the results requires computing systems or allocations beyond the reach of peers? Even the very authors of the research may not have access to the same computers to repeat the simulation. How do we trust results from simulations that cannot be repeated for lack of resources?


In many engineering applications of computer simulations, we are used to speaking about verification and validation (often abbreviated V&V). These are well-defined technical terms: verification means confirming that the simulation results match the solutions to the mathematical model; validation means confirming that the simulation represents the physical phenomenon well. In other words, V&V separates the issues of “fitness of the solver” and “fitness of the model,” or solving the equations right vs. solving the right equations.
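
To make the verification half of V&V concrete, here is a minimal sketch in Python of one common check: measuring the observed order of convergence against a known exact solution. The test problem (dy/dt = -y with y(0) = 1), the forward Euler solver, and the function names euler_decay and observed_order are illustrative assumptions, not taken from any particular study.

```python
import math


def euler_decay(n_steps, t_end=1.0):
    """Integrate dy/dt = -y, y(0) = 1, with forward Euler; return y(t_end)."""
    dt = t_end / n_steps
    y = 1.0
    for _ in range(n_steps):
        y += dt * (-y)
    return y


def observed_order(solver, exact, n_coarse=100):
    """Estimate the order of convergence from the errors on two grids.

    The fine grid uses half the step size of the coarse grid, so the
    order is log2(error_coarse / error_fine).
    """
    err_coarse = abs(solver(n_coarse) - exact)
    err_fine = abs(solver(2 * n_coarse) - exact)
    return math.log(err_coarse / err_fine) / math.log(2.0)


# The exact solution at t = 1 is exp(-1).
order = observed_order(euler_decay, math.exp(-1.0))
print(f"observed order of convergence: {order:.3f}")
```

Forward Euler is first-order accurate, so the printed value should be close to 1. An observed order far from the theoretical one hints that the code is not solving the equations right. A check like this says nothing about validation: whether exponential decay is the right model for a given physical system is a separate question.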

If a published computational study reports on completing a careful V&V study, we are likely to trust the results more. But is it enough? Our goal is really to establish that a simulation (embodying the combined influence of mathematical models, numerical methods, and computational workflows) produces reliable results, which we trust can create knowledge.

How do we show evidence that a simulation gives reliable data about the real world?

Thirty years ago, the same issues were being raised about experiments: How do we come to rationally believe in an experimental result? A. Franklin wrote about philosophers’ neglect of experimental science (in particular, physics), and discussed the many strategies that experimental scientists use to provide grounds for rational belief in experimental results. For example, confidence in an instrument increases if we can use it to get results that are expected in a known situation. Or we gain confidence in an experimental result if it can be replicated with a different instrument or apparatus. [4]

The question of whether we have evidence that claims to scientific knowledge stemming from simulation are justified (the epistemology of simulation) is not as clear-cut as V&V. When we compare our results with other simulations, for example simulations that used a different algorithm or a more refined model, this does not fit neatly into V&V, but it does help us believe in our results!

And our work is not done when a simulation completes: next, the data requires interpretation, visualization and analysis. All of these steps are crucial for reproducibility (but are not part of V&V). We usually try to summarize qualitative features of the system under study, and generalize these features to a class of similar phenomena (here we deal with managing uncertainties).

It is all complicated and labor-intensive: it involves physical intuition, often relies on visualizations, and requires judgements that raise value questions.

The new field of uncertainty quantification (UQ) aims to give objective confidence levels for the results of simulations. Its goal is to give mathematical grounds for confidence in the results: in that sense, it is a response to the complicated nature of justifying the use of simulation results to draw conclusions (accept or reject a hypothesis, for example). UQ presupposes verification and informs validation. Verification deals with the errors that occur when converting a continuous mathematical model into a discrete one, and then into computer code. There are known sources of error (truncation, round-off, partial iterative convergence) and unknown sources of error (coding mistakes, instabilities). Uncertainties stem from input data, modelling errors, genuine physical uncertainties, and random processes, so UQ is associated with the validation of a model. It follows that sufficient verification should be done first, before attempting validation. But is this done in practice? Always? What is meant by “sufficient”? [5] Verification provides evidence that the solver is fit for purpose, but this is subject to interpretation: the idea of accuracy is linked to judgements.
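
As a rough illustration of how uncertainty in input data can be pushed through a model, the sketch below uses plain Monte Carlo sampling, one of the simplest UQ techniques. The toy decay model, the normal distribution assumed for the rate parameter, and the names model and propagate are all hypothetical choices made for this example.

```python
import math
import random
import statistics


def model(rate, t=1.0):
    """Toy model (an assumption for illustration): y(t) = exp(-rate * t)."""
    return math.exp(-rate * t)


def propagate(n_samples=10_000, mean_rate=1.0, std_rate=0.1, seed=0):
    """Sample the uncertain input, run the model, summarize the outputs.

    The rate is assumed to follow a normal distribution; the fixed seed
    makes the sampling itself repeatable.
    """
    rng = random.Random(seed)
    outputs = [model(rng.gauss(mean_rate, std_rate)) for _ in range(n_samples)]
    return statistics.mean(outputs), statistics.stdev(outputs)


mean_y, std_y = propagate()
print(f"output mean: {mean_y:.4f}, output standard deviation: {std_y:.4f}")
```

The spread of the outputs is only as meaningful as the assumed input distribution, and it says nothing about discretization or coding errors in the model itself, which is one reason verification has to come first.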

Many articles discussing reproducibility in computational science place emphasis on the importance of code and data availability. But making code and data open and publicly available is not enough. Providing evidence that results from simulation are reliable, and so building confidence in the science, requires solid V&V expertise and practice, reproducible-science practices, and careful reporting of our uncertainties and our judgements.


Supercomputing research should be executed using reproducible practices, taking good care of documentation and reporting standards, including appropriate use of statistics, and providing any research objects needed to facilitate follow-on studies. Even if the specialized computing system used in the research is not available to peers, conducting the research as if it will be reproduced increases trust and helps justify the new claims to knowledge. [6]

Code availability by itself does not ensure that the research is reproducible in supercomputing scenarios. Computational experiments often involve deploying precise and complex software stacks, with several layers of dependencies. Multiple details must be taken care of when compiling, setting up the computational environment, and choosing runtime options.

Thus, making the source code available (with a detailed mathematical description of the algorithm) is a minimum prerequisite for reproducibility: necessary but not sufficient. We also require a detailed description and/or provision of:

  • dependencies
  • environment
  • automated build process
  • running scripts
  • post-processing scripts
  • secondary data used to generate the published figures
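
Some of the items in this list can be captured automatically. The sketch below, in Python, records the interpreter, operating system and versions of selected packages as a JSON document that could be archived next to the results. The function name capture_environment and the package names passed to it are hypothetical. A real supercomputing workflow would also need to record compilers, MPI libraries, loaded modules and job-scheduler options, which this sketch does not attempt.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages=()):
    """Collect basic provenance: interpreter, OS, and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


# The package names below are placeholders; list whatever the study uses.
record = capture_environment(packages=["numpy", "pip"])
print(json.dumps(record, indent=2))
```

Saving a record like this with every run costs almost nothing, and it gives a starting point for tracking down discrepancies when someone else's environment differs from ours.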

Not only does this practice facilitate follow-on studies, removing roadblocks to building on our work; it also makes it possible to get at the root of discrepancies if and when another researcher attempts a full replication of our study. Quoting Donoho et al. (2009):

“The only way we’d ever get to the bottom of such a discrepancy is if we both worked reproducibly and studied detailed differences between code and data.”

How far are we from achieving this practice as standard? A recent study surveyed a sample (admittedly small) of papers submitted to a supercomputing conference: only 30% of the papers provide a link to the source code, only 40% mention the compilation process, and only 30% mention the steps taken to analyze the data. [7]

We have a long way to go.


Prof. Lorena Barba is currently a visiting scholar at the Berkeley Institute of Data Science, with a focus on Reproducibility and Open Science, and Education.


Notes

[1] See NCSA’s news release on the recent discovery of gravitational waves, and the feature in HPC Wire.

[2] Slides and video of my talk are available online. This post is based on that talk, with some expanded points.

[3] See my “Reproducibility PI Manifesto” (2012) on Figshare, https://dx.doi.org/10.6084/m9.figshare.104539.v1

[4] A. Franklin, “The Neglect of Experiment” (1986) http://catdir.loc.gov/catdir/samples/cam031/86002604.pdf

[5] See Ian Hawke’s post “Close Enough,” http://ianhawke.github.io/blog/close-enough.html

[6] “…if everyone on a research team knows that everything they do is going to someday be published for reproducibility, they’ll behave differently from day one.” Donoho, D.L., Maleki, A., Rahman, I.U., Shahram, M., and Stodden, V. (2009), “Reproducible Research in Computational Harmonic Analysis,” Computing in Science and Engineering, Vol. 11 (1):8–18, http://dx.doi.org/10.1109/MCSE.2009.15

[7] Carpen-Amarie, A., Rougier, A., and Lübbe, F. (2014), “Stepping Stones to Reproducible Research: A Study of Current Practices in Parallel Computing,” in EuroPar 2014, L. Lopes et al. (eds.), LNCS 8805:499–510, Springer, Switzerland.
