Computer systems challenge: how to share innovations faster and more broadly

Computer systems are all around us, forming part of the almost-invisible infrastructure that supports us at work and at play. Can we sustain the fast pace of innovation in computer systems and, to boot, make our discoveries more reproducible and replicable?

Some definitions first to keep us on track. Reproducibility means rerunning the analyses of an earlier study — typically by a different set of researchers — using the same data and methods, to try to obtain the same results as the original study. Replicability, on the other hand, means attempting to reach similar findings to those of an earlier study using different data and analyses.

Why should you care?

Both of these principles are important, with reproducibility often being a precondition for achieving replicability. As members of the tax-paying public, we should expect that some of the research will translate into practice. Such practical innovations will lead to progress in wireless and cellular technology, make our data and voting systems more secure, and provide computing to help fight the big global challenges — predicting and countering pandemics, improving remote work … name your favorite one here.

As a supercharged example in this bizarre time, consider the overflow of research news on proximity or contact tracing to stem the spread of COVID-19. By using GPS data from cell phones, signal-strength data from Bluetooth devices, or other information technology, we can tell where a person has been and identify others with whom that person has been in close proximity. But before our policymakers rolled out large-scale interventions using such technology, wouldn’t it have helped to know which lab studies were replicable, and which of the shiny technologies really worked to both keep us safe and keep our data private? And wouldn’t it have been nice to have this research completed in double-quick time, enabling us to move out of our bunkers sooner and safely?
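To make the Bluetooth approach concrete, here is a minimal sketch of how a phone might flag a "close contact" from received signal strength (RSSI). It uses the standard log-distance path-loss model; the constants (reference power at 1 meter, path-loss exponent) and function names are illustrative assumptions, not values from any deployed contact-tracing system — which is exactly why replication studies of such systems matter.

```python
# Hedged sketch: distance estimation from Bluetooth RSSI using the
# log-distance path-loss model. All constants here are assumptions
# for illustration, not parameters of a real contact-tracing app.

def estimate_distance_m(rssi_dbm, ref_rssi_dbm=-60.0, path_loss_exp=2.0):
    """Invert the model rssi = ref_rssi - 10 * n * log10(d) to get d in meters."""
    return 10 ** ((ref_rssi_dbm - rssi_dbm) / (10.0 * path_loss_exp))

def is_close_contact(rssi_dbm, threshold_m=2.0):
    """Flag a reading as a possible close contact (within `threshold_m` meters)."""
    return estimate_distance_m(rssi_dbm) <= threshold_m

print(estimate_distance_m(-60.0))  # 1.0 meter at the reference RSSI
print(is_close_contact(-58.0))     # stronger signal -> closer -> True
print(is_close_contact(-80.0))     # weaker signal -> ~10 m away -> False
```

In practice, RSSI is noisy — affected by walls, pockets, and device orientation — which is precisely the kind of real-world variability that replication studies of these technologies need to probe.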

Another, longer-term issue is: How can research and development in computer science, most of which in the United States is federally funded, be more effectively channeled? We would expect that a taxpayer would want to see more of her dollars go toward innovations that make a difference outside of the lab, rather than toward re-designing and re-creating software and hardware that have been declared as grand successes by their original inventors. Thus, it makes sense for us as researchers to truly stand on the shoulders of giants, or even regularly sized predecessors, and it makes sense for us as members of the wider society to ask for that.

Evidence of the issue

There is growing evidence that reproducibility and replicability initiatives are not keeping pace with the jaw-dropping innovations in computing systems — and interest extends beyond the computer systems community.

Concerns over this trend prompted Congress to include Section 116 in the American Innovation and Competitiveness Act of 2017. The act directed the National Science Foundation (NSF) to engage the National Academies of Sciences, Engineering, and Medicine (NASEM) in a study to assess reproducibility and replicability in scientific and engineering research, as well as to provide findings and recommendations for making that research more rigorous and transparent. The report, released in 2019 after a series of intense discussions among a wide-ranging group of distinguished members, makes for illuminating reading.

A look at papers from top systems conferences sheds light on the challenge of promoting the public release of high-quality, highly usable research data. In computer systems, the premier conference on dependability, the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), celebrated its 50th year in 2020. In the last five years, only nine of the 278 papers presented at DSN released their datasets in a usable format. In the broader systems community, papers presented at two premier conferences — the Usenix Symposium on Networked Systems Design and Implementation (NSDI) and the Usenix Symposium on Operating Systems Design and Implementation (OSDI) — tell a similar story. In the most recent offerings of NSDI (2020) and OSDI (2018), only three out of 65 and five out of 47 papers, respectively, released their data publicly.

NSF and Purdue efforts

The federal funding agencies have been taking notice and seeding several activities, at the level of individual projects and more systemically, to improve the state of affairs of reproducibility of innovations in computer systems. The NSF laid out this goal crisply and cogently in 2016 in a “Dear Colleague Letter” titled “Encouraging Reproducibility in Computing and Communications Research.” This letter was followed by several grants and some encouraging deliverables from these grants in the form of principles, techniques, and tools to improve the situation.

At Purdue, we have created a public open source repository, called FRESCO, with system usage and failure data from the supercomputing resources at Purdue, the University of Illinois at Urbana-Champaign and the University of Texas at Austin. FRESCO has been funded by the NSF — initially through the Computing Research Infrastructure (CRI) Program (award number 151397, 2015–19) and now through the next-generation CISE Community Research Infrastructure (CCRI) Program (award number 2016704, 2020–2023). This work is done jointly with Purdue ITaP personnel, Xiaohui Carol Song and Rajesh Kalyanam.

Purdue’s Conte community cluster, which was one of the supercomputers studied by the researchers.

In science and engineering research, large-scale, centrally managed computing clusters, or “supercomputers,” have been instrumental in enabling the kinds of resource-intensive simulations, analyses, and visualizations that have advanced computer-aided drug discovery, high-strength-materials design for cars and jet engines, and disease vector analysis, to name just a few areas of innovation. Such clusters are complex systems comprising several hundred to several thousand computer servers with fast network connections among them, various data storage resources, and highly optimized scientific software shared among several hundred researchers from diverse domains. Consequently, the overall dependability of such systems rests on the dependability of these individual, highly interconnected elements, as well as on the characteristics of cascading failures among them. Hence, data from such clusters forms a solid foundation on which to build computer systems innovations.

FRESCO and Monet, the two resources jointly created through NSF awards, are the most comprehensive (in terms of size and richness of data items) and most recently established open source data repositories of their kind. They also provide simple analytics scripts for answering questions like “What is the mean time to failure (MTTF)?” and “What is the mean time to repair (MTTR)?”
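To show the flavor of such analytics, here is a minimal sketch of computing MTTF and MTTR from an outage log. The event format (chronological pairs of failure-start and repair-end timestamps, in hours) is a simplifying assumption for illustration, not FRESCO’s actual schema or scripts.

```python
# Hedged sketch: MTTF/MTTR over a cluster outage log. The input format
# is an assumption for illustration, not FRESCO's real data schema.

def mttf_mttr(outages):
    """Compute (MTTF, MTTR) from a chronological list of (fail_t, repair_t)
    pairs. Uptime intervals run from t=0 to the first failure, and from
    each repair to the next failure; downtime is each repair minus failure."""
    downtimes = [repair_t - fail_t for fail_t, repair_t in outages]
    uptimes = []
    prev_up_start = 0.0
    for fail_t, repair_t in outages:
        uptimes.append(fail_t - prev_up_start)
        prev_up_start = repair_t
    mttf = sum(uptimes) / len(uptimes)
    mttr = sum(downtimes) / len(downtimes)
    return mttf, mttr

# Two outages: hours 100-104 and hours 250-252.
print(mttf_mttr([(100.0, 104.0), (250.0, 252.0)]))  # (123.0, 3.0)
```

With a shared, well-documented dataset, any researcher can recompute such statistics and check them against published numbers — the essence of reproducibility.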

Looking ahead

The move toward greater reproducibility and replicability — and resulting faster, broader innovation — requires persistence at multiple levels.

· We as a technical community need to keep pushing on our initiatives, which involves continuing to incentivize our researchers. Community awards for software and data artifacts at our premier conferences are a very promising step, as are research awards being granted by our funding agencies to incorporate these principles in our project activities.

· We as the general tax-paying public need to step up and ask our policymakers to take further action to spur public data sharing to enable progress.

· We as media community members need to reflect on whether our computer technology stories are one-offs or illuminate reproducible and replicable discoveries.

· We as policymakers and program officers of funding agencies need to stimulate and celebrate efforts by researchers and practitioners to develop reproducible and replicable innovations.

Only with such concerted efforts will we be able to make our computer systems innovations more quickly and broadly available to the world. The current inflection point, brought on by the upheaval of the pandemic, clearly highlights the need for such advances.

Saurabh Bagchi, PhD

Professor of Electrical and Computer Engineering and Computer Science

Director of Center for Resilient Infrastructures, Systems, and Processes (CRISP)

Purdue University

Related Links

ECE, ITaP research computing team studies supercomputer reliability

Dependable Computing Systems Laboratory (DCSL)

Center for Resilient Infrastructures, Systems, and Processes (CRISP)

Google fixes smartwatch security problem discovered by Purdue researchers

Wearables in healthcare: are they reliable and secure?
