The Case for Cloud in Science

Three new papers lay out the opportunities and challenges

Ryan Abernathey
pangeo
5 min read · May 6, 2021


Pangeo got involved in cloud computing almost by accident. In our original 2017 NSF proposal, the program manager asked us to trim our budget. So we removed the servers we had planned to buy and instead asked to be included in the NSF BIGDATA program (now defunct), which provided direct grants of credits from the big cloud providers. What followed was a period of intense experimentation, building things, breaking things, and growing a community around cloud-native geoscience research.

Four years later, we are convinced that cloud computing holds the power to transform scientific research and data science education in fundamental, positive ways. The central argument is that, rather than having thousands of researchers working in isolated data and computing silos (local infrastructure), we can work together inside a shared infrastructure, using cloud-based data and data-proximate cloud computing. Working this way will enhance productivity, collaboration, reproducibility, and access and inclusion, while opening the door to ambitious new data-intensive research questions. Many challenges must be resolved to realize this vision, in particular: how do we keep our cloud infrastructure open, modular, and free of lock-in to a specific provider? But we are confident that the scientific community, working with funding agencies and cloud providers, can face and overcome these challenges.

We have recently published three papers that lay out this vision, and the associated challenges, in detail. The point of this blog post is to highlight these papers and show how they fit together as parts of a broader whole.

The first paper is a broad overview of the transformation taking place in scientific infrastructure:

Science Storms the Cloud

Gentemann, C. L., Holdgraf, C., Abernathey, R., Crichton, D., Colliander, J., Kearns, E. J., Panda, Y., and Signell, R. (2021). “Science Storms the Cloud”. AGU Advances, 2, e2020AV000354. https://doi.org/10.1029/2020AV000354

The core tools of science (data, software, and computers) are undergoing a rapid and historic evolution, changing what questions scientists ask and how they find answers. Earth science data are being transformed into new formats optimized for cloud storage that enable rapid analysis of multi‐petabyte data sets. Data sets are moving from archive centers to vast cloud data storage, adjacent to massive server farms. Open source cloud‐based data science platforms, accessed through a web‐browser window, are enabling advanced, collaborative, interdisciplinary science to be performed wherever scientists can connect to the internet. Specialized software and hardware for machine learning and artificial intelligence are being integrated into data science platforms, making them more accessible to average scientists. Increasing amounts of data and computational power in the cloud are unlocking new approaches for data‐driven discovery. For the first time, it is truly feasible for scientists to bring their analysis to data in the cloud without specialized cloud computing knowledge. This shift in paradigm has the potential to lower the threshold for entry, expand the science community, and increase opportunities for collaboration while promoting scientific innovation, transparency, and reproducibility. Yet, we have all witnessed promising new tools which seem harmless and beneficial at the outset become damaging or limiting. What do we need to consider as this new way of doing science is evolving?

Fig. 1 in Gentemann et al. (2021): Science is changing as data, software, and computers are coming together on the cloud. Scientists can access massive cloud computing resources through a web browser window, effectively putting a super‐computer into any internet‐connected device.

The next paper focuses on the changing nature of data repositories in the cloud / big-data era:

Cloud-Native Repositories for Big Scientific Data

Abernathey, R. P., Augspurger, T., Banihirwe, A., Blackmon-Luca, C. C., Crone, T. J., Gentemann, C. L., Hamman, J. J., Henderson, N., Lepore, C., McCaie, T. A., Robinson, N. H., and Signell, R. P. (2021). “Cloud-Native Repositories for Big Scientific Data”. Computing in Science & Engineering, 23(2), 26–35. https://doi.org/10.1109/MCSE.2021.3059437

Scientific data have traditionally been distributed via downloads from data server to local computer. This way of working suffers from limitations as scientific datasets grow toward the petabyte scale. A “cloud-native data repository,” as defined in this article, offers several advantages over traditional data repositories: performance, reliability, cost-effectiveness, collaboration, reproducibility, creativity, downstream impacts, and access and inclusion. These objectives motivate a set of best practices for cloud-native data repositories: analysis-ready, cloud-optimized (ARCO) data formats and loose coupling with data-proximate computing. The Pangeo Project has developed a prototype implementation of these principles using open-source scientific Python tools. By providing an ARCO data catalog together with on-demand, scalable distributed computing, Pangeo enables users to process big data at rates exceeding 10 GB/s. Several challenges must be resolved in order to realize cloud computing’s full potential for scientific research, such as organizing funding, training users, and enforcing data privacy requirements.
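To make the ARCO idea concrete, here is a minimal sketch of producing a cloud-optimized Zarr store from a conventional NetCDF file with Xarray, one of the open-source Python tools the paper builds on. The file name, chunk sizes, and output path are hypothetical; writing directly to object storage would additionally require a filesystem library such as gcsfs or s3fs.

```python
# Minimal sketch (hypothetical paths and chunk sizes): converting a
# conventional NetCDF file into an analysis-ready, cloud-optimized (ARCO)
# Zarr store that can be read efficiently over the network.
import xarray as xr

# Open the source dataset lazily (as Dask arrays, no data loaded yet).
ds = xr.open_dataset("sea_surface_temperature.nc", chunks={})

# Rechunk to sizes suited to the expected access pattern; chunking is the
# main knob that makes the store "cloud-optimized".
ds = ds.chunk({"time": 120, "lat": 180, "lon": 360})

# Write a Zarr store with consolidated metadata, so that clients can later
# read all of the metadata in a single request.
ds.to_zarr("sea_surface_temperature.zarr", mode="w", consolidated=True)
```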

Fig. 2 from Abernathey et al. (2021): Pangeo architecture diagram. The data repository is hosted in cloud object storage (left), in the Zarr format. Compute nodes inside a Kubernetes cluster (right) fetch data and metadata from the object store. Users connect to the system via Jupyter and write interactive data analysis code in Xarray, which dispatches computations on an adaptively scaling Dask cluster.
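And here is a minimal sketch of the analysis side of that diagram, assuming an ARCO Zarr store already exists in object storage. The bucket URL and variable name are hypothetical, and reading from Google Cloud Storage assumes gcsfs is installed; on a Pangeo deployment the Dask cluster would scale adaptively inside Kubernetes rather than run on local cores as it does here.

```python
# Minimal sketch of data-proximate analysis with the stack in Fig. 2:
# Xarray for the analysis code, Dask for scalable computation, Zarr for
# cloud-optimized storage. Bucket path and variable name are hypothetical.
import xarray as xr
from dask.distributed import Client

# Start a Dask cluster. On a Pangeo deployment this would be an adaptively
# scaling cluster inside Kubernetes; locally it uses your own cores.
client = Client()

# Open the store lazily: only consolidated metadata is fetched here, so
# even a multi-petabyte dataset opens in seconds.
ds = xr.open_zarr("gs://example-bucket/sst.zarr", consolidated=True)

# Express a computation (a time-mean map) and execute it; Dask workers
# fetch only the chunks they need directly from object storage.
sst_climatology = ds["sst"].mean(dim="time").compute()
```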

The final paper is a technical report commissioned by the European Commission focusing on the practical aspects of implementing a cloud-based data system:

Opening new horizons: How to migrate the Copernicus Global Land Service to a Cloud environment

Abernathey, R., Neteler, M., Amici, A., Jacob, A., Cherlet, M., and Strobl, P. (2021). Opening new horizons: How to migrate the Copernicus Global Land Service to a Cloud environment. EUR 30554 EN, Publications Office of the European Union, Luxembourg. http://dx.doi.org/10.2760/668980, JRC122454.

The Copernicus programme, the EU flagship on Earth Observation, routinely provides a variety of exciting new user-oriented products that constantly improve the monitoring of our planetary environment, its climate, and the anthropogenic use of and impact on it. Over the last decade this has resulted in an incrementally growing amount of data and products. The Global Land component of the Land service has been generating many such core variables at global scale and with high time frequency. Until now, product-specific and rather unharmonized processing chains have been used. Building on this experience, we know that combining and integrating production chains into an overarching architecture can lead not only to more harmonized, time- and cost-efficient product generation, but also to an improved and integrated use of such data and products. This in turn facilitates the conversion of space-based Earth Observation information into actionable knowledge for a better response to the complex global change processes we are currently dealing with. Technological advances happen quickly, and with cloud infrastructures we now have unprecedented means to make such deep integration possible. However, transforming an established operational setup, such as the one developed and used for the Global Land Service over the years, to a completely new and technologically challenging cloud computing environment is not a trivial job, especially considering that many production chains need to be decomposed into modular pieces which then have to be newly forged into a smooth and fully integrated machinery that provides the user with a transparent, yet integrated, set of tools. The scope of this report is to tackle exactly this: providing clear suggestions for an efficient ‘cloudification’ of the Copernicus Global Land production lines and user interfaces, and investigating whether there is a tangible benefit and what effort would be involved.

We are eager to get feedback on these ideas and hear how they resonate with different science communities. Feel free to leave us comments here or follow up via the corresponding authors listed on each paper!
