Published in pangeo

Closed Platforms vs. Open Architectures for Cloud-Native Earth System Analytics

By Ryan Abernathey & Joe Hamman

  • The data we want to work with are huge (typical analyses involve several TB at least)
  • The data we need are produced and distributed by many different organizations (NASA, NOAA, ESGF, Copernicus, etc.)
  • We want to apply a wide range of different analysis methodologies to the data, from simple statistics to signal processing to machine learning.
Download-based workflow. From Abernathey, Ryan (2020): Data Access Modes in Science. figshare. Figure. https://doi.org/10.6084/m9.figshare.11987466.v1

Closed Platforms

First let’s enumerate some examples of closed platforms and note what they have in common.

Google Earth Engine

Google Earth Engine (GEE) was the first well-known platform for Big Data earth system science.

Descartes Labs Platform

A new kid on the block is the startup Descartes Labs.

Copernicus Climate Data Store

One ambitious platform by a non-commercial entity is ECMWF’s Copernicus Climate Data Store.

CDS Live Monitoring: https://cds.climate.copernicus.eu/live/

Open Architectures

Open Architecture for scalable cloud-based data analytics. From Abernathey, Ryan (2020): Data Access Modes in Science. figshare. Figure. https://doi.org/10.6084/m9.figshare.11987466.v1
  • A new approach to data sharing, focused on object storage rather than file downloads
  • Scalable, data-proximate computing, as found in cloud platforms
  • High-level analysis tools which allow scientists to focus on science rather than low-level data manipulation steps
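
To make the first two ingredients concrete, here is a minimal sketch of the idea, not a real implementation. It assumes a toy in-memory dictionary standing in for a cloud object store, and the names `object_store`, `put_chunked`, `partial_sum`, and `chunked_mean` are hypothetical. The data are split into independently addressable chunks, each worker fetches only the chunk it needs, and the partial results are combined at the end.

```python
import struct
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for a cloud object store: keys map to immutable byte blobs.
# (Assumption for illustration; a real store would be reached over HTTP.)
object_store = {}


def put_chunked(name, values, chunk_size):
    """Split a 1-D sequence of floats into chunks and store one object per chunk."""
    n_chunks = 0
    for i in range(0, len(values), chunk_size):
        chunk = values[i:i + chunk_size]
        # Each chunk is packed as little-endian 64-bit floats.
        object_store[f"{name}/{n_chunks}"] = struct.pack(f"<{len(chunk)}d", *chunk)
        n_chunks += 1
    return n_chunks


def partial_sum(key):
    """Fetch a single chunk by key and reduce it to (sum, count)."""
    blob = object_store[key]
    values = struct.unpack(f"<{len(blob) // 8}d", blob)
    return sum(values), len(values)


def chunked_mean(name, n_chunks, workers=4):
    """Map partial_sum over all chunks in parallel, then combine the results."""
    keys = [f"{name}/{i}" for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(partial_sum, keys))
    total = sum(s for s, _ in parts)
    count = sum(n for _, n in parts)
    return total / count


n = put_chunked("temperature", [float(v) for v in range(10)], chunk_size=4)
mean = chunked_mean("temperature", n)
print(n, mean)
```

No worker ever needs the whole dataset, which is why this pattern can scale to data far larger than any single machine's memory. The third ingredient, high-level tools, is about hiding this bookkeeping from the scientist.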

OPeNDAP / THREDDS

In geoscience, we have had an excellent remote-data-access protocol for a long time: the “Open-source Project for a Network Data Access Protocol” or OPeNDAP.
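
As a rough illustration of what remote access means here, the sketch below builds a DAP2-style request URL that asks the server for a subset of one variable rather than the whole file. The helper `dap2_subset_url`, the server address, and the variable name `sst` are all hypothetical; only the `[start:stride:stop]` hyperslab syntax and the `.ascii` response suffix come from the DAP2 protocol.

```python
def dap2_subset_url(base_url, variable, index_ranges, response="ascii"):
    """Build a DAP2 request URL selecting a hyperslab of one variable.

    index_ranges holds one (start, stride, stop) tuple per dimension.
    DAP2 stop indices are inclusive.
    """
    hyperslab = "".join(
        f"[{start}:{stride}:{stop}]" for start, stride, stop in index_ranges
    )
    return f"{base_url}.{response}?{variable}{hyperslab}"


# Hypothetical dataset URL; no request is actually sent in this sketch.
url = dap2_subset_url(
    "https://example.org/thredds/dodsC/hypothetical/sst.nc",
    "sst",
    [(0, 1, 0), (10, 2, 20), (100, 1, 149)],
)
print(url)
```

The server evaluates the constraint and returns only the requested values, so the client never downloads the full file.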

ESGF Architecture Diagram. From the 2017 ESGF Brochure.

COG / STAC

As the cloud has emerged as a powerful way to store and process large collections of data, the geospatial imagery community has pioneered a new class of cloud-native geospatial processing tools and data formats. Much of the success in this area can be attributed to the development of a new data storage format, the “Cloud Optimized GeoTIFF”.
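
What makes a GeoTIFF "cloud optimized" is, roughly, that the image is stored as internal tiles whose byte offsets are listed up front, so a client can fetch a single tile with an HTTP range request. The sketch below shows that arithmetic only. The tile offsets and byte counts are made-up numbers, the helpers `tile_index` and `range_header` are hypothetical, and a real client would read those values from the file's TIFF tags (`TileOffsets` and `TileByteCounts`).

```python
def tile_index(row, col, tile_size, image_width):
    """Return the index of the tile containing pixel (row, col), in row-major tile order."""
    tiles_across = -(-image_width // tile_size)  # ceiling division
    return (row // tile_size) * tiles_across + (col // tile_size)


def range_header(tile_offsets, tile_byte_counts, index):
    """Build the HTTP Range header that fetches exactly one tile."""
    start = tile_offsets[index]
    end = start + tile_byte_counts[index] - 1  # HTTP byte ranges are inclusive
    return {"Range": f"bytes={start}-{end}"}


# Made-up layout for a 1024 x 1024 image with 512 x 512 tiles (4 tiles in total).
offsets = [4096, 70000, 140000, 200000]
byte_counts = [65000, 69000, 58000, 61000]

idx = tile_index(row=600, col=100, tile_size=512, image_width=1024)
headers = range_header(offsets, byte_counts, idx)
print(idx, headers)
```

Only the bytes of the tile covering the pixel of interest cross the network, which is what lets many clients read small windows of a very large image directly from object storage.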

Pangeo

Pangeo represents our best attempt to implement a cloud-native open architecture solution for climate science and related fields. The key technological elements of Pangeo on the cloud are:

Xarray Dataset. Credit Stephan Hoyer.
  • Xarray — A high-level data model and API for loading, transforming, and performing calculations on multi-dimensional arrays. Datasets in Pangeo (and Xarray) tend to conform to the CF metadata Conventions.
  • A distributed parallel computing framework — Dask — which enables scientists to scale out the computations to huge datasets with minimal changes to their analysis code.
  • A storage format optimized for high throughput distributed reads on multi-dimensional arrays: Zarr. Zarr works well on both traditional filesystem storage and on Cloud Object Storage.
  • Intake — a Python library which helps users navigate data catalogs and quickly load data without getting lost in the details.
  • Jupyter — the interactive computing framework which allows users to interactively control a remote computing kernel, running in a container in the cloud, using their browser.
Pangeo Architecture. From Pangeo NSF Earthcube Proposal (2017), doi:10.6084/m9.figshare.5361094.v1.
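
To give a feel for how these pieces fit together, here is a minimal sketch that uses a small synthetic dataset in place of a real cloud-hosted one. The variable name `tas`, the grid, and the values are invented for illustration. In an actual Pangeo deployment the dataset would more likely be opened from a Zarr store (for example with `xarray.open_zarr`) or through an Intake catalog, and the computation would run on a Dask cluster rather than locally.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a cloud dataset: two years of monthly values on a tiny grid.
time = pd.date_range("2000-01-01", periods=24, freq="MS")
data = np.arange(24 * 3 * 4, dtype="float64").reshape(24, 3, 4)
ds = xr.Dataset(
    {"tas": (("time", "lat", "lon"), data, {"units": "K"})},
    coords={
        "time": time,
        "lat": [-10.0, 0.0, 10.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
)

# Chunking along time backs the variable with a lazy Dask array (requires dask).
ds = ds.chunk({"time": 12})

# High-level, label-based analysis: a monthly climatology.
# Nothing is computed until .compute() is called.
climatology = ds["tas"].groupby("time.month").mean("time")
result = climatology.compute()
print(result.sizes)
```

The analysis lines do not mention chunks, files, or workers. That separation between what to compute and how it is executed is what allows the same code to move from a laptop-sized test to a much larger dataset.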

Conclusions and Outlook

Closed platforms, such as Google Earth Engine or Descartes, offer the research community an exciting template for how cloud-native Earth System Science could work — no tedious downloads or frustrating data-preparation steps; comprehensive and user-friendly catalogs of relevant datasets; scalable, on-demand processing to quickly burn down Terabytes or Petabytes of data. However, it seems unlikely that the closed platforms can meet the needs of every Earth System scientist, due to their necessarily narrow scope. Furthermore, since industry, rather than academic science, is the main customer for these closed platforms, academic scientists will continue to depend on free credits — this doesn’t feel sustainable or scalable. This isn’t to say that the closed platforms can’t be very valuable for some scientists — just that we can’t expect to rely on them to meet all of our data processing needs across the entire field.

Acknowledgements

This work was supported by NSF award 1740648.
