Closed Platforms vs. Open Architectures for Cloud-Native Earth System Analytics
By Ryan Abernathey & Joe Hamman
Anyone working with large-scale Earth System data today faces the same general problems:
- The data we want to work with are huge (typical analyses involve several TB at least)
- The data we need are produced and distributed by many different organizations (NASA, NOAA, ESGF, Copernicus, etc.)
- We want to apply a wide range of different analysis methodologies to the data, from simple statistics to signal processing to machine learning.
The community is waking up to the idea that we can’t simply expect scientists to download all this data to their personal computers for processing.
“The cloud” — here defined broadly as any internet-accessible system which provides on-demand computing and distributed mass storage — is an obvious way to confront this situation. However, two distinct categories of cloud-based solution have emerged in the geoscience space. Closed platforms aim to bring all the data into a central location and provide tools for users to perform analysis on the data. Open architectures assume data will be distributed and seek interoperability between different data catalogs and computational tools. In this post, we’ll review some of the different solutions in each space.
Closed Platforms
First let’s enumerate some examples of closed platforms and note what they have in common.
Google Earth Engine
Google Earth Engine (GEE) was the first well-known platform for Big Data earth system science.
In their own words
Google Earth Engine combines a multi-petabyte catalog of satellite imagery and geospatial datasets with planetary-scale analysis capabilities and makes it available for scientists, researchers, and developers to detect changes, map trends, and quantify differences on the Earth’s surface.
Google has invested their vast engineering expertise in building this tool. It is truly an amazing and impressive platform, and has led to many important scientific discoveries. When we first started thinking about Big Earth Data in 2013, we learned about GEE and immediately assumed this would just be the future: we’ll all migrate to GEE. That clearly didn’t happen. While GEE is clearly a valuable tool for many scientists, the fact remains that the platform is closed — it’s not open source. It’s controlled by Google. They decide what it can and cannot do, which datasets are available, how resources are allocated, etc. (And they have certainly been very generous towards the scientific community.) But GEE is fundamentally a platform for satellite imagery. There are many types of Earth System data and analysis methods that are simply not possible in GEE, and there is no obvious way for the scientific community to change this fact. You also can’t install GEE on your own servers.
As closed platforms go, it’s hard to beat GEE. But others are trying.
Descartes Labs Platform
A new kid on the block is the startup Descartes Labs.
The Descartes Labs Platform is the missing link in making real world sensor data — from satellite imagery to weather — useful. We collect data daily from public and commercial sources, clean it, calibrate it, and store it in an easy-to-access catalog, ready for scientific analysis. We also give our customers the ability to throw huge amounts of computational power at that data and tools built on top of the data to make it easier to build models.
Sounds great! The Descartes CEO Mark Johnson has made a convincing argument that “every business needs a data refinery.”
Scientists would also like a data refinery. In fact, we would argue that most scientific research groups operate as a mini data refinery, with the grad students and postdocs doing the dirty work of slurping up nasty, crude data from wherever it lives and consolidating it into usable, analysis-ready data for the benefit of themselves and their research collaborators (~10 other people).
From my reading of their documents, it appears that the Descartes platform consists of a multi-petabyte catalog of remote-sensing data, a Python API to access and process this data, a parallel computing framework to scale out analysis, a JupyterLab front-end, and some custom visualization tools. A major selling point is that Descartes is a native Python application. Following the GEE model, Descartes are opening their platform to researchers.
This may look like a great opportunity to some academics. While we haven’t yet played around with Descartes, we have no doubt it is a powerful, well-engineered system that puts many valuable remote-sensing datasets at your fingertips. We encourage readers to try out their Impact Science program if it seems like a good fit for your research. However, it seems unlikely to me that this program could accommodate every scientist who might want to use it. Furthermore, like GEE, it’s impossible to run the Descartes platform on your own hardware or your own cloud account.
Copernicus Climate Data Store
One ambitious platform by a non-commercial entity is ECMWF’s Copernicus Climate Data Store.
The Climate Data Store infrastructure aims to provide a “one stop shop” for users to discover and process the data and products that are provided through the distributed data repositories. The CDS also provides a comprehensive set of software (the CDS Toolbox) which enables users to develop custom-made applications. The applications will make use of the content of the CDS to analyse, monitor and predict the evolution of both climate drivers and impacts. To this end, the CDS includes a set of climate indicators tailored to sectoral applications, such as energy, water management, tourism, etc. — the Sectoral Information System (SIS) component of the Copernicus Climate Change Service (C3S). The aim of the service is to accommodate the needs of a highly diverse set of users, including policy-makers, businesses and scientists.
So like GEE and Descartes, the CDS has both data and processing, all provided and controlled by a single entity. The data itself are amazing. The ERA5 reanalysis is widely regarded as the most accurate and comprehensive picture of historical weather. But those who wish to use the platform have to use the CDS Toolbox, whose functionality is somewhat limited. Because the CDS Toolbox is not a community open-source project, there is no clear way for a motivated user to enhance its capabilities, beyond opening a support ticket.
Unlike those commercial platforms, the CDS Toolbox has no support for parallel, distributed processing of data. As a result, many advanced users must download the data to their local computing systems. Indeed, “downloading ERA5” is a very common and time-consuming activity for graduate students studying meteorology and climate impacts. Another major role of the CDS is to support these downloads, via a bespoke API.
It’s amazing to have all the PB of data available to the world for free. But if you’ve ever tried this download service, you have probably realized that it can be very slow. This is because a single system has to serve many users simultaneously, with finite hardware and bandwidth. We recently learned on Twitter that the CDS data is backed by a tape storage archive, which explains why some data requests can take many hours or days.
Copernicus clearly sees the large volume of data delivered via the CDS as a metric of its success:
We sincerely applaud Copernicus for its commitment to providing open data and free computing. But despite its popularity (which is ultimately due to the quality of the data itself; CDS is the only official way to get ERA5), we see some real limitations in the CDS model. Like GEE and Copernicus, the CDS aims to be a “one stop shop.” But it is not a cloud-based platform — it’s backed by a fixed and finite set of computers, with all hosting costs borne by the provider. Consequently, users are forced to choose between a compute service with limited features and computational power or a heavily throttled download service. The result, evident at research shops around the world, is that users mirror ERA5 on their local computing systems, creating “dark replicas” of analysis-ready data. Hundreds of labs and institutions are probably mirroring the same hundreds of Terabytes, just so they can compute close to it.
Open Architectures
The alternative to the closed “one-stop-shop” platforms described above is an open network of data and computing services, interoperating over high-bandwidth internet connections. This model alleviates the need for one organization to shoulder the entire infrastructural burden, allowing each to focus on its strengths and stretch its limited budget. NCAR’s Jeff de La Beaujardière recently laid out a compelling vision for such a “Geodata Fabric”:
The article identifies a need for:
- A new approach to data sharing, focused on object storage rather than file downloads
- Scalable, data-proximate computing, as found in cloud platforms
- High-level analysis tools which allow scientists to focus on science rather than low-level data manipulation steps
With these ingredients in place, scientists could re-create much of the convenience found in closed platforms such as GEE or Descartes, but using open source software, distributed data from multiple different organizations, and computing resources provided by a scientist’s institution or funding agency. An open federation of data and computing resources would be more resilient, sustainable, and flexible than a single closed platform. It would also be more compatible with the distributed nature of scientific institutions and funding streams.
So what might this look like in more concrete terms? Below we review three different software stacks which might enable this type of open infrastructure. What all of these tools have in common is the ability to load and process data on demand over the internet, rather than assuming data is already downloaded onto a local hard drive.
OPeNDAP / THREDDS
In geoscience, we have had an excellent remote-data-access protocol for a long time: the “Open-source Project for a Network Data Access Protocol” or OPeNDAP.
The DAP2 protocol provides a discipline-neutral means of requesting and providing data across the World Wide Web. The goal is to allow end users, whoever they may be, to access immediately whatever data they require in a form they can use, all while using applications they already possess and are familiar with. In the field of oceanography, OPeNDAP has already helped the research community make significant progress towards this end. Ultimately, it is hoped, OPeNDAP will be a fundamental component of systems which provide machine-to-machine interoperability with semantic meaning in a highly distributed environment of heterogeneous datasets.
Long before anyone was talking about Cloud and REST APIs, visionary scientists and software developers developed a way of remote accessing data that remains very useful and relevant today. Using OPeNDAP, scientists can avoid pre-downloading large data files and instead can lazily reference a data object on a remote server, triggering downloads only when bytes are actually needed for computation.
An important counterpart to OPeNDAP workflow is a catalog service which helps users discover and access the data they need. A very common solution is the THREDDS Data Sever.
The THREDDS Data Server (TDS) is a web server that provides metadata and data access for scientific datasets, using OPeNDAP, OGC WMS and WCS, HTTP, and other remote data access protocols.
TDS provides the backbone for much for much of the world’s Earth System data. For example, the Earth System Grid Federation operates a global federation of servers which, using peer-to-peer replication, serve dozens of Petabytes of data to the worldwide research community. These data include the CMIP5 and CMIP6 climate model datasets. All the data are served to end users via TDS.
TDS supports both file download mode and OPeNDAP API-based direct access to data. By combining cloud-based processing with OPeNDAP access, ESGF users can basically do cloud-native-style workflows. A detailed example of this sort of workflow is described in the following blog post:
However, one limitation of this workflow is the processing throughput. The figure at left shows the throughput of the UCAR ESGF OPeNDAP service as a function of the number of parallel read processes. We can see that the data rate saturates at around 140 MB/s. While certainly sufficient for many workflows, it’s hard to process many TB of data this way. The throughput of OPeNDAP streaming is limited both by hardware — the limited processing power of the data server, limited outbound bandwidth, etc. — as well as by software — the OPeNDAP protocol was simply not designed with petascale applications and massively parallel distributed processing in mind.
COG / STAC
As the cloud has emerged as a powerful way to store and process large collections of data, the geospatial imagery community has pioneered a new class of cloud-native geospatial processing tools and data formats. Much of the success in this area can be attributed to the development a new data storage format, the “Cloud Optimized GeoTIFF”.
A Cloud Optimized GeoTIFF (COG) is a regular GeoTIFF file, aimed at being hosted on a HTTP file server, with an internal organization that enables more efficient workflows on the cloud. It does this by leveraging the ability of clients issuing HTTP GET range requests to ask for just the parts of a file they need.
Built to support efficient tile-by-tile access to large collections of geospatial imagery, the COG has provided an excellent template for the development of other cloud-optimized data formats (e.g. Zarr).
Like many open source projects, the development and production of COGs has lead to innovation in other areas as well. One example of such innovation is the development of the SpatioTemporal Asset Catalog (STAC).
The SpatioTemporal Asset Catalog (STAC) specification provides a common language to describe a range of geospatial information, so it can more easily be indexed and discovered. A ‘spatiotemporal asset’ is any file that represents information about the earth captured in a certain space and time.
The goal is for all providers of spatiotemporal assets (Imagery, SAR, Point Clouds, Data Cubes, Full Motion Video, etc) to expose their data as SpatioTemporal Asset Catalogs (STAC), so that new code doesn’t need to be written whenever a new data set or API is released.
COGs and STAC provide the building blocks for a flexible and accessible system for geospatial data analysis. STAC provides a system for describing large collections of geospatial data stored in cloud object store and COGs provide efficient access to pieces of those collections without needing to download the data first.
Indeed, COG / STAC provide an excellent template for open architecture. One limitation of this stack, however, is its rather narrow focus on geospatial imagery, which excludes many types of scientific data within Earth System science.
Pangeo
Pangeo represents our best attempt to implement a cloud-native open architecture solution for climate science and related fields. The key technological elements of Pangeo on the cloud are:
- Xarray — A high-level data model and API for loading, transforming, and performing calculations on multi-dimensional arrays. Datasets in Pangeo (and Xarray) tend to conform to the CF metadata Conventions.
- A distributed parallel computing framework — Dask — which enables scientists to scale out the computations to huge datasets with minimal changes to their analysis code.
- A storage format optimized for high throughput distributed reads on multi-dimensional arrays: Zarr. Zarr works well on both traditional filesystem storage and on Cloud Object Storage.
- Intake — a Python library which helps users navigate data catalogs and quickly load data without getting lost in the details.
- Jupyter — the interactive computing framework which allows users to interactively control a remote computing kernel, running in a container in the cloud, using their browser.
Using these basic ingredients, scientists can compose their own end-to-end platforms for big data analytics that rivals any of the closed options. (In fact, the closed platforms all reuse some of these components — particularly Jupyter, which has achieved near universal adoption in the data-science world.) This platform can run in any cloud or on any HPC system. It’s also highly interoperable with the other open stacks such as COG / STAC.
This figure shows how Pangeo scales on Google Cloud Platform. Using a few hundred parallel processes, we can achieve sustained data processing throughputs rates of 10–20 GB/s. Using elastic scaling and preemptible compute nodes, we can do this very cheaply (a couple of dollars to process a few TB).
Pangeo is not nearly as polished or well-documented as GEE and other closed platforms. But what it lacks in polish, it makes up in flexibility and extensibility. Pangeo can run on any cloud, or on virtually any on-premises hardware, and its components can be mixed-and-matched to meet an organization’s particular needs. For example, one organization might prefer to store their data in TileDB rather than Zarr format, or to use Iris rather than Xarray for a data model. No problem.
Most importantly, with Pangeo, any organization can provide data over the internet using a cloud-native, analysis-ready format and allow others to compute on that data at scale. For example, NOAA’s big data project is now providing large volumes of Earth System data in the cloud, spread across AWS, Google Cloud, and Azure.
Pangeo users can process this data directly at very high throughput using their own cloud computing account. No download or “ingestion” required.
Conclusions and Outlook
Closed platforms, such as Google Earth Engine or Descartes, offer the research community an exciting template for how cloud-native Earth System Science could work — no tedious downloads or frustrating data-preparation steps; comprehensive and user-friendly catalogs of relevant datasets; scalable, on-demand processing to quickly burn down Terabytes or Petabytes of data. However, it seems unlikely that the closed platforms can meet the needs of every Earth System scientist, due to their necessarily narrow scope. Furthermore, since industry, rather than academic science, is the main customer for these closed platforms, academic scientists will continue to depend on free credits — this doesn’t feel sustainable or scalable. This isn’t to say that the closed platforms can’t be very valuable for some scientists — just that we can’t expect to rely on them to meet all of our data processing needs across the entire field.
The alternative is to collaborate on building open architectures. We outlined three software stacks — OPeNDAP / TDS, COG / STAG, and Pangeo — which implement the principles of cloud-native open architecture in different and complimentary ways. Using these technologies, it’s possible to assemble a big-data processing system that rivals the power of the closed platforms. Crucially, these open architectures allow the separation of the role of data provider from the role of data consumer / processor. This eliminates the need for any one organization to be a “one-stop shop.” Data providers, like NOAA, can focus their expertise and limited resources on their core asset — their datasets — by providing analysis-ready data in the cloud. Other organizations — university labs, startups, etc. — can deploy their computing next to the data, using their own unique computational tools and environments. This federated model seems more compatible with the nature of academic funding than a single, central platform for everyone.
We’ve painted a rosy picture of open architectures, but many challenges remain to realizing this vision. Because the open architectures rely on open-source software components, they’re vulnerable to a “tragedy of the commons” problem. No one organization “owns” these components the way that Google owns Google Earth Engine. There are no marketing brochures or sales teams to pitch open architectures to high-level decision makers. This lack of centralization can make some institutions reluctant to commit to open infrastructure.
In fact, we believe that the decentralized nature of open-source is key to its sustainability and longevity. A recent analysis of NSF-funded software by Andreas Müller concluded that community software which arose organically from shared needs was much more likely to flourish than software that originated with a grant proposal. A key challenge for funders, therefore, is how to nurture and support these community-initiated tools and to help steer them towards meeting institutional needs.
Another challenge involves cloud computing costs. We have argued that open architectures enable an efficient decentralized mode of big data analytics, whereby different institutions bear the costs of cloud computing for their users on the same shared data. But, at least in academic scientific research, we still don’t have a good model for how to provide cloud computing to scientists. Should we build our own science cloud? Should funding agencies just grant money to researchers to buy cloud credits at the market rate? Or should an intermediary like Cloudbank aggregate and distribute cloud computing credits for the research community? This question ultimately must be resolved by the funding agencies who support scientific infrastructure.
Finally, even if we can figure out a simple model for how to pay for the credits, scientists will still need help deploying cloud-based infrastructure. This is a place where the closed platforms have a big edge. Most individual research groups lack the expertise and time to spin up a Pangeo-style computing environment of their own. One interesting model is the Canadian project Syzygy, which provides a managed JupyterHub service to universities across Canada.
Within the Pangeo project, we are thinking hard about how to take open architecture to the next level. We believe that open-source data science tools, cloud computing, and cloud-based data have the potential to transform Earth System science, ushering in a new era of discovery, efficiency, and productivity. If you have ideas about how to solve the challenges addressed above, we’d love to hear from you.
Acknowledgements
This work was supported by NSF award 1740648.