CMIP6 in the Cloud Five Ways

Ryan Abernathey · Pangeo · Dec 17, 2019

Last week we announced that Pangeo has partnered with Google Cloud to bring CMIP6 climate data to Google Cloud’s Public Datasets Program. You can read more about the dataset and the process here.

CMIP6 stands for Coupled Model Intercomparison Project Phase 6, a project that is part of the World Climate Research Programme.

The concept of cloud computing and cloud-based data is still very new to many scientists. In this post, I’ll try to explain what cloud-based data analysis looks like by running through five different scenarios for interacting with the CMIP6 data. The main point which emerges is that, while cloud data storage certainly does work great as a download service, its true potential is only realized when combined with cloud computing.

Here we are using a very flexible definition of “cloud.” The main thing all these different modes of access have in common is that a) I can run them all from a Jupyter notebook in a browser window on my laptop; and b) they avoid the common pattern of first downloading a bunch of files to a local hard drive. As we have argued in the past, while downloading can indeed be convenient in the short term, we see it as a sub-optimal workflow. Instead, these cloud-style analysis patterns load data on demand (i.e. “lazily”) over the network. (However, as we will see, not all networks are created equal.) Furthermore, as a side effect of the “no-download” approach, they are completely self-contained and reproducible by anyone, anywhere in the world.

The examples mostly use Python, but there is a bonus Julia example at the end. They are stored in this GitHub repository and can be launched interactively with this Binder link: https://binder.pangeo.io/v2/gh/pangeo-data/pangeo-cmip6-examples/master.

Way 1: Use ESGF’s Search API and OPeNDAP Service (from your laptop)

In many aspects, the legacy infrastructure through which CMIP data has been distributed for the past decade is already pretty cloud-like. The ESGF (Earth System Grid Federation) operates a global federation of data servers which mirror different parts of the dataset. These servers also offer a REST API to query the data holdings and obtain download links for different data access protocols. One of these protocols is OPeNDAP, a remote data access protocol which sends structured data formats, including the NetCDF-style data of CMIP6, over the internet via HTTP.

These ingredients — the ability to search a data catalog and the ability to load data on demand over HTTP — are the foundations of all these cloud-style data analysis workflows. However, in this case, the “compute” is just my laptop. An example is shown in this notebook.

A search for historical surface air temperatures in a particular model looks something like this

# esgf_search is a small helper defined in the example notebook that wraps
# the ESGF RESTful search API and returns a list of OPeNDAP URLs
result = esgf_search(activity_id='CMIP', table_id='Amon',
                     variable_id='tas', experiment_id='historical',
                     institution_id="NCAR", source_id="CESM2",
                     member_id="r10i1p1f1")

and takes less than a second. Here result contains a bunch of OPeNDAP URLs like

http://esgf-data.ucar.edu/thredds/dodsC/esg_dataroot/CMIP6/CMIP/NCAR/CESM2/historical/r10i1p1f1/Amon/tas/gn/v20190313/tas_Amon_CESM2_historical_r10i1p1f1_gn_200001-201412.nc
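(The esgf_search helper is not part of any library; it is a small function defined in the example notebook. A minimal sketch of such a helper, assuming the ESGF search endpoint at esgf-node.llnl.gov — any ESGF index node would do — might look like the following; the real helper also handles paging and multiple access protocols.)

import requests

def esgf_search(server="https://esgf-node.llnl.gov/esg-search/search",
                **search):
    # query the ESGF search API for files matching the given facets
    params = dict(type="File", format="application/solr+json",
                  latest="true", limit=500, **search)
    resp = requests.get(server, params=params)
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]
    # each file lists its access URLs as "url|mime-type|service" strings;
    # keep the OPeNDAP ones and strip the trailing ".html"
    urls = []
    for doc in docs:
        for entry in doc["url"]:
            url, _, service = entry.split("|")
            if service == "OPENDAP":
                urls.append(url[:-len(".html")])
    return urls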

We can open them all at once with Xarray as

# lazily open the last four files as a single dataset (assumes `import xarray as xr`)
ds = xr.open_mfdataset(result[-4:], combine='by_coords')

which results in a Dataset that looks like this:

This all happens very quickly, because the data (437 MB) haven’t been loaded yet, just the metadata. When I call .load() on the dataset, it takes about 40 seconds to download from the UCAR ESGF node to my laptop on my home network, or about 11 MB/s. We can use the data to make a pretty plot like this.
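In the notebook, that step looks roughly like the following; %time is the IPython timing magic, and the plotting call is illustrative (the notebook’s actual figure may be styled differently).

%time ds.tas.load()        # pulls ~437 MB over OPeNDAP; ~40 s on my home network

# quick-look map of surface air temperature at the final timestep
ds.tas.isel(time=-1).plot()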

Pretty easy! The ESGF infrastructure worked great.

However, what if we wanted to work with more data, like all the ensemble members from all of the different CMIP6 models? We could try to speed things up by pulling many datasets in parallel, perhaps using a Dask cluster. It turns out that wouldn’t get us very far: the OPeNDAP server throughput maxes out at around 140 MB/s, no matter how many requests we issue in parallel. This is understandable! It’s just one server, and its bandwidth to the rest of the internet is limited.

Throughput of the UCAR ESGF OPeNDAP server in MB/s, as a function of the number of parallel processes used. Note that the throughput saturates at around 140 MB/s.

Way 2: Do the same thing with the Google Cloud Zarr-based data (still from your laptop)

The same data are available in Google Cloud, with just some slight differences in how they are cataloged and stored. What if we just repeat the same analysis with the cloud data, still using our laptop for computing? This is shown in the basic_search_and_load.ipynb notebook.

The principal differences are:

  • Rather than querying a REST API, we just downloaded a 30 MB CSV file listing all the cloud datasets and used Pandas to search / filter it for the data we need, i.e.
df.query("activity_id=='CMIP' & table_id=='Amon' & "
         "variable_id=='tas' & experiment_id=='historical'")
  • Rather than accessing data over OPeNDAP, we opened up a Zarr store on Google Cloud Storage (a combined, runnable sketch of both steps follows just after this list).
# get the path to a specific zarr store 
zstore = df_ta_ncar.zstore.values[-1]
# create a mutable-mapping-style interface to the store
mapper = gcs.get_mapper(zstore)
# open it using xarray and zarr
ds = xr.open_zarr(mapper, consolidated=True)
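
Putting the two steps together, here is a minimal, self-contained sketch. It assumes the gcsfs and zarr packages are installed and that the catalog CSV lives at the path used by the example notebooks (https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv); if that location has changed, substitute the current catalog URL.

import pandas as pd
import xarray as xr
import gcsfs

# load the catalog of Zarr stores in Google Cloud (~30 MB CSV)
df = pd.read_csv("https://storage.googleapis.com/cmip6/"
                 "cmip6-zarr-consolidated-stores.csv")

# filter for NCAR CESM2 historical monthly surface air temperature
df_ta_ncar = df.query(
    "activity_id=='CMIP' & table_id=='Amon' & variable_id=='tas' & "
    "experiment_id=='historical' & source_id=='CESM2'"
)

# anonymous, read-only access to the public bucket
gcs = gcsfs.GCSFileSystem(token='anon')

# open one store lazily with xarray + zarr
zstore = df_ta_ncar.zstore.values[-1]
mapper = gcs.get_mapper(zstore)
ds = xr.open_zarr(mapper, consolidated=True)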

The user experience was very similar — almost identical really — to the legacy ESGF infrastructure. The difference is that no complex web “services” were involved. Both the data catalog and the data itself live directly in Google Cloud Storage, accessed using vanilla HTTP GET calls. In terms of infrastructure, this is a big simplification.

It’s also faster. Loading the same data took me about 5 s (or ~80 MB/s), on the same computer and network as the example above. That’s pretty close to the 100 MB/s bandwidth I expect on my home network. It’s not surprising that accessing the data on Google Cloud Storage is faster; it’s the same data storage service used to power massive, data-intensive applications like Spotify and Vimeo, built with all of Google’s accumulated expertise and economies of scale.

So Zarr over HTTP is clearly a viable replacement for the OPeNDAP remote data access protocol, allowing us to access CMIP6 data on demand from anywhere on the internet. However, the real potential of cloud storage only comes out when we couple it with cloud computing.

Way 3: Run the Same Code from a Jupyter Notebook Running Inside Google Cloud

Using Binder or a persistent, cloud-based JupyterHub, we can get a Jupyter notebook / lab running inside Google Cloud. This looks and feels basically identical to a local Jupyter instance; however, the actual computer running python is located very close (in terms of network proximity and bandwidth) to the Google Cloud Storage data.

To get a feel for how this works, you can try it yourself using Pangeo’s Binder. Just click this link and then select the basic_search_and_load.ipynb example notebook. (The Binder example is ephemeral and designed just to provide demonstrations; any changes you make will disappear after it shuts down. For a more permanent data analysis environment, you can deploy your own JupyterHub or use an existing Pangeo JupyterHub.)

You can also use Google Colab to run the example.

At this scale (~400 MB of data), there are no obvious advantages to being in the cloud. Things run a little faster — 3 s to load the data rather than 5 s from my home network. Mostly, it’s just incredibly convenient; the analysis is accessible and reproducible by anyone, anywhere, as long as they have an internet connection. But to really see the performance benefits of the cloud, we need to scale up and work with more data.

Way 4: Use Dask to Scale your Analysis in the Cloud

So far we have been working with a toy example. To go further, we need a more serious calculation. I’ll use one I developed as an example for the NCAR / LDEO CMIP6 Hackathon, which assesses how the frequency of extreme rainfall will shift with climate change.

The calculation was inspired by Angie Pendergrass’s work on precipitation statistics (e.g. this paper or this website). It includes:

  • Searching the data catalog and finding all available models (technically source_ids) with 3-hourly precipitation data for both the historical and ssp585 experiments. (Only four at this point.)
  • Calculating the zonal-mean precipitation histograms using the xhistogram package.
  • Visualizing the changes under a global warming scenario.

Each model analyzed involves about 10 GB of data. To speed things up, we employ Xarray and Dask to automatically parallelize the calculation and distribute it across a cluster of workers. A major focus of the Pangeo project has been to improve the usefulness of Dask for climate science applications, and this demonstration shows the results of this effort.
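For a flavor of what that setup looks like, here is a minimal sketch. It assumes a Dask Gateway cluster (as provided on the Pangeo hubs), that ds is a lazily opened 3-hourly precipitation dataset like the ones above, and that the xhistogram package is installed; the bin edges are purely illustrative.

import numpy as np
from xhistogram.xarray import histogram
from dask_gateway import Gateway

# launch a Dask cluster on the hub and connect a client to it
gateway = Gateway()
cluster = gateway.new_cluster()
cluster.scale(10)                      # ten workers, as in the timing below
client = cluster.get_client()

# illustrative log-spaced bins for precipitation rate (kg m-2 s-1)
precip_bins = np.logspace(-9, -2, 50)

# zonal-mean precipitation histogram: count events in each bin,
# aggregating over time and longitude but keeping latitude
hist = histogram(ds.pr, bins=[precip_bins], dim=['time', 'lon'])
result = hist.compute()                # Dask executes this in parallel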

Screenshot of interactive analysis of CMIP6 data in Google Cloud using JupyterLab, Xarray, and Dask. On the left is our notebook; the panels on the right are from the Dask dashboard, showing the task graph, task stream, and progress bars.

You too can try this out by running the binder and then selecting the precip_frequency_change.ipynb notebook.

Using 10 Dask workers, I’m able to process the 10 GB of data in about 10 seconds, or 1 GB/s. That’s pretty fast! It’s quick enough for this work to feel “interactive,” allowing us to iterate our analysis many times to refine the calculation.

How fast can we go? The amazing thing about Cloud Storage is that, for I/O-bound workflows, our throughput scales linearly with the number of parallel reads. We have been doing some experiments to test the scaling, and preliminary results are shown below.

Throughput of reading Zarr data from Google Cloud Storage with a Dask Kubernetes cluster, as a function of the number of processes in the cluster.
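The measurement itself is conceptually simple. A rough sketch, reusing the cluster, client, and dataset from the sketch above and using a hypothetical measure_throughput helper, might look like this:

import time

def measure_throughput(da, n_workers):
    # scale the cluster, wait for the workers, then time a full read
    cluster.scale(n_workers)
    client.wait_for_workers(n_workers)
    tic = time.time()
    da.compute()                       # pull every chunk from Cloud Storage
    return da.nbytes / 1e6 / (time.time() - tic)   # MB/s

for n in [10, 20, 45, 90]:
    print(n, "workers:", measure_throughput(ds.pr, n), "MB/s")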

With 90 Dask workers (each with two threads), we can pull data out of Google Cloud Storage at 16 GB/s. (Recall that the ESGF OPeNDAP node maxed out at 140 MB/s.)

The ability to process many Terabytes or even Petabytes of data quickly is one of the most attractive features of cloud computing.

Way 5: Bonus — Use Julia

All the examples above use Python. There is an effort underway to support Zarr in all major programming languages, and Julia in particular is catching on rapidly in the climate modeling community, so that seemed like a good place to start. Thanks to the Zarr.jl project, the CMIP6 data are also accessible from Julia. For code examples and an interactive Binder, check out this repo, which re-creates the simple search / load example above using Julia:

CMIP6 Zarr Cloud Data: Surface air temperature snapshot made with Julia

The Julia demo isn’t quite as slick — it requires more lines of code to do the same thing. This is mostly due to the lack of an Xarray-like library for high-level labeled-array analysis in Julia. But all the basic elements clearly work.

Summary

There are many different ways to take advantage of the CMIP6 data in the cloud. The simplest is to just use it as a remote data access service. But to really take advantage of the amazing properties of Google Cloud Storage, you have to do parallel, distributed data analysis in the cloud. The Pangeo project and the tools it supports — like Xarray, Dask, Zarr, and Jupyter — make this relatively simple.

To learn more about Pangeo, check out our website (https://pangeo.io) or visit our Discourse forum.

Acknowledgements

This work was supported by NSF award 1740648.
