Analyzing large climate model ensembles in the cloud

Joe Hamman
Oct 9 · 4 min read

Written by Joe Hamman, posted on behalf of the NCAR Science at Scale Team.

The Science at Scale Team at the National Center for Atmospheric Research (NCAR) is excited to announce the release of the Community Earth System Model (CESM) Large Ensemble Numerical Simulation (LENS) dataset published in the Amazon Public Dataset Program (link to dataset). In this blog post, we give a brief overview of 1) the LENS dataset, 2) how you can access the data, and 3) a Binder-ready Jupyter Notebook that reproduces a few key analyses of the LENS dataset — originally presented in the Kay et al. 2015 paper.

Reproduced figure 2 from Kay et al. 2015. Original Caption: “Global surface temperature anomaly (1961–90 base period) for the 1850 control, individual ensemble members, and observations (HadCRUT4; Morice et al. 2012).” The 1850 control run is not shown in the reproduction.

The data

The CESM LENS dataset includes a 40-member ensemble of climate simulations for the period 1920–2100 using historical data (1920–2005) or assuming the RCP8.5 greenhouse gas concentration scenario (2006–2100), as well as longer control runs based on pre-industrial conditions. The data comprise both surface (2D) and volumetric (3D) variables in the atmosphere, ocean, land, and ice domains. The total data volume of the original dataset is ~500 TB, which has been stored as ~150,000 individual NetCDF files on disk and magnetic tape. The dataset has been made available through the NCAR Climate Data Gateway for download or via web services. NCAR has copied a subset (currently ~70 TB) of CESM LENS data to Amazon S3 as part of the AWS Public Dataset Program.

To optimize for large-scale analytics, we have represented the data as ~275 separate Zarr datasets. We chose to store the data in the Zarr format because it offers a convenient way to store multi-dimensional data in cloud object storage and because it easily represents the metadata (attributes, coordinates, dimensions) found in the kind of NetCDF files produced by climate models. Additionally, the NetCDF development team at Unidata has recently announced plans to support Zarr as a storage backend in the NetCDF library soon, so this effort allows us to prototype analysis tools and workflows using this storage paradigm.
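To illustrate why a chunked format suits object storage, here is a minimal sketch. It assumes the Zarr version 2 layout, in which every chunk of an array is a separate object whose key joins the chunk's grid indices with dots. The function name `chunk_key`, the array name `tas`, and the chunk shape are hypothetical and chosen only for illustration; they are not taken from the LENS stores.

```python
def chunk_key(index, chunks, array_path=""):
    """Return the Zarr v2 object key of the chunk holding the element at `index`.

    Assumes the default '.' separator between chunk grid indices.
    """
    if len(index) != len(chunks):
        raise ValueError("index and chunks must have the same number of dimensions")
    # Integer division maps an element index to its position in the chunk grid.
    chunk_index = [i // c for i, c in zip(index, chunks)]
    key = ".".join(str(i) for i in chunk_index)
    return f"{array_path}/{key}" if array_path else key


# Hypothetical 3-D array (time, lat, lon) chunked along time only.
chunks = (100, 192, 288)
print(chunk_key((0, 0, 0), chunks, "tas"))      # tas/0.0.0
print(chunk_key((250, 10, 20), chunks, "tas"))  # tas/2.0.0
```

Because each chunk is its own object, a reader only needs to fetch the objects that overlap the requested slice, and many workers can fetch different chunks in parallel.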

Options for accessing the data

As we mentioned above, the dataset is stored as ~275 Zarr stores. The easiest way to access this data is with the Python Intake package. We have provided Intake catalogs for all model components and time frequencies (link to catalogs). For example, the code block below shows how to load an Xarray Dataset that includes daily surface air temperature.

import intake
cat = intake.Catalog('https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/atmosphere/daily.yaml')
ds_20C = cat['reference_height_temperature_20C'].to_dask()
display(ds_20C['TREFHT'])
<xarray.DataArray 'TREFHT' (member_id: 40, time: 31390, lat: 192, lon: 288)>
dask.array<shape=(40, 31390, 192, 288), dtype=float32, chunksize=(2, 365, 192, 288)>
Coordinates:
  * lat        (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
  * member_id  (member_id) int64 1 2 3 4 5 6 7 8 ... 34 35 101 102 103 104 105
  * time       (time) object 1920-01-01 00:00:00 ... 2005-12-31 00:00:00
Attributes:
    cell_methods:  time: mean
    long_name:     Reference height temperature
    units:         K
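To get a feel for the size of the array shown above, the short calculation below works out its uncompressed footprint from the shape, dtype, and chunk size printed in the output. It is only back-of-the-envelope arithmetic: the objects stored on S3 may be compressed and therefore smaller.

```python
from math import ceil, prod

shape = (40, 31390, 192, 288)   # member_id, time, lat, lon
chunks = (2, 365, 192, 288)     # chunk size reported in the output above
itemsize = 4                    # bytes per float32 value

total_bytes = prod(shape) * itemsize
chunk_bytes = prod(chunks) * itemsize
# Number of chunks along each dimension, multiplied together.
n_chunks = prod(ceil(s / c) for s, c in zip(shape, chunks))

print(f"total: {total_bytes / 1e9:.1f} GB")   # total: 277.7 GB
print(f"chunk: {chunk_bytes / 1e6:.1f} MB")   # chunk: 161.5 MB
print(f"chunks: {n_chunks}")                  # chunks: 1720
```

The time axis divides evenly (31390 = 86 × 365), so each chunk holds one 365-day year of daily fields for two ensemble members. At roughly 278 GB uncompressed, this single variable is best processed lazily and in parallel rather than loaded into memory at once.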

The data can also be accessed directly (not using Intake) using libraries like S3FS. For example, the code snippet below returns the same data shown above:

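import s3fs
import xarray as xr

# Connect anonymously; the bucket is public, so no AWS credentials are needed
s3 = s3fs.S3FileSystem(anon=True)
# Expose the Zarr store on S3 as a key-value mapping
store = s3fs.S3Map(root='ncar-cesm-lens/atm/daily/cesmLE-20C-TREFHT.zarr', s3=s3)
# Open the store lazily and select the temperature variable
ds_20C = xr.open_zarr(store)['TREFHT']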

Reproducing Kay et al. 2015

The CESM Large Ensemble Project’s main goal was to enable the assessment of climate change in the presence of internal climate variability. The project provided the research community with a new view into the role of natural variability in climate projections. The 2015 BAMS paper by Kay et al. gave an overview of the initial findings of the effort and has since become one of the cornerstone publications in the actively growing field studying large initial-condition climate model ensembles. As an example of how this effort has helped spawn an active area of research, a recent US CLIVAR workshop held at NCAR set out to foster the usage of similar ensembles to advance understanding of natural climate variability, climate change, and their impacts.

With the goal of providing some relatively simple demonstrations of how to use the LENS data on AWS, we set out to reproduce some of the analysis in the original BAMS paper while using this newly published dataset on AWS. We rewrote the analysis tools to produce two of the figures in the Kay et al. paper using Python libraries like Intake, Xarray, and Dask. We have provided this as a Binder-ready repository of Jupyter Notebooks. Click here to run the notebook now on one of Pangeo’s BinderHub deployments.

Screencast reproducing figure 2 from Kay et al. (2015) in a Jupyter Notebook running on https://aws-uswest2-binder.pangeo.io.

We’ve only reproduced two figures from the Kay et al. paper, but more could be done. If you’re interested in extending our analysis, pull requests are welcome on the CESM-LENS-AWS GitHub repository.
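For readers who want a sense of the kind of calculation behind a global temperature anomaly series like the one in figure 2, here is a minimal sketch in plain NumPy. It is not the code from the notebook: the function names `global_mean` and `anomaly` are hypothetical, the input is synthetic, and it assumes a regular latitude-longitude grid on which cell area scales with the cosine of latitude.

```python
import numpy as np


def global_mean(field, lat):
    """Area-weighted mean over lat/lon for an array shaped (time, lat, lon)."""
    weights = np.cos(np.deg2rad(lat))   # cell area scales with cos(latitude)
    zonal_mean = field.mean(axis=-1)    # average over longitude -> (time, lat)
    return (zonal_mean * weights).sum(axis=-1) / weights.sum()


def anomaly(series, years, base=(1961, 1990)):
    """Subtract the mean of `series` over the inclusive base period."""
    in_base = (years >= base[0]) & (years <= base[1])
    return series - series[in_base].mean()


# Synthetic annual-mean temperatures in kelvin, for illustration only.
years = np.arange(1920, 2006)
lat = np.linspace(-90.0, 90.0, 19)
lon = np.arange(0.0, 360.0, 20.0)
trend = 0.01 * (years - years[0])       # uniform warming of 0.01 K per year
field = 288.0 + trend[:, None, None] + np.zeros((years.size, lat.size, lon.size))

anom = anomaly(global_mean(field, lat), years)
print(anom[0], anom[-1])
```

With real data the same two steps would be applied to each ensemble member separately, which gives one anomaly curve per member.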

Conclusions

Climate modeling efforts, like the CESM Large Ensemble and CMIP6, are producing ever-increasing data archives. Rather than keeping these archives on restricted-access HPC or tape archival systems, there is a growing call to move these datasets to more accessible locations like cloud storage, where anyone can access the data in place without first downloading the archive. By moving the CESM LENS data archive to the cloud, we hope to enable both new scientific applications with the LENS data and the development of new tools and new approaches for working with large ensembles of climate data in the cloud.

This effort was coordinated through the NCAR Science at Scale Team and led by Jeff de La Beaujardiere. Specific thanks go to Anderson Banihirwe, Chi-Fan Shih, Brian Bonnlander, Joe Hamman, Gary Strand, Eric Nienhouse, Seth McGinnis, Kevin Paul, and others from NCAR, as well as Joe Flasher, Ana Pinheiro Privette, Jed Sundwall, and others from Amazon. NCAR is a Federally Funded Research and Development Center, sponsored by NSF.

pangeo

A community platform for big data geoscience

Written by Joe Hamman

Computational hydrologist and data scientist @NCAR, http://joehamman.com/
