Climate Data Science

A Quick Introduction to CMIP6

How to easily access the next generation of climate models with Python.

Willy Hagi
Towards Data Science

The Coupled Model Intercomparison Project (CMIP) is a huge international collaborative effort to improve knowledge about climate change and its impacts on the Earth System and on our society. It has been running since the 1990s, and today we are heading into its sixth phase (CMIP6), which will provide a wealth of information for the next Assessment Report (AR6) of the Intergovernmental Panel on Climate Change (IPCC).

CMIP6 sponsors several groups working on a range of scientific questions, from the climates of the distant past to the impacts of deforestation and land-use changes. When finished, the entire project is estimated to release about 20 to 40 petabytes of data from more than 20 climate models. This is Big Data par excellence, but how can you try out all this information?

The MIPs in CMIP6. Image source: Simpkins (2017).

Setting your toolbox

import intake
import xarray as xr
import proplot as plot
import matplotlib.pyplot as plt

Apart from Matplotlib, these packages are not what you usually see in data science tutorials. They are, however, absolutely vital if you want to work with meteorological datasets.

  • Intake: a package to share and load datasets. Here it will be your connection to the Cloud via the intake-esm catalog.
  • Xarray: it’s Pandas for n-dimensional datasets, like the outputs from climate models.
  • Proplot: the next big thing for data visualization in Python. Seriously.
  • Matplotlib: the good old standard package for data visualization in the Python ecosystem.

Reading the data catalog

# URL of the Pangeo CMIP6 catalog
url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
# open the catalog
dataframe = intake.open_esm_datastore(url)

Thanks to the Pangeo efforts, you now have access to all the available CMIP6 datasets through the intake-esm package. The df attribute of the catalog opened above works essentially like the DataFrame you might be familiar with from Pandas, so you can easily check important information such as the column names.

>>> dataframe.df.columns
Index(['activity_id', 'institution_id', 'source_id', 'experiment_id', 'member_id', 'table_id', 'variable_id', 'grid_label', 'zstore', 'dcpp_init_year'],dtype='object')

Each of these columns is named after the controlled vocabulary of the CMIP project, and this kind of organization ensures that millions of datasets are kept neatly, like in a gigantic library. You can read a bit more about this here.
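Because the catalog is backed by a Pandas DataFrame, the usual Pandas idioms are a handy way to explore the vocabulary before searching. The sketch below does not touch the real catalog: toy_catalog is a hypothetical, hand-made table with a few of the same column names, and its rows are invented for illustration only. The same calls should work on dataframe.df.

```python
import pandas as pd

# Toy stand-in for dataframe.df: the rows are invented for illustration only.
toy_catalog = pd.DataFrame(
    {
        "institution_id": ["NCAR", "NCAR", "MOHC"],
        "source_id": ["CESM2", "CESM2", "UKESM1-0-LL"],
        "experiment_id": ["historical", "ssp585", "historical"],
        "table_id": ["Amon", "Amon", "Amon"],
        "variable_id": ["tas", "tas", "pr"],
    }
)

# Which experiments appear in the table?
experiments = sorted(toy_catalog["experiment_id"].unique())
print(experiments)

# How many entries does each institution contribute?
counts = toy_catalog["institution_id"].value_counts()
print(counts)

# Boolean masks narrow the table down, much like a catalog search does.
mask = (toy_catalog["experiment_id"] == "historical") & (
    toy_catalog["variable_id"] == "tas"
)
subset = toy_catalog[mask]
print(subset)
```

Listing the unique values of a column first is a cheap way to find out which spellings the controlled vocabulary actually uses.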

Searching for datasets

Once you are familiar with the vocabulary, it’s very simple to get the dataset you want. Here you’ll go straight to the monthly near-surface air temperature output of NCAR’s model for the Historical experiment. A query for this looks like:

>>> models = dataframe.search(experiment_id='historical',
...                           table_id='Amon',
...                           variable_id='tas',
...                           institution_id='NCAR',
...                           member_id='r11i1p1f1')

This search yields an intake_esm.core.esm_datastore object, which you can use to finally get the dataset you searched for. Printing the variable models gives you more information about it; it is basically a dictionary-like structure.

>>> models 
pangeo-cmip6-ESM Collection with 1 entries:
> 1 activity_id(s)

> 1 institution_id(s)

> 1 source_id(s)

> 1 experiment_id(s)

> 1 member_id(s)

> 1 table_id(s)

> 1 variable_id(s)

> 1 grid_label(s)

> 1 zstore(s)

> 0 dcpp_init_year(s)
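The member_id used in the query, 'r11i1p1f1', is a CMIP6 variant label: the letters r, i, p and f stand for the realization, initialization, physics and forcing indices of the run. A minimal sketch of how such a label can be split into its parts is shown below; parse_member_id is a hypothetical helper written for this example, not part of intake-esm, and it only handles the plain r/i/p/f form.

```python
import re

# Pattern for a plain CMIP6 variant label such as 'r11i1p1f1'.
VARIANT_LABEL = re.compile(r"^r(\d+)i(\d+)p(\d+)f(\d+)$")


def parse_member_id(member_id):
    """Split a variant label into its four integer indices."""
    match = VARIANT_LABEL.match(member_id)
    if match is None:
        raise ValueError(f"not a plain variant label: {member_id!r}")
    realization, initialization, physics, forcing = map(int, match.groups())
    return {
        "realization": realization,
        "initialization": initialization,
        "physics": physics,
        "forcing": forcing,
    }


indices = parse_member_id("r11i1p1f1")
print(indices)
```

In other words, the search above picks the eleventh realization of the model's historical ensemble.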

Finally getting your hands on the data

To do so, you first need to get the dataset out of this dictionary-like structure:

>>> datasets = models.to_dataset_dict()
Progress: |███████████████████████████████████████████████████████████████████████████████| 100.0%

--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

--> There are 1 group(s)

This yields another dict, with the difference that you can now get the keys to the dataset with datasets.keys():

>>> datasets.keys()
dict_keys(['CMIP.NCAR.CESM2.historical.Amon.gn'])
>>> dset = datasets['CMIP.NCAR.CESM2.historical.Amon.gn']
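The key above follows the pattern printed by to_dataset_dict(): the attribute values are joined with dots in a fixed column order. The sketch below rebuilds the key by hand to make that explicit; build_key and split_key are hypothetical helpers written for this example, and split_key assumes that no attribute value itself contains a dot.

```python
# Column order printed by to_dataset_dict() in the output above.
KEY_TEMPLATE = "activity_id.institution_id.source_id.experiment_id.table_id.grid_label"

# Attribute values of the single dataset found by the search.
attributes = {
    "activity_id": "CMIP",
    "institution_id": "NCAR",
    "source_id": "CESM2",
    "experiment_id": "historical",
    "table_id": "Amon",
    "grid_label": "gn",
}


def build_key(attrs, template=KEY_TEMPLATE):
    """Join the attribute values in the order given by the template."""
    return ".".join(attrs[column] for column in template.split("."))


def split_key(key, template=KEY_TEMPLATE):
    """Map each part of a key back to its column name (assumes no dots in values)."""
    return dict(zip(template.split("."), key.split(".")))


key = build_key(attributes)
print(key)  # CMIP.NCAR.CESM2.historical.Amon.gn
print(split_key(key)["source_id"])  # CESM2
```

Splitting keys this way is convenient when a search returns several groups and you want to pick one by model or experiment.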

The good news is that dset is an xarray.core.dataset.Dataset straight away, so you can readily use it for anything you might want to do with the powerful Xarray package, which is especially suited to work with gridded meteorological data.
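As a taste of what Xarray offers, the sketch below converts temperature from Kelvin to degrees Celsius and computes an area-weighted global mean. To stay runnable offline it uses a synthetic dataset called toy rather than dset: the values are invented, and the dimension order (member_id, time, lat, lon) is an assumption about how the real dataset is laid out.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for dset: invented values on a coarse 5-degree grid.
lat = np.linspace(-87.5, 87.5, 36)
lon = np.arange(0.0, 360.0, 5.0)
time = np.arange(3)

# 300 K at the equator, cooling towards the poles, identical at every time step.
tas_by_lat = 250.0 + 50.0 * np.cos(np.deg2rad(lat))
tas = np.broadcast_to(tas_by_lat[None, None, :, None], (1, 3, 36, 72)).copy()

toy = xr.Dataset(
    {"tas": (("member_id", "time", "lat", "lon"), tas, {"units": "K"})},
    coords={"member_id": ["r11i1p1f1"], "time": time, "lat": lat, "lon": lon},
)

# Kelvin to degrees Celsius; arithmetic keeps the coordinates intact.
tas_celsius = toy["tas"] - 273.15
tas_celsius.attrs["units"] = "degC"

# Grid cells shrink towards the poles, so weight each latitude by cos(lat).
weights = np.cos(np.deg2rad(toy["lat"]))
global_mean = tas_celsius.weighted(weights).mean(dim=("lat", "lon"))

print(global_mean.dims)
print(global_mean.values)
```

The cosine weighting matters on a regular latitude-longitude grid: a plain mean would give the small polar cells as much say as the large tropical ones.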

Plotting is always the fun part and you might be familiar with the Cartopy package for geospatial projections and several other applications. However, here you’ll use the new Proplot package for its simplicity and great ease of use. A quick and neat plot looks like:

fig, ax = plot.subplots(axwidth=4.5, tight=True,
                        proj='robin', proj_kw={'lon_0': 180})
# format options
ax.format(land=False, coast=True, innerborders=True, borders=True,
          labels=True, geogridlinewidth=0)
# plot the first index along the two leading dimensions of tas
map1 = ax.contourf(dset['lon'], dset['lat'], dset['tas'][0, 0, :, :],
                   cmap='IceFire', extend='both')
ax.colorbar(map1, loc='b', shrink=0.5, extendrect=True)
plt.show()

And voilà! A map of near-surface air temperature in a good-looking Robinson projection.

Final words

CMIP6 is, more than ever, readily available to anyone who wants to give it a try, thanks to the efforts of the Pangeo community. The complex climate models are now accessible to any student, citizen scientist or full-time scientist with a reasonably decent internet connection.

This has great potential to open the door to new contributions, improve knowledge and help the efforts towards climate resilience and mitigation strategies.

PS: a Jupyter Notebook with the code above is available in this repository.
