Tips to handle climate data with Python

Luigia Ripani
Axionable
14 min read · Jul 22, 2022


Photo by T.H. Chia on Unsplash

The urge to tackle climate change is also taking hold in the world of data and AI: on one hand, providers of open climate and environmental data, as well as tools to handle them, are proliferating; on the other, a growing community of data analysts, scientists and engineers is approaching these topics.

There are several reasons why you might want to read this article before starting your climate data project: the lack of standards and consensus on guidelines for dealing with such types of data, the limited knowledge available on these topics within the data/AI community, the still-evolving state of the Python libraries, and the scarcity of expert literature on the topic.

When you read and process your data directly with Python, the technical challenges range from data volumes that can quickly become time-consuming, to the variety of Python libraries that each handle only part of the preprocessing of the different geographic formats, such as xarray datasets or vector/Shapefiles.

Other difficulties arise from the geographic features of the data: understanding and manipulating metadata like the Coordinate Reference System (CRS) can be quite difficult if you are not a domain specialist, and mistakes can produce surprisingly funny plots.

In this article we’ll share some tips to overcome some of the trickiest pain points we bumped into when developing our first Python code to read, preprocess and aggregate different types of climate data, such as NetCDF files for the CMIP5/6 climate projections provided by the IPCC, and GRIB (raster) files for the reanalysis weather data provided by Copernicus (C3S).

In the examples below, we will use open data from two sources:

  • CMIP5 data after downscaling and bias correction, provided by DRIAS (French Ministry of the Ecological Transition), which can be downloaded after activating a free account. In particular, we selected the ALADIN63_CNRM_CM5 model, the RCP4.5 scenario, daily data for the year 2030, and as variable the mean temperature over the whole French continental territory (Corsica included).
  • The dataset with the boundaries of each French municipality, available at data.gouv.fr.

#1 Multiple formats and libraries

There are two main typologies of geographic data, raster and vector data. Vector data contain a geometry object that can be a point, a line or a (multi)-polygon. In any case, each shape is defined by the list of the coordinates (x, y) of its vertices.
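As a minimal illustration, the three geometry types can be built directly with Shapely (the library Geopandas uses under the hood; the coordinates here are made up):

```python
from shapely.geometry import Point, LineString, Polygon

# Each shape is defined by the (x, y) coordinates of its vertices
pt = Point(2.35, 48.85)                           # a single location
line = LineString([(0, 0), (1, 1), (2, 0)])       # an open polyline
poly = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])  # a closed ring

print(poly.area)        # 1.0
print(pt.within(poly))  # False: the point is outside the unit square
```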

On the other hand, raster data are structured in a multidimensional array: a bi-dimensional grid representing a geographic region indexed by a set of coordinates (x, y) and a third dimension containing the value of the variable for each pixel.

The third dimension can contain several layers, referred to as bands: one band per variable (in case of multiple variables) and per timestamp (if there is a temporal dimension).

XArray data structure
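The raster structure described above can be sketched as a toy Xarray DataArray (all values and coordinates are made up for illustration):

```python
import numpy as np
import xarray as xr

# A toy raster: 2 bands (e.g. two timestamps) over a 3x4 (y, x) grid
raster = xr.DataArray(
    np.arange(24).reshape(2, 3, 4),
    dims=("band", "y", "x"),
    coords={"band": [1, 2],
            "y": [49.0, 48.5, 48.0],     # latitudes
            "x": [2.0, 2.5, 3.0, 3.5]},  # longitudes
    name="temperature",
)
print(raster.sel(band=1).shape)  # (3, 4): one 2D grid per band
```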

Vector data are commonly stored in the Shapefile format. The peculiarity of this format is that a Shapefile can be opened only if the file with the .shp extension is in the same folder as its auxiliary files, which have the same name and the extensions .shx and .dbf.

A Shapefile can contain a set of attributes associated to each shape, but it can contain only one type of geometry (points, lines or polygons), and the geometry object has to be named “geometry”.

Raster data are stored in different formats, such as GeoTIFF, NetCDF and GRIB. A GeoTIFF file is the classical TIFF image file format plus some geographic metadata, which associates each pixel with a specific geographic zone. The GRIB format is specific to meteorological (weather) data, while NetCDF is specific to climate data; it is a self-describing format, meaning that metadata on the geographic features, model experiment, provider contacts and so on are stored in the same file.

There are several libraries able to load and read data stored in those formats. Each one evolves rapidly, has its advantages and limitations, and no particular one stands out from the rest. The choice depends on the processing and operations you have to do with your data, and on how familiar you are with climatic and geographic notions.

1. Geopandas
Geopandas is the Pandas extension for geospatial data. It is by far the standard choice when you work with GeoDataFrames, which are dataframes with a GeoSeries column (which has to be named “geometry”) containing the spatial attributes (point, line, polygon). Note that each geometry is a Shapely object. Shapely is another important library for dealing with geometric objects, and we’ll use it later on to transform two series of longitude-latitude coordinates into a geometric series.

import geopandas as gpd
cities = gpd.read_file(path+'communes-20210101.shp')

2. Xarray
Xarray is a very user-friendly Python library to handle multidimensional labeled arrays. It has its origins in the climate data community, so it is particularly recommended for climate data. The package includes large and growing support for advanced visualization of these data structures, and it is particularly tailored to working with NetCDF files. The additional dependencies needed are Matplotlib and netCDF4.

import xarray as xr
climate_xr = xr.open_dataset(path+'tasAdjust_France.nc')
climate_xr

3. Rasterio
Rasterio is among the pioneering Python libraries for processing geographical raster data. It allows you to manually access and manipulate geographic metadata such as the coordinate reference system, the affine transformation or the units. A nice advantage of Rasterio is that it recognizes the tags of each band, so you can easily check which variable/timestamp each band corresponds to. On the other hand, it is a bit less user-friendly: it doesn’t provide a clear display of the data, and since many parameters and functions have to be set manually, you need to be quite familiar with geographic information notions.

import rasterio as rio
climate_rio = rio.open(path+'tasAdjust_France.nc')
climate_rio
climate_rio.meta

4. Rioxarray
Rioxarray is an Xarray extension with Rasterio accessors, able to handle the geographic features. It is really user-friendly, and it automates some processes, like reprojection, that have to be done manually with Rasterio. For these reasons it is the library we chose for our project, and the one we’ll use later on in this article.

import rioxarray as rxr
climate = rxr.open_rasterio(path+'tasAdjust_France.nc')
climate

Note that you can use Xarray and still be able to access the Rioxarray components just by using df.rio.function().

5. netCDF4
netCDF4 is a Python library specific to nc files. It provides all kinds of functions to write, read and manipulate nc data and metadata, but in our experience we found it a bit less straightforward than the Xarray package.

import netCDF4 as nc
climate_nc = nc.Dataset(path+'tasAdjust_France.nc')
climate_nc

#2 The Coordinate Reference System (CRS)

The coordinate reference system is one of the most important pieces of metadata for geographical data, especially when you work with different datasets or want to visualize data.

If you want to locate a point on the Earth, you need two coordinates (and a third one if you take the altitude into account). But this is not an easy task, since the Earth is not spherical; it looks more like an ellipsoid with bumps and holes. Hence, it needs to be mathematically modeled. Depending on how you approximate the shape of the Earth (e.g. by a spheroid or an ellipsoid) and on where you place this approximation (e.g. you can decide to fit precisely the region of the globe you are interested in, but the rest of the globe will likely be poorly approximated), you are making different assumptions, and this gives rise to different Coordinate Reference Systems.

Hence, knowing the longitude and latitude of a point is not enough to identify it unambiguously; you also need to know all the other assumptions specified in the CRS. Luckily, you don’t need to make those assumptions yourself. You only need to know the CRS you are interested in, or pick one. Each CRS has a name, but what you really need to know is the unique EPSG code associated with it. (Fun fact: ‘EPSG’ stands for European Petroleum Survey Group, the entity that originally created the code registry.)

If you are a novice in geographic information, you probably don’t know which CRS to choose. It is worth knowing that there are two main families of CRSs: geographic and projected. Geographic CRSs are based on a 3D model of the Earth and cover the entire globe; each point is defined by longitude and latitude coordinates expressed in angular units (degrees). On the other hand, projected CRSs are based on 2D models, hence they include assumptions on how to project a 3D object onto a flat 2D plane. Projected CRSs have linear units (meters), and are better adapted if you are interested in a specific region of the Earth. Note that whenever you flatten a 3D object onto a 2D surface, some deformation occurs, whether in distances, angles or shapes.
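You can inspect which family a CRS belongs to, and its units, with pyproj (a dependency of Geopandas); a quick sketch:

```python
from pyproj import CRS

geo = CRS.from_epsg(4326)   # WGS84: geographic
proj = CRS.from_epsg(2154)  # Lambert-93: projected, adapted to France

print(geo.is_geographic, geo.axis_info[0].unit_name)   # True degree
print(proj.is_projected, proj.axis_info[0].unit_name)  # True metre
```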

Image source: https://www.researchgate.net/figure/The-nine-small---scale-map-projections-used-in-the-paired-comparison-test-arranged-by_fig1_298354278

If you want to dive deeper into CRS concepts, we recommend the courses at this link.

There is no right or wrong CRS. The most common global one is the WGS84, corresponding to the ‘EPSG:4326’ code. It is also the CRS of the shapefile we are working with, so we decided to keep it unchanged.

cities.crs

If the data you are working with have no CRS specified, or if you want to convert them to another one, both Geopandas and Rioxarray allow you to do this in a single line, as long as you know the EPSG code of the CRS you want to use.

Being mainly interested in the French territory for our project, we could have chosen to work with the CRS ‘EPSG:2154’, which is particularly adapted to France. We preferred to work with lon-lat coordinates, so in the following we’ll keep the original EPSG. But for the sake of completeness, let’s see here how the CRS reprojection affects the plots.

import matplotlib.pyplot as plt

cities_reproj = cities.to_crs('EPSG:2154')
fig, axs = plt.subplots(1, 2, figsize=(20, 15))
cities.plot(ax=axs[0])
axs[0].set_title('EPSG:4326')
cities_reproj.plot(ax=axs[1])
axs[1].set_title('EPSG:2154')
plt.show()

There is a slight difference, more visible on the east side of France, and of course in the units on the plot axes (meters vs degrees).

By running the line below, you can see that the nc file we are working with has no CRS in its metadata.

climate.rio.crs

We will use the write_crs() function to assign it the same CRS as the Shapefile.

climate = climate.rio.write_crs(cities.crs)

#3 Data Visualization

Once you open a geographical dataset, one of the first things you might want to do is visualize your data on a map.

To plot our data, we used three different kinds of visualization: the most basic is Matplotlib, for static but immediate visualization; then the more sophisticated Python library Geoviews, for interactive and dynamic choropleth maps; and finally Data Studio, to incorporate choropleths in our climate projections dashboard. In the following, we’ll focus only on the first two options, since this article is mainly interested in Python approaches.

The first thing you need to do is check the CRS of your data, especially if you want to plot two overlapped datasets. If they don’t have the same CRS, they won’t line up, and the resulting map would be meaningless.
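A simple defensive check, sketched here with two made-up one-point GeoDataFrames, is to compare the CRSs and reproject one layer onto the other before plotting:

```python
import geopandas as gpd
from shapely.geometry import Point

a = gpd.GeoDataFrame(geometry=[Point(2.35, 48.85)], crs="EPSG:4326")
b = gpd.GeoDataFrame(geometry=[Point(652000, 6862000)], crs="EPSG:2154")

# Align the CRSs before any overlay or joint plot
if a.crs != b.crs:
    b = b.to_crs(a.crs)

assert a.crs == b.crs  # now the two layers will line up
```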

#Matplotlib

Matplotlib functions are integrated both in Geopandas and Xarray, so it is the very first step for visualizing your data in elementary plots.

In addition to the CRS, a couple of things to be careful about, in order not to plot a deformed or flipped map, are the (x,y)/(y,x) or (lon,lat)/(lat,lon) order, which is sometimes transposed, and the spatial reference, especially if you opened your NetCDF with Xarray instead of Rioxarray.

You might also want to keep an eye on the units displayed along the axes of the plot. imshow(), for instance, automatically uses the array dimensions in the plots, and if you want to display your data on the classical (multidimensional) lon-lat grid, you have to do it manually as explained here or here.

plt.imshow(climate['tasAdjust'][0])
plt.show()

If you opened your dataset with Xarray and then plot with imshow, you will get a plot flipped along the y-axis. This is because pixels are indexed starting from the upper-left corner, while on the plot axes, the y values increase from the bottom up.
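One quick way around this, sketched on a toy array, is the origin parameter of imshow (alternatively, Xarray’s own .plot() places pixels by their coordinate values):

```python
import matplotlib.pyplot as plt
import numpy as np

img = np.arange(12).reshape(3, 4)  # pretend row 0 holds the southernmost pixels

fig, (ax0, ax1) = plt.subplots(1, 2)
ax0.imshow(img)                   # row 0 drawn at the top: flipped map
ax1.imshow(img, origin="lower")   # row 0 drawn at the bottom: correct orientation
plt.close(fig)
```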

Before plotting the cities Shapefile, we dropped the overseas French territories, in order to have a more readable plot.

cities = cities.loc[~cities.insee.str.startswith('97')]
cities.plot(column='surf_ha', alpha=1, linewidth=0.5, figsize=(10, 10))
plt.title('French municipalities colored by surface extent')
plt.show()

Fun fact, you can also very easily visualize one Polygon of your GeoDataFrame as in the example below.

cities.loc[0].geometry

#Geoviews

Geoviews is a Python library built on the HoloViews package, with additional geographical tools.

You can quite easily plot your data on fancy interactive and dynamic maps, with the possibility to zoom in and out, to add a background map in different styles (the tile sources), and to add interactive widgets. We mainly plotted choropleths, which are a kind of heatmap where each region is colored according to the value of some metric.

We will first select one department to display, say the Tarn et Garonne department (code 82), and then plot it.

import geoviews as gv
gv.extension('bokeh')

tg = cities[cities['insee'].str.startswith('82')]
tg_cartolight = gv.tile_sources.CartoLight() \
    * gv.Polygons(tg, vdims=[('surf_ha', 'Municipalities surface'), ('nom', 'Municipality')]).options(
        tools=['hover'], width=800, height=500, xaxis=None, yaxis=None,
        hover_color='grey', cmap='YlOrRd', fill_alpha=0.5, fontscale=1.2,
        colorbar=True, line_color='grey', line_alpha=0.3,
        title='Tarn et Garonne municipalities by surface')
tg_cartolight

If your data cover a larger region than the one you are interested in, you might want to clip them: cutting off all the pixels outside the area of interest leads to shorter computation times and more readable plots.

You can clip two Shapefiles, or a Raster file over a Shapefile. Again, it is fundamental that the datasets are in the same CRS, otherwise they don’t line up, resulting in an empty intersection.

Geopandas provides several well-documented methods to clip two Shapefiles.
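As a minimal sketch with made-up geometries, geopandas.clip keeps only the parts of one layer that fall inside the mask layer:

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

pts = gpd.GeoDataFrame(geometry=[Point(0.5, 0.5), Point(3, 3)], crs="EPSG:4326")
mask = gpd.GeoDataFrame(geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])],
                        crs="EPSG:4326")

clipped = gpd.clip(pts, mask)  # only the point inside the unit square survives
print(len(clipped))  # 1
```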

Geo clipping 2D coordinates

Clipping becomes more challenging when you have a netcdf whose coordinates are bi-dimensional. This means that the data are not stored on an evenly spaced grid; instead, the longitude-latitude coordinates change slightly from one pixel to the next.
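A toy example of such a curvilinear grid, where lon and lat are themselves 2D arrays (all values made up for illustration):

```python
import numpy as np
import xarray as xr

# Grid indices (j, i); each pixel's lon-lat shifts slightly along both axes
j, i = np.meshgrid(np.arange(3), np.arange(4), indexing="ij")
lon = 2.0 + 0.5 * i + 0.01 * j
lat = 48.0 + 0.5 * j + 0.01 * i

ds = xr.Dataset(
    {"tas": (("j", "i"), 285.0 + np.zeros((3, 4)))},
    coords={"lon": (("j", "i"), lon), "lat": (("j", "i"), lat)},
)
print(ds["lon"].ndim)  # 2: the coordinates themselves are two-dimensional
```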

The way metadata (variable names, shapes, types, units…) are saved in a nc file is defined by the CF (Climate and Forecast) conventions. According to the latest CF convention, lon-lat coordinates are now provided as 2D arrays, and the clip function of the latest Rioxarray version is not compatible with 2D coordinates.

If you naively try to clip your nc files using the rio.clip function, this is what you’ll get.

As an example, we selected the city of Paris in the cities Shapefile to clip on.

paris = cities[cities.nom == 'Paris']
clip_fail = climate.rio.clip(paris.geometry, all_touched=True, drop=True)
clip_fail

As you can see, in the absence of 1D lon-lat values, Rioxarray automatically uses the indexes of the nc file’s grid (the Dimensions) and crosses them with the lon-lat values of Paris. So the output cell, indexed (2.5, 48.5), is just a pixel that coincidentally falls within the lon-lat extent of Paris.

paris.total_bounds

Be careful because, as you can see, in this case no warning or error is raised!

To overcome this problem, we came up with three alternative solutions.

#1 Use the where() function

The first option uses the where() Xarray function. In this case, we first defined the total bounds of the polygon of interest (Paris in our example).

xmin, xmax = paris.total_bounds[0],paris.total_bounds[2]
ymin, ymax = paris.total_bounds[1],paris.total_bounds[3]

Then, the where function allows selecting only the pixels that meet the conditions imposed on the lon-lat coordinates. Note that the where function keeps the size of the original file and fills all the non-selected pixels with NaNs. To avoid this, you need to add the drop=True argument to the where function.

climate_paris = climate.where((climate.lat <= ymax) & (climate.lat >= ymin) & (climate.lon >= xmin) & (climate.lon <= xmax), drop=True)
climate_paris

Note that with this method, there are four pixels associated with Paris!

#2 Convert to dataframe

The second solution is to convert your nc file to dataframe.

df = climate.to_dataframe()
df = df.drop(['spatial_ref'], axis=1).droplevel(['y','x', 'band']).reset_index().dropna()

Then we use the Point function from Shapely to transform the lon-lat pairs into Point geometric objects, and convert the dataframe into a GeoDataFrame.

from shapely.geometry import Point  # works in both Shapely 1.x and 2.x
points = [Point(xy) for xy in zip(df.lon, df.lat)]
df_geo = gpd.GeoDataFrame(df, geometry=points, crs='epsg:4326')
df_geo = df_geo.reset_index().drop('index', axis=1)

Now we can use one of the spatial join functions from Geopandas.

df_tot = paris.sjoin(df_geo, how='left')
df_tot

This solution is quite straightforward, but it can easily become time-consuming, depending on the size of your datasets! Note that you can easily convert the final GeoDataFrame back into a NetCDF file if you need, or save it as a Shapefile.
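As a sketch of that conversion, a (lat, lon)-indexed dataframe (toy values here) turns back into a gridded Xarray Dataset with to_xarray(); a GeoDataFrame can likewise be written out with to_file():

```python
import pandas as pd

# Toy frame standing in for the joined result: one value per (lat, lon) pair
df = pd.DataFrame({
    "lat": [48.8, 48.8, 48.9, 48.9],
    "lon": [2.3, 2.4, 2.3, 2.4],
    "tas": [285.1, 285.3, 284.9, 285.0],
})
ds = df.set_index(["lat", "lon"]).to_xarray()  # back to a 2x2 grid
print(ds["tas"].shape)  # (2, 2)
# a GeoDataFrame could similarly be saved with gdf.to_file("out.shp")
```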

#3 Convert to GeoDataFrame, lighter version

Our third method consists of defining a function that creates a new coordinate in the Xarray Dataset representing the masked grid after the clip.
In particular, the function takes as inputs a Shapefile, a nc file, the names of the lon-lat coordinates in the nc file, and the name of the Shapefile column you want to export. Then:

  • We first flatten the 2D longitude-latitude coordinates
  • We convert them into Point Shapely objects
  • As in the previous method, we use the sjoin function from Geopandas. The advantage of this approach is a shorter computation time, since the sjoin takes place on a two-column GeoDataFrame the size of the nc grid, and not on the whole dataset, where the size of the nc grid is multiplied by the number of timestamps and the number of variables (e.g. the shape is (19162, 1) here versus (3603280, 5) in the previous method).
  • Then we create a new coordinate (“nom” in the example below) containing the masked grid, with NaN outside the clipped zone.
  • We return the Xarray Dataset with the new masked coordinate that you can easily condition on to plot, convert or process your Xarray.
def clip_2D_netcdf(shapefile, nc, latitude_name="lat", longitude_name="lon", new_coords_column="nom"):
    # Flatten the 2D lon-lat grids into 1D arrays of point coordinates
    x_flat = nc[longitude_name].values.ravel()
    y_flat = nc[latitude_name].values.ravel()
    point_array = gpd.GeoDataFrame(geometry=gpd.points_from_xy(x_flat, y_flat, crs="epsg:4326"))
    # Spatial join: attach the shapefile attribute to each grid point
    geometries_by_point = point_array.sjoin(shapefile, how='left')
    # Reshape back to the grid shape and store it as a new coordinate
    nc.coords[new_coords_column] = (('j', 'i'),
                                    geometries_by_point[new_coords_column].values.reshape(nc[latitude_name].shape))
    return nc
nc_clipped = clip_2D_netcdf(paris, climate, latitude_name="lat", longitude_name="lon", new_coords_column="nom")
nc_clipped

Conclusions

To wrap up, in this article we tried to give some recommendations, based on our experience, on the most ambiguous or problematic points we encountered when we first approached these types of data and formats. We hope you now have fewer doubts about which library to use, less fear about which CRS to choose, and fewer surprises when making plots or clips!

Warm thanks to:
Ismail Erradi, Nancy Ngatcha, Paul Laville, and Jose Sanchez, for collaborating on writing and proofreading this article and for sharing this fun journey into climate data!
