Perform Zonal Statistics on Climate Data with Python

Lubomir Franko
9 min read · May 8, 2022


How has the climate of your country been changing compared to other countries?

June temperatures are rising across most of the countries in Europe (Image by Author)

Climate change is slowly starting to shape the future of all of us. In the coming decades, we can expect warmer temperatures, longer-lasting droughts, stronger hurricanes and more.

We will not discuss climate change itself in this article, as you can find plenty of discussions on the topic around the internet. Instead, we will explore free data from the European Commission and Copernicus that can help us understand how temperature has been changing over the years.

In this article, we will perform zonal statistics on multiple datasets. We will compute the monthly mean temperature anomaly and create time series over multiple countries. Finally, we will visualise our findings using a heatmap chart.

TL;DR: If you don’t care about obtaining the data and setting up the environment, jump straight to the Perform Zonal Statistics section and enjoy the coding part of the article.

Set Up Environment

Set Up Python Environment

We will start by creating a new Python environment using Anaconda. We will use several geospatial libraries such as xarray, regionmask and geopandas, as well as cdsapi to connect to the Climate Data Store and obtain the data.

conda create -n climate_change python=3.8 pandas scipy cdsapi xarray geopandas cfgrib regionmask

And activate the conda environment:

conda activate climate_change

Set Up CDS API

The next step is to correctly set up the CDS Python API, which we will use to download data from the Climate Data Store. Please proceed according to my last (and first) article.
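In short, the cdsapi client reads a plain-text configuration file named .cdsapirc in your home directory. A minimal sketch (the UID and key below are placeholders; copy the real values from your CDS profile page):

url: https://cds.climate.copernicus.eu/api/v2
key: <your-uid>:<your-api-key>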

Please create a new folder to hold your data and the IPython notebook in which we will perform all the steps below. I created a notebook named climate_change.ipynb, which I will share at the end of this article.

Get the Data That We Use

We will reuse knowledge from my previous article and explore the Climate Data Store once again.

Get Climate Data

Today, we will work with the Essential climate variables for assessment of climate variability dataset from Copernicus. The dataset consists of multiple climate variables such as surface air temperature, surface air relative humidity and precipitation. For each variable, we have a choice of three product types: Anomaly (monthly mean minus climatology), Climatology or Monthly mean.

Due to the large amount of data, we will only compare January and June temperatures over the course of 44 years, from 1979 to 2022 (note that we will have just 43 datasets for June, since this article is being written in May 2022). We should end up with 87 files in total.

Our Python API request will look like this:

import cdsapi

c = cdsapi.Client()
c.retrieve(
    'ecv-for-climate-change',
    {
        'format': 'zip',
        'variable': 'surface_air_temperature',
        'climate_reference_period': '1991_2020',
        'product_type': [
            'anomaly', 'monthly_mean',
        ],
        'time_aggregation': '1_month_mean',
        'year': [
            '1979', '1980', '1981',
            '1982', '1983', '1984',
            '1985', '1986', '1987',
            '1988', '1989', '1990',
            '1991', '1992', '1993',
            '1994', '1995', '1996',
            '1997', '1998', '1999',
            '2000', '2001', '2002',
            '2003', '2004', '2005',
            '2006', '2007', '2008',
            '2009', '2010', '2011',
            '2012', '2013', '2014',
            '2015', '2016', '2017',
            '2018', '2019', '2020',
            '2021', '2022',
        ],
        'month': ['01', '06'],
        'origin': 'era5',
    },
    'download.zip')

Downloading this data can take a while (mine took 1m 24.8s), depending on the number of queries currently queued on the CDS side.

Once the data is downloaded, unzip it to a folder of your choice. As a result, you should have 87 anomaly grib files in your desired folder (plus the monthly mean files, since our request included both product types).
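If you prefer to stay in Python for this step too, a minimal sketch that extracts the archive and counts the anomaly files (the paths are assumptions matching the folder layout used later in this article):

import glob
import zipfile

# Extract the downloaded archive (path and target folder are examples)
with zipfile.ZipFile("download.zip") as zf:
    zf.extractall("data/ecv/temperature")

# Count the anomaly grib files we will work with below
print(len(glob.glob("data/ecv/temperature/*anomaly*.grib")))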

Create/Get Shapefiles

To calculate zonal statistics, we need to define the specific areas over which the statistics will be computed. In our case, those areas are European countries. We will use the NUTS shapefile, a file containing geospatial information about the countries of the European Union.

We will use a shapefile that was already created by the European Commission. However, if you are familiar with GIS systems, you can create one of your own.

Our shapefile dataset, together with its description, is available for download here. We will download it using the following configuration.

Configuration for downloading shapefile used in this article

Save it to your preferred location; I strongly suggest using the same directory as the one where we store the Python notebook.

NUTS is a hierarchical system divided into 3 levels:

  • NUTS 1: major socio-economic regions
  • NUTS 2: basic regions for the application of regional policies
  • NUTS 3: small regions for specific diagnoses

In the main example we will work at the country level (LEVL_CODE 0 in the shapefile), and in the bonus example we will work with NUTS 2 regions within a single country.
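If you are curious how many features sit at each level, a quick sketch to inspect the shapefile (using the path where we will store it below):

import geopandas as gpd

# Count features per NUTS level; LEVL_CODE 0 corresponds to whole countries
gdf = gpd.read_file("data/shapefiles/NUTS_RG_01M_2021_4326.shp")
print(gdf["LEVL_CODE"].value_counts().sort_index())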

This is how our NUTS shapefile looks in QGIS, with some basic symbology added to differentiate between countries:

NUTS dataset visualised in QGIS (Image by Author)

Perform Zonal Statistics

Since we have gathered all the data we need, we can continue with our second-to-last step: performing zonal statistics. What we want to achieve is to compute the mean temperature anomaly for every element (country) in our shapefile, over 44 years for January and 43 years for June. We will perform the computation for the Anomaly product type only (you can do the Monthly mean on your own after reading this article). We will extract this information into a pandas DataFrame, which we will use to visualise our results at the end.

As a bonus, we will select one country and perform zonal statistics on different regions within the country.

Let’s dive into the coding part of this article.

Explore Climate Dataset

import xarray as xr
import glob
import pandas as pd

# Get list of files with anomaly data using glob
grib_files = glob.glob("data/ecv/temperature/*anomaly*.grib")
# Explore dataset
with xr.open_dataset(grib_files[0], engine="cfgrib") as data:
    ds = data
ds

As we can see, every grib file is a 2-dimensional dataset with latitude and longitude dimensions. One way to open multiple files would be to iterate over each grib file and concatenate the results manually.

However, this can be done easily with xarray’s open_mfdataset() function, where we can specify the list of files to open along with the nesting and concatenation dimension. In our case, we want to concatenate the data over valid_time, which we can see under the coordinates of the dataset above. Setting the right arguments for open_mfdataset() lets us open all the files we need straight into a 3-dimensional dataset. The opening can even be done in parallel, on multiple processors (note that parallel=True relies on dask).

# Open a multifile dataset with xarray, concatenating the files
# on the "valid_time" dimension
with xr.open_mfdataset(grib_files, combine="nested", concat_dim="valid_time", parallel=True) as data:
    multi_ds = data
multi_ds
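One caveat worth knowing (my addition, not part of the original notebook): glob does not guarantee chronological file order, so the concatenated valid_time axis may come out unsorted. Sorting it defensively is a one-liner:

# Ensure the time axis is monotonically increasing
multi_ds = multi_ds.sortby("valid_time")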

Explore Shapefile Dataset

import geopandas as gpd
eu_countries_shp = "data/shapefiles/NUTS_RG_01M_2021_4326.shp"
eu_countries_gdf = gpd.read_file(eu_countries_shp)
eu_countries_gdf.head()

Now, let’s select only NUTS level 0 (whole countries), so we won’t perform the calculation on small subregions of each country. We will also reset the index of the dataframe to create a unique numeric value for every country; we will need this later on.

eu_countries_gdf = eu_countries_gdf[eu_countries_gdf["LEVL_CODE"] == 0]
eu_countries_gdf = eu_countries_gdf.reset_index()

Perform Zonal Statistics

To calculate the zonal mean over regions, we must first mask the dataset with our regions using the regionmask library. We will pass in our eu_countries_gdf dataframe (the NUTS shapefile) along with the longitude and latitude from our 3-dimensional xarray dataset, and we will mark each region with a unique number taken from the index column of eu_countries_gdf, which contains a unique integer value for every country.

Then, we will apply the mask and select only a specific month (in this case, June).

import regionmask

# Create a mask of multiple regions from the shapefile
eu_mask = regionmask.mask_3D_geopandas(
    eu_countries_gdf,
    multi_ds.longitude,
    multi_ds.latitude,
    drop=True,
    numbers="index"
)
# Apply the mask to our dataset
multi_ds = multi_ds.where(eu_mask)
# Get the indexes of the months, since we have January and June
# Select June (1: January, 6: June, etc.)
month_idxs = multi_ds.groupby('valid_time.month').groups
ds_1month = multi_ds.isel(valid_time=month_idxs.get(6))
ds_1month
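If you want to double-check which countries actually made it into the mask (drop=True removes regions that do not overlap any grid cell), a small sketch mapping the mask’s region numbers back to country codes:

# The "region" coordinate of the 3D mask holds the numbers we passed
# via numbers="index"; map them back to NUTS country codes
kept_ids = eu_countries_gdf.set_index("index").loc[eu_mask.region.values, "NUTS_ID"]
print(kept_ids.tolist())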

Now we have a masked dataset. In the next step, we will perform the zonal statistics using the groupby and mean methods of the dataset.

grouped_ds = ds_1month.groupby("valid_time.year").mean(["latitude", "longitude"])
grouped_ds

Voila, we have performed zonal statistics on our dataset!
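One methodological note (my addition, not part of the original workflow): a plain mean over latitude and longitude weighs every grid cell equally, although cells shrink towards the pole, which slightly over-weights the northern parts of large countries. If you want an area-weighted zonal mean instead, xarray’s weighted() makes it a small change:

import numpy as np

# On a regular lat-lon grid, cell area is proportional to cos(latitude)
weights = np.cos(np.deg2rad(ds_1month.latitude))

# Weighted spatial mean per time step, then the yearly mean as before
weighted_spatial = ds_1month.weighted(weights).mean(["latitude", "longitude"])
grouped_ds_weighted = weighted_spatial.groupby("valid_time.year").mean()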

Now, let’s convert our results to a pandas dataframe and perform a couple of transformations before visualising our findings:

  1. Join the dataframe with eu_countries_gdf so we can recover NUTS_ID, the string ID of each country
  2. Drop the time column, since in this dataset it duplicates valid_time
  3. Set valid_time as the index and sort the data by it
  4. Pivot the dataframe on the NUTS_ID column
  5. Select the t2m parameter and rename axis 1 to “Country”
  6. Convert the datetime index to integer years
  7. Rename the index to “Year”
df = grouped_ds.to_dataframe()
df = df.reset_index("valid_time").join(eu_countries_gdf.set_index("index")[["NUTS_ID"]])
df = df.drop(columns=["time"])
df = df.set_index("valid_time").sort_index()
df = df.pivot(columns=["NUTS_ID"])
df = df["t2m"]  # Select temperature (useful if the grid holds more parameters, e.g. wind, humidity)
df = df.rename_axis("Country", axis=1)
df.index = [value.year for value in df.index]
df.index.name = 'Year'
df.head()

Now, let’s wrap up everything we did in about 50 lines of code.

import geopandas as gpd
import xarray as xr
import regionmask
import glob
import pandas as pd

def prepare_data(multi_ds, month: int = 1) -> pd.DataFrame:
    # Note: relies on the module-level eu_countries_gdf defined below
    month_idxs = multi_ds.groupby('valid_time.month').groups
    multi_ds = multi_ds.isel(valid_time=month_idxs.get(month))
    multi_ds = multi_ds.groupby("valid_time.year").mean(["latitude", "longitude"])
    df = multi_ds.to_dataframe()
    df = df.reset_index("valid_time").join(eu_countries_gdf.set_index("index")[["NUTS_ID"]])
    df["valid_time"] = df["valid_time"].apply(lambda x: x.year)
    df = df.drop(columns=["time"])
    df = df.set_index("valid_time").sort_index()
    df = df.pivot(columns=["NUTS_ID"])
    df = df["t2m"]
    df = df.rename_axis("Country", axis=1)
    df.index.name = 'Year'
    return df

# Get list of files with anomaly data using glob
grib_files = glob.glob("data/ecv/temperature/*anomaly*.grib")
# Open a multifile dataset with xarray, concatenating on the "valid_time" dimension
with xr.open_mfdataset(grib_files, combine="nested", concat_dim="valid_time", parallel=True) as data:
    multi_ds = data

# Open the shapefile with geopandas and keep country-level features only
eu_countries_shp = "data/shapefiles/NUTS_RG_01M_2021_4326.shp"
eu_countries_gdf = gpd.read_file(eu_countries_shp)
eu_countries_gdf = eu_countries_gdf[eu_countries_gdf["LEVL_CODE"] == 0]
eu_countries_gdf = eu_countries_gdf.reset_index()
# Create a mask of multiple regions from the shapefile
eu_mask = regionmask.mask_3D_geopandas(
    eu_countries_gdf,
    multi_ds.longitude,
    multi_ds.latitude,
    drop=True,
    numbers="index"
)
# Apply the mask to our dataset
multi_ds = multi_ds.where(eu_mask)
january_df = prepare_data(multi_ds, month=1)
june_df = prepare_data(multi_ds, month=6)
january_df.head()
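If you want to keep the results around for later analysis, a quick way is to dump both dataframes to CSV (the file names are just examples):

# Persist the per-country anomaly time series
january_df.to_csv("january_anomalies.csv")
june_df.to_csv("june_anomalies.csv")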

Analyse Findings

We have done quite a lot of work to get the dataframe into this exact shape, but our hard work will be rewarded: having the dataframe in the format seen on the screenshot above makes our data easy to visualise. We will use the seaborn library to create a heatmap.

import matplotlib.pyplot as plt
import seaborn as sns

grid_kws = {"width_ratios": (.95, .03), "hspace": .1}
f, (ax, cbar_ax) = plt.subplots(1, 2, figsize=(16, 12), gridspec_kw=grid_kws)
ax = sns.heatmap(june_df.T, ax=ax,
                 cbar_ax=cbar_ax,
                 square=True, cmap='RdBu_r', vmin=-5, vmax=5, center=0,
                 annot=True, fmt=".0f",
                 cbar_kws={"orientation": "vertical"})
ax.set_title("Monthly Mean Temperature Anomaly in June in Selected European Countries (°C)", size=18, pad=20)
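To export the chart for sharing (the file name is an example):

# Save the heatmap to disk; bbox_inches="tight" trims the extra margins
f.savefig("june_anomaly_heatmap.png", dpi=150, bbox_inches="tight")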

And our result is:

We can clearly see the trend of rising temperatures over most of the European Countries (Image by Author)

Let’s think about the results. The horizontal axis represents years in ascending order, and the vertical axis represents countries labeled by NUTS_ID. The color scale represents the temperature anomaly, with negative values shown in bluish colors and positive values in reddish colors.

From the image above, we can definitely say that in most European countries, Junes are becoming warmer from decade to decade. While between 1980 and 1995 most countries had temperatures below normal, the last couple of years were on the opposite side, much warmer than the climate average (in our case, the climate average is the mean temperature of each grid point over the years 1991–2020).

Let’s see the results for January to determine how the temperature has been changing in this winter month.

The frequency of very cold Januaries is falling (Image by Author)

As we can see in the picture above, the trend is not as clear as it is for June temperatures. However, we can conclude that in the late 80s there was a higher frequency of very cold Januaries than in the 2000s.

Try playing with the dataset on your own and let me know your findings. As a bonus, I also analysed the changing June temperature in the regions of France. Bear in mind that the resolution of the data is 0.25x0.25°, which means it is not suitable for analysing very small regions.
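The bonus analysis only changes the shapefile selection; everything downstream stays the same. A sketch of the filter, assuming the NUTS shapefile’s CNTR_CODE column and NUTS level 2 regions (both are my assumptions about the exact setup):

# Keep only NUTS 2 regions belonging to France
# (reassign eu_countries_gdf to this frame before calling prepare_data)
fr_regions_gdf = gpd.read_file(eu_countries_shp)
fr_regions_gdf = fr_regions_gdf[
    (fr_regions_gdf["CNTR_CODE"] == "FR") & (fr_regions_gdf["LEVL_CODE"] == 2)
].reset_index()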

Even at the regional scale, we can see a rising temperature trend over France. Note the much-above-average temperature in 2003 (Image by Author)

If you liked this article, don’t hesitate to follow my account. Any comments are welcome and I’d be glad to answer any of your questions. Thank you for your attention, and I look forward to exploring more data with you.

Stay Connected

  • Follow me on Medium for more stories like this
  • Connect on LinkedIn
  • Find more datasets about Climate Change on CDS

And also, link to the whole notebook as I promised:
