TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Zonal Statistics Algorithm with Python in 4 Steps

4 min readNov 10, 2020

--

Photo by Clay Banks on Unsplash

It is a common need to summarize information from a gridded dataset within an irregularly shaped area. While at first glance this may seem simple, reconciling differences between raster (gridded) and vector (polygon) datatypes can quickly become complicated. This article shows how to implement a zonal statistics algorithm in Python in 4 steps.

  1. Load raster data and vector polygons
  2. Rasterize polygon features
  3. Mask input data to polygon extent
  4. Calculate zonal statistics for the polygon extent

1. Load raster data and vector polygons

Start by importing the necessary Python modules.

Now set the file paths for the raster and vector data and use gdal and ogr to load the raster and vector data, respectively. Access the layer which contains the polygon data from the vector data source that was loaded. Then get the GeoTransform information (positions the gridded raster data and specifies cell sizes) and no data value for the raster data.

We’re also going to do a little more setup with gdal and ogr to prepare for determining the overlap between the raster grid and vector polygons. Let's get gdal and ogr drivers to create temporary raster and vector layers in memory and set the name for a temporary shapefile.

2. Rasterize polygon features

Once the raster and vector data are loaded it’s time to determine where they overlap. This will be the most complicated and intensive part of the algorithm. To start, read the first polygon feature from the vector layer. Then start a while loop that will continue as long as there is another feature in the vector layer. Below is the basic way to set up the loop. This loop will be modified as we continue.

We need to create some functions that will help us create a new raster to store the rasterized polygon features. These functions will make more sense when we call them with actual data. For now, add the function definitions at the top of the script. First we need to convert the convert the coordinates of the bounding box containing a polygon feature to cell coordinates, or offsets, (row and column numbers) that to correspond the input raster. The bounding box of a polygon is the rectangle that contains the entire polygon feature, it consists of minimum and maximum X and Y coordinates.

Next write a function that creates a new GeoTransform with the cell offsets that were calculated in the function above. These two functions will allow us to create a new, smaller raster that only covers the area overlapped by a polygon feature.

Now comes the most complicated part of the algorithm: creating a new empty raster in memory and converting the polygon features to rasters. Follow the comments in code snippet below to understand what is accomplished by each line of code. You may also find it useful to review the GDAL and OGR documentation.

3. Mask input data to polygon zones

At this point we have two concurrent (overlapping) numpy arrays: tr_array and r_array. tr_array has a value of 1 that indicates where the polygon has been rasterized. r_array has the values we want to use to calculate zonal statistics for each polygon.

First, make sure that the input raster exists. Next create a mask array. We want to get data from r_array where it does not contain no-data values and where the corresponding value in tr_array equals 1 (or True).

Now we’ll use maskarray to calculate zonal statistics for each polygon feature.

4. Calculate zonal statistics for each polygon extent

Write one more function that takes a number of values as inputs and creates a dictionary of values. Place this function with the others at the top of the zonal statistics script.

Now, calculate the values from maskarray and pass them to the setFeatureStats function. Only do this if the array exists ( is not None). Otherwise, pass the no data value to set the statistics.

Conclusion

Congratulations! You have created a Python zonal statistics script that can be applied to many different applications. The full script is available below. Using a custom zonal statistics algorithm gives you a lot of flexibility for how you use the results. From this point it’s easy to convert the list of dictionaries to pandas dataframe for analysis, export to a csv, or create new fields in the original shapefile and add the statistics directly to the original data. This can speed up, simplify and customize your workflows.

Originally published at https://opensourceoptions.com on November 10, 2020.

--

--

TDS Archive
TDS Archive

Published in TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.