How to Segment Buildings on Drone Imagery with Fast.ai & Cloud-Native GeoData Tools

Dave Luo
Jul 25, 2019 · 15 min read


An Interactive Intro to Geospatial Deep Learning on Google Colab

by daveluo (on github and elsewhere)

In this post and the accompanying Google Colab notebook, we’ll learn all the code and concepts comprising a complete workflow to automatically detect and delineate building footprints (instance segmentation) from drone imagery with cutting-edge deep learning models.

All you’ll need is a Google account, an internet connection, and a couple of hours to learn how to make the training data and a model that learns to produce something like this:

Building segmentation and classification of completeness in Zanzibar, interactive link: https://alpha.anthropo.co/znz-demo

In modular steps, we’ll learn to…

Preprocess image geoTIFFs and manually labeled data geoJSON files into training data for deep learning:

Input geoTIFF imagery and GeoJSON label files

Create a U-net segmentation model to predict what pixels in an image represent buildings (and building-related features):

Raw segmentation prediction vs actual

Test our model’s performance on unseen imagery with GPU or CPU:

CPU vs GPU inference tile

Post-process raw model outputs into geo-registered building shapes evaluated against ground truth:

Raw model output -> geo-registered building shape predictions -> evaluation against ground truth labels

And along the way, we’ll get familiar with great geospatial data & deep learning tools/resources like:

  • Geopandas: “an open source project to make working with geospatial data in python easier. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types.”
  • Rasterio: “reads and writes geospatial raster datasets”
  • Supermercado: “supercharger for Mercantile” (spherical mercator tile and coordinate utilities)
  • Rio-tiler: “Rasterio plugin to read mercator tiles from Cloud Optimized GeoTIFF dataset”
  • Solaris: “Geospatial Machine Learning Analysis Toolkit” by Cosmiq Works
  • Cloud-Optimized GeoTIFFs (COG): “An imagery format for cloud-native geospatial processing”
  • Spatio-Temporal Asset Catalogs (STAC): “Enabling online search and discovery of geospatial assets”
  • OpenAerialMap: “The open collection of aerial imagery”
  • Fast.ai for geospatial deep learning: “The fastai library simplifies training fast and accurate neural nets using modern best practices” built on the PyTorch framework.

How to get the most out of this tutorial:

The Colab notebook is our main learning resource — working interactively within the GPU-accelerated Colab notebook environment is highly recommended!

Code is organized into modular sections, set up for installation/import of all required dependencies, and executable on either CPU or GPU runtimes (depending on the section). Links to load files generated at each step are also included so you can pick up and start from any section. Inline # comments (& references for further reading) are provided within code cells to explain steps or nuances in more detail as needed. Executing all code cells end-to-end takes <1 hour on GPU.

This Medium post serves as a high-level conceptual walkthrough and maps directly to sections within the Colab notebook. This post works best as a quick overview with handy bookmarks to Colab (see [Colab section link] under each section heading) or viewed side-by-side with the Colab notebook as a code & concept companion set.

This tutorial assumes you have a working knowledge of Python, data analysis with Pandas, making training/validation/test sets for machine learning, and a beginner practitioner’s grasp of deep learning concepts. Or the motivation to gain what knowledge you’re missing by following the ample references linked throughout this post and notebook.

With that as mental prep, let’s do some geospatial deep learning!

Open tutorial notebook and create your own working Colab copy with File > Save a Copy in Drive (recommended option):

Click me to get started!

Or preview notebook non-interactively (not recommended for optimal learning but good for a quick look):

Pre-Processing

Preview and load imagery and labels

[Colab section link]

For this tutorial, we’ll use the Open AI Tanzania Challenge dataset of 7-cm resolution drone imagery and building footprint labels over Unguja Island, Zanzibar, Tanzania. Many thanks to the following organizations for producing, openly licensing, and making this invaluable dataset accessible:

For simplicity of demonstration, we’ll create training and validation data from a single drone image (in cloud-optimized geoTIFF format) and its accompanying ground-truth labels of manually traced building outlines (in GeoJSON format).

We’ll work with imagery and labels from image grid znz001 which covers the northern tip of Zanzibar’s main island of Unguja. Here is a browsable preview of znz001's drone imagery with its accompanying building outline labels, indexed according to the Spatio-Temporal Asset Catalog (STAC) label extension and visualized in an instance of STAC browser:

STAC Browser showing znz001 imagery and labels

After previewing the labeled data and imagery in the browser, we’ll import our geo-processing tools, copy the direct download URLs from the Assets tab of the browser, and test loading them in our notebook.
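To make that loading step concrete, here’s a minimal sketch. The URLs below are placeholders for the actual asset links you’ll copy from the STAC browser (the real ones are in the notebook):

import rasterio
import geopandas as gpd

# placeholder URLs -- copy the real znz001 asset links from the STAC browser
img_url = 'https://example.com/znz001.tif'          # cloud-optimized geoTIFF
labels_url = 'https://example.com/znz001.geojson'   # building outline labels

# rasterio (via GDAL) can read a COG directly over HTTP without downloading it all
with rasterio.open(img_url) as src:
    print(src.crs, src.width, src.height, src.count, src.bounds)

# geopandas reads the GeoJSON labels straight from the URL into a GeoDataFrame
labels_gdf = gpd.read_file(labels_url)
print(len(labels_gdf), 'labeled buildings')
labels_gdf.head()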

Draw train and validation areas of interest (AOIs) with geojson.io

[link to Colab section]

Since we are working with a single image, we need to delineate what sub-areas of the image and labels should be used as training versus validation data for model training.

Using geojson.io, we’ll draw our trn and val Areas of Interest (AOI) polygons in geojson format and add dataset:trn or dataset:val to the respective polygon properties.

The finished polygons will look something like this in geojson.io:

And here is the exact GeoJSON file I created displayed in github gist:

For demonstration of later steps, I intentionally drew a more complex shape for each AOI, but we could have simply drawn adjacent rectangles instead.

Or, in more complex cases, we could choose to draw AOIs of smaller sub-areas that don’t encompass the entire image: for instance, if we want to create training data only for specific types of environments like dense urban or sparsely populated rural areas, or if we want to avoid using poorly labeled areas in our training data.

Drawing the AOIs as geoJSON polygons in this way gives us the flexibility to choose exactly what and where our training and validation data represents.
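As a quick sketch (the AOI filename here is an assumption; use whatever you exported from geojson.io), reading the AOIs back into geopandas and splitting them on that dataset property looks like this:

import geopandas as gpd

# hypothetical filename -- use the GeoJSON file you exported from geojson.io
aoi_gdf = gpd.read_file('znz001_aoi.geojson')

# split on the "dataset" property we set to 'trn' or 'val' in geojson.io
trn_aoi = aoi_gdf[aoi_gdf['dataset'] == 'trn']
val_aoi = aoi_gdf[aoi_gdf['dataset'] == 'val']
print(len(trn_aoi), 'training AOI(s) /', len(val_aoi), 'validation AOI(s)')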

Convert train and validation AOIs to slippy map tile polygons with supermercado and geopandas

[link to Colab section]

In this step, we’ll use supermercado to generate square tile polygons representing all the slippy map tiles at a specified zoom level that overlap the geoJSON training and validation AOIs we created above.

For this tutorial, we’ll work with slippy map tiles of tile_size=256 and zoom_level=19, which yields a manageable number of tiles and satisfactory segmentation results without too much preprocessing or model training time.

You could also try setting a higher or lower zoom_level, which would generate more or fewer tiles at higher or lower resolution respectively.

Here is an example of different tile zoom_levels over the same area of Zanzibar (see the round, white satellite TV dish for a consistently sized visual reference):

Learn more about slippy maps here, here, and here.

Then we’ll merge our supermercado-generated slippy map tile polygons into a GeoDataFrame with geopandas. We’ll also check for and reconcile overlapping train and validation tiles which would otherwise throw off how we evaluate our progress with model training.
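The notebook does this with the supermercado command-line tool; as an illustrative pure-Python alternative, the same tile enumeration can be sketched with mercantile (which supermercado builds on), assuming the trn_aoi and val_aoi GeoDataFrames from the previous step:

import mercantile
import pandas as pd
import geopandas as gpd
from shapely.geometry import shape

zoom_level = 19

def aoi_to_tiles(aoi_gdf, zoom, dataset_name):
    # enumerate every slippy map tile at this zoom touching the AOI's bounding box,
    # then keep only tiles that actually intersect the (possibly irregular) AOI shape
    west, south, east, north = aoi_gdf.total_bounds
    records = []
    for t in mercantile.tiles(west, south, east, north, zoom):
        tile_geom = shape(mercantile.feature(t)['geometry'])   # tile square polygon
        if aoi_gdf.intersects(tile_geom).any():
            records.append({'dataset': dataset_name, 'z': t.z, 'x': t.x, 'y': t.y,
                            'geometry': tile_geom})
    return gpd.GeoDataFrame(records, crs='EPSG:4326')

trn_tiles = aoi_to_tiles(trn_aoi, zoom_level, 'trn')
val_tiles = aoi_to_tiles(val_aoi, zoom_level, 'val')
tiles_gdf = gpd.GeoDataFrame(pd.concat([trn_tiles, val_tiles], ignore_index=True),
                             crs='EPSG:4326')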

Load slippy map tile image from COG with rio-tiler and corresponding label with geopandas

[link to Colab section]

Now we’ll use rio-tiler and the slippy map tile polygons generated by supermercado to test load a single 256x256 pixel tile from our znz001 COG image file. We will also load the znz001 geoJSON labels into a geopandas GeoDataFrame and crop the building geometries to only those that intersect the bounds of the tile image:

Slippy map image tile and building geometries at z=19, x=319380, y=270495
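In sketch form, loading that tile and cropping the labels to its bounds looks roughly like the following. This assumes rio-tiler 2.x’s COGReader (older rio-tiler versions expose a similar tile-reading function under rio_tiler.main) and the img_url and labels_gdf from the earlier sketch:

import numpy as np
import mercantile
from shapely.geometry import box
from rio_tiler.io import COGReader   # rio-tiler 2.x API

z, x, y = 19, 319380, 270495         # the tile shown above

with COGReader(img_url) as cog:                  # img_url: the znz001 COG from earlier
    tile = cog.tile(x, y, z, tilesize=256)       # reads just this tile from the COG
img_arr = np.moveaxis(tile.data, 0, -1)          # (bands, 256, 256) -> (256, 256, bands)

# crop the label geometries to the lon/lat bounds of this tile
tile_bounds = mercantile.bounds(x, y, z)         # (west, south, east, north)
tile_poly = box(*tile_bounds)
tile_labels = labels_gdf[labels_gdf.intersects(tile_poly)]
print(img_arr.shape, '|', len(tile_labels), 'buildings intersect this tile')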

Here is a great introduction to COGs, rio-tiler, and exciting developments in the cloud-native geospatial toolbox by Vincent Sarago of Development Seed:

We’ll then create our corresponding 3-channel RGB mask by passing these cropped geometries to solaris’ df_to_px_mask function (a minimal call is sketched after this list). A pixel value of 255 in the generated mask:

  • in the 1st (Red) channel represents building footprints,
  • in the 2nd (Green) channel represents building boundaries (these appear yellow on the RGB mask display because boundary pixels overlap footprint pixels, and green + red = yellow),
  • and in the 3rd (Blue) channel represents close contact points between adjacent buildings

Make and save all the image and mask tiles

[link to Colab section]

Now that we’ve successfully loaded one tile image from COG with rio-tiler and created its 3-channel RGB mask with solaris, let’s generate our full training and validation datasets. We’ll write some functions and loops to run through all of our trn and val tiles at zoom_level=19 and save them as lossless png files in the appropriate folders with a filename schema of {save_path}/{prefix}{z}_{x}_{y} so we can easily identify and geolocate which tile each file represents.
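The saving loop itself is simple. In sketch form (tile_img_and_mask below is a hypothetical helper wrapping the rio-tiler and solaris calls from the previous two steps, and the folder layout is an assumption):

from pathlib import Path
from PIL import Image

save_path = Path('znz001_data')
for subset, tiles in [('trn', trn_tiles), ('val', val_tiles)]:
    (save_path/subset/'images').mkdir(parents=True, exist_ok=True)
    (save_path/subset/'masks').mkdir(parents=True, exist_ok=True)
    for _, t in tiles.iterrows():
        # hypothetical helper: returns the uint8 image tile and its 3-channel mask
        img_arr, mask_arr = tile_img_and_mask(t.z, t.x, t.y)
        fname = f'znz001{subset}_{t.z}_{t.x}_{t.y}.png'      # {prefix}{z}_{x}_{y}.png
        Image.fromarray(img_arr).save(save_path/subset/'images'/fname)
        Image.fromarray(mask_arr).save(save_path/subset/'masks'/fname)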

Train u-net segmentation model with fastai & pytorch

As our deep learning framework and library of tools, we’ll use the excellent fastai library built on top of PyTorch. For more info:

Download and install fastai, set Colab to GPU runtime if needed

[link to Colab section]

Let’s download, install, and set up fastai v1 (currently at 1.0.55). And if we’re not already on it, let’s reset Colab to a GPU runtime (this removes locally stored files since it switches to a new environment, so you will have to re-download and untar the training dataset created in the steps above):

In Menu bar — Runtime > Change Runtime Type > Hardware accelerator: GPU

Colab’s free GPU varies (e.g. a Tesla K80 or T4) depending on availability. See the ===Hardware=== section of the show_install() output for which GPU type and how much GPU memory is available, which will affect batch size and training time.

For all of these GPUs and memory sizes, a batch size of bs=16 at size=256 should train at <2 mins/epoch without encountering out-of-memory errors, but if one does come up, lower bs to 8.

Set up data

[link to Colab section]

Now we’ll set up our training dataset of tile images and masks created above to load correctly into fastai for training and validation. The code in this step tracks closely with that of fastai course’s lesson3-camvid so please refer to that lesson video and notebook for more detailed and excellent explanation by Jeremy Howard about the code and fastai’s Data Block API.

The main departure from the camvid lesson notebook is the use of filename string parsing to determine which image and mask files comprise the validation data:

# define the validation set by filename prefix
holdout_grids = ['znz001val_']
valid_idx = [i for i,o in enumerate(fnames) if any(c in str(o) for c in holdout_grids)]

And we’ll subclass SegmentationLabelList to alter the behavior of open_mask and the PIL.Image underlying it in order to open the 3-channel target masks as RGB images (convert_mode='RGB') instead of the default greyscale 1-channel images (convert_mode='L').
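A minimal version of that subclass (the class names here are my own, not necessarily the notebook’s) looks like this:

from fastai.vision import SegmentationLabelList, SegmentationItemList, open_mask

class RGBSegmentationLabelList(SegmentationLabelList):
    # open target masks as 3-channel RGB images instead of 1-channel greyscale;
    # div=True scales the 0/255 pixel values down to 0/1
    def open(self, fn):
        return open_mask(fn, div=True, convert_mode='RGB')

class RGBSegmentationItemList(SegmentationItemList):
    _label_cls = RGBSegmentationLabelList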

We’ll also visually confirm that the image files and channels of the respective target mask file are loaded and paired correctly with a display function show_3ch:

Define custom losses and metrics to handle 3-channel target masks

[link to Colab section]

Here we implement some new loss functions like Dice Loss and Focal Loss which have been shown to perform well in image segmentation tasks. Then we’ll create a MultiChComboLoss class to combine multiple loss functions and calculate them across the 3 channels with adjustable weighting.

The approach of combining a Dice or Jaccard loss, which considers image-wide context, with a pixel-focused Binary Cross-Entropy or Focal loss, plus adjustable weighting of the 3 target mask channels, has been shown to consistently outperform single loss functions. This is well documented in Nick Weir’s deep dive into the top-5 results of the recent SpaceNet 4 Off-Nadir Building Detection challenge:

We’ll also adapt our model evaluation metrics (accuracy and dice score) to calculate either a mean score across all channels or for a specified individual channel.
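To make the idea concrete, here is a simplified sketch of a per-channel Dice loss and a multi-channel combo wrapper. It follows the structure described above but is not the notebook’s exact implementation (for instance, the notebook’s MultiChComboLoss also takes a reduction argument):

import torch
import torch.nn as nn

class DiceLoss(nn.Module):
    def __init__(self, eps=1e-7):
        super().__init__()
        self.eps = eps
    def forward(self, logits, targets):
        probs = torch.sigmoid(logits)
        targets = targets.float()
        dims = (0, 2, 3)                              # sum over batch and spatial dims
        intersection = (probs * targets).sum(dims)
        cardinality = probs.sum(dims) + targets.sum(dims)
        dice = (2. * intersection + self.eps) / (cardinality + self.eps)
        return 1. - dice.mean()

class MultiChComboLoss(nn.Module):
    # sum several loss functions over each target mask channel with adjustable weights
    def __init__(self, loss_funcs, loss_wts=None, ch_wts=(1, 1, 1)):
        super().__init__()
        self.loss_funcs = nn.ModuleList(loss_funcs)
        self.loss_wts = loss_wts if loss_wts is not None else [1] * len(loss_funcs)
        self.ch_wts = ch_wts
    def forward(self, logits, targets):
        targets = targets.float()
        total = 0.
        for ch, ch_wt in enumerate(self.ch_wts):
            for loss_fn, loss_wt in zip(self.loss_funcs, self.loss_wts):
                total += ch_wt * loss_wt * loss_fn(logits[:, ch:ch+1], targets[:, ch:ch+1])
        return total / (sum(self.ch_wts) * sum(self.loss_wts))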

Set up model

[link to Colab section]

We’ll set up fastai’s Dynamic Unet model with an ImageNet-pretrained resnet34 encoder. This architecture, inspired by the original U-net, uses by default many advanced deep learning techniques such as:

We’ll define our MultiChComboLoss function as a balanced combination of Focal Loss and Dice Loss and set our accuracy and dice metrics:

# set up metrics
acc_ch0 = partial(acc_thresh_multich, one_ch=0)
dice_ch0 = partial(dice_multich, one_ch=0)
metrics = [acc_thresh_multich, dice_multich, acc_ch0, dice_ch0]

# combo Focal + Dice loss with equal channel weights
learn = unet_learner(data, models.resnet34,
                     model_dir='../../models',
                     metrics=metrics,
                     loss_func=MultiChComboLoss(
                         reduction='mean',
                         loss_funcs=[FocalLoss(gamma=1, alpha=0.95),
                                     DiceLoss()],
                         loss_wts=[1, 1],
                         ch_wts=[1, 1, 1]))

Also note that the metrics displayed during training show the channel-0 (building footprint only) accuracy and dice scores in the right-most two columns, while the first two accuracy and dice columns show the mean of each metric across all 3 channels.

Train model, inspect results, unfreeze & train more, export for inference

[link to Colab section]

First, we’ll fine-tune our Unet on the decoder part only (leaving the weights for the ImageNet-pretrained resnet34 encoder frozen) for some epochs. Then we’ll unfreeze all the trainable weights/layers of our model and train for some more epochs.

We’ll track the valid_loss, acc_..., and dice_... metrics per epoch as training progresses to make sure they continue to improve and we’re not overfitting. And we set a SaveModelCallback which will track the channel-0 dice score, save a model checkpoint each time there’s an improvement, and reload the highest performing model checkpoint file at the end of training.
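In sketch form (epoch counts and learning rates here are illustrative, not the notebook’s exact values, and the monitor name must match the metric name shown in your training table):

from fastai.callbacks import SaveModelCallback

save_cb = SaveModelCallback(learn, monitor='dice_ch0', name='best_unet')

# stage 1: train the decoder with the pretrained encoder frozen
learn.fit_one_cycle(10, max_lr=1e-3, callbacks=[save_cb])

# stage 2: unfreeze everything and fine-tune with discriminative learning rates
learn.unfreeze()
learn.fit_one_cycle(10, max_lr=slice(1e-5, 1e-4), callbacks=[save_cb])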

We’ll also inspect our model’s results by setting learn.model.eval(), generating some batches of predictions on the validation set, calculating and reshaping the image-wise loss values, and sorting by highest loss first to see the worst performing results (as measured by the loss which may differ in surprising ways from visually gauging results).

Pro-tip: display and view your results every chance you get! You’ll pick up on all kinds of interesting clues about your model’s behavior and how to make it better.

Finally, we’ll export our trained Unet segmentation model for inference purposes as a .pkl file. Learn more about exporting fastai models for inference in this tutorial:

Inference on new imagery

[link to Colab section]

With our segmentation model trained and exported for inference use, we will now re-load it as an inference-only model to test on new, unseen imagery. We’ll test the generalizability of our trained segmentation model on tiles from drone imagery captured over another part of Zanzibar and over other parts of the world, as well as at varying zoom_levels (locations and zoom levels indicated):
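The reload-and-predict step is brief. A sketch, assuming the exported file is named znz001_unet.pkl and lives in the current working directory:

from fastai.vision import load_learner, open_image

# inference-only learner: no training data attached, runs on CPU or GPU
inf_learn = load_learner(path='.', file='znz001_unet.pkl')

img = open_image('test_tile_19_319454_270706.png')     # hypothetical tile filename
pred_mask, _, raw_probs = inf_learn.predict(img)        # ImageSegment, class idx, raw output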

We’ll also compare our model inference time per tile on GPU versus CPU:

CPU vs GPU inference time per tile

Post-processing

Predict on a tile, threshold, polygonize, and georegister

[link to Colab section]

For good evaluation of model performance against ground truth, we’ll use another set of labeled data that the model was not trained on. We’ll get this from the larger Zanzibar dataset. Preview the imagery and ground truth labels for znz029 in the STAC browser here:

For demonstration, we’ll use this particular tile at z=19, x=319454, y=270706 from znz029:

Using solaris and geopandas, we’ll convert our model’s prediction, a 3-channel pixel raster output, into a GeoJSON file by (see the code sketch after this list):

  1. thresholding and combining the 3 channels of pixel values in our raw prediction output into a 1-channel binary pixel mask
  2. polygonizing this binary pixel mask into shape vectors representing the predicted footprint of every building
  3. georegistering the x, y display coordinates of these vectorized building shapes into longitude, latitude coordinates
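Here is a hedged sketch of those three steps, picking up the raw_probs output from the inference sketch above. The threshold values and the channel arithmetic (keep footprint pixels, subtract predicted contact pixels so touching buildings separate into individual shapes) are assumptions, and mask_to_poly_geojson returns pixel-space polygons that we then georegister with the tile’s affine transform:

import numpy as np
import torch
import mercantile
import solaris as sol
from rasterio.transform import from_bounds

# 1. threshold and combine the 3 channels into a single binary footprint mask
probs = torch.sigmoid(raw_probs).numpy()        # skip the sigmoid if outputs are already 0-1
binary = ((probs[0] > 0.5) & ~(probs[2] > 0.5)).astype(np.uint8) * 255

# 2. polygonize the binary mask into building shapes (still in pixel coordinates)
pred_gdf = sol.vector.mask.mask_to_poly_geojson(binary, bg_threshold=0, min_area=20)

# 3. georegister: map pixel coords to lon/lat with the tile's affine transform
z, x, y = 19, 319454, 270706
tfm = from_bounds(*mercantile.bounds(x, y, z), 256, 256)
pred_gdf['geometry'] = pred_gdf['geometry'].affine_transform(
    [tfm.a, tfm.b, tfm.d, tfm.e, tfm.c, tfm.f])
pred_gdf.crs = 'EPSG:4326'
pred_gdf.to_file('znz029_tile_preds.geojson', driver='GeoJSON')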

Evaluate prediction against ground truth

[link to Colab section]

Finally, with our georegistered building predictions saved as a GeoJSON file, we can evaluate them against the ground-truth GeoJSON file for the same tile.

We’ll clip the ground truth labels to the bounds of this particular tile and use solaris’s Evaluator to calculate the precision, recall, and F1 score. We will also visualize our predicted buildings (in red) against the ground truth buildings (in blue) in this particular tile:
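A sketch of that evaluation (the file names are assumptions, and the exact Evaluator arguments may differ slightly across solaris versions):

import solaris as sol

evaluator = sol.eval.base.Evaluator('znz029_tile_ground_truth.geojson')   # clipped ground truth
evaluator.load_proposal('znz029_tile_preds.geojson', conf_field_list=[])  # our predictions
results = evaluator.eval_iou(miniou=0.5)      # match predictions to ground truth at IoU >= 0.5
print(results)                                # true/false positive & negative counts,
                                              # from which precision, recall, and F1 follow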

For more information about these common evaluation metrics for models applied to overhead imagery, see the following articles and more by the SpaceNet team:

Ideas to Try for Performance Gains

Congratulations, you did it!

You’ve completed the tutorial and now know how to do everything from producing training data to creating a deep learning model for segmentation to postprocessing and evaluating your model’s performance.

To flex your newfound knowledge and make your model perform potentially much better, try implementing some or all these ideas:

  • Create and use more training data: there are 13 grids’ worth of training data for Zanzibar released as part of the Open AI Tanzania Building Footprint Segmentation Challenge dataset.
  • Change the zoom_level of your training/validation tiles. Better yet, try using tiles across multiple zooms (e.g. z21, z20, z19, z18). Note that with multiple zoom levels over the same imagery, you should be extra careful about overlapping tiles across those different zoom levels. ← test your understanding of slippy map tiles by checking that you understand what I mean here, but feel free to message me for the answer!
  • Change the Unet’s encoder to a bigger or different architecture (e.g. resnet50, resnet101, densenet).
  • Change the combinations, weighting, and hyperparameters of the loss functions. Or implement completely new loss functions like Lovasz Loss.
  • Try different data augmentation combinations and techniques.
  • Train for more epochs and with different learning rate schedules. Try mixed-precision for faster model training.
  • Your idea here.

I look forward to seeing what you discover!

Coming Up

If you liked this tutorial, look forward to next ones which will potentially cover topics like:

  • classifying building completeness (foundation, incomplete, complete)
  • inference on multiple tiles and much larger images
  • working with messy, sparse, imperfect training data
  • model deployment and inference at scale
  • examining data/model biases, considerations of fairness, accountability, transparency, and ethics

Curious about more geospatial deep learning topics? Did I miss something? Share your questions and thoughts in the comments so I can add them into this and next tutorials.

Good luck and happy deep learning!

Acknowledgments and Special Thanks
